Quality Diversity Scenario Generation for Human Robot Interaction
by
Matthew C. Fontaine
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2025 Matthew C. Fontaine
Dedication
“This dissertation is dedicated to my mom, who inspired me to become a scientist.”
Acknowledgements
The journey to this PhD has been a long and arduous one with many unexpected turns. Being accepted to a PhD program
took as long as many take to complete a PhD. I owe this opportunity to my teaching faculty mentors at UCF, Arup
Guha and Dr. Sean Szumlanski, who suggested I return to school for a PhD. I would like to thank my undergraduate
advisor Prof. Glenn A. Martin for his mentorship and for first showing me the problem of scenario generation.
I am grateful for my UCF Programming Team family, who helped me learn how to explore and convey technical
and creative ideas in the context of competitive programming. I would also like to thank Prof. Brian C. Dean for his
mentorship during my time as a USACO coach and for his advice on navigating academia in general. I owe much of my
research creativity to both of these competitive programming experiences.
I thank Prof. Amy K. Hoover, Prof. Lisa B. Soros, and Prof. Julian Togelius for their help in becoming research
active again before applying to PhD programs. I would like to also thank Geeta Chaudhry, Prof. Deeparnab Chakrabarty,
and Prof. Thomas Cormen for their advice on applying to PhD programs, and I would like to thank Prof. Dinesh
Manocha for suggesting I apply to USC for robotics. My research has greatly benefited from my many collaborators at
USC including Prof. Bistra Dilkina, Prof. Heather Culbertson, and Prof. Gaurav Sukhatme.
I want to express my appreciation for all members of the ICAROS lab. Bryon Tjanaka helped broaden the impact of
my research through his development of the pyribs library. Varun Bhatt helped resolve GPU driver issues many times
and made many contributions to the surrogate model papers. Sophie Hsu’s knowledge of human-aware planners
was invaluable to our Overcooked work. Heramb Nemlekar helped make the robotics experiments possible. Yulun
Zhang did excellent work as an undergraduate researcher and has gone on to do amazing work as a PhD student at
Carnegie Mellon University. I greatly appreciate all our late night discussions during paper deadlines.
Finally, I would like to thank my advisor Prof. Stefanos Nikolaidis for relentlessly guiding the ICAROS lab
through the pandemic, being a supportive and enthusiastic advisor, and for our many discussions on quality diversity
optimization.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scenario Generation in Shared Autonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Addressing QD Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Addressing Realism and Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Addressing Evaluation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Shared Autonomy: A Case Study in Quality Diversity Scenario Generation . . . . . . . . . . . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Automatic Scenario Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Quality Diversity and MAP-Elites. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 CMA-ES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.4 Coverage-Driven Testing in HRI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.5 Shared Autonomy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.6 Shared Autonomy via Hindsight Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Scenario Generation with MAP-Elites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Scenario Generation with CMA-ME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Generating Scenarios in Shared Autonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.1 Scenario Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.2 Assessment Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.3 Behavior Characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.1 Independent Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.2 Algorithm Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.4 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7.6 Interpreting the Archives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Varying Grasp Poses of Goal Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.9 Comparing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.10.1 Experimental Findings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.10.2 Stochasticity in Scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.10.3 Limitations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.10.4 Implications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 3: Quality Diversity Algorithms as Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Covariance Matrix Adaptation MAP-Elites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.2.1 Quality Diversity (QD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.2.2 MAP-Elites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1.2.3 CMA-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.3 Approach: The CMA-ME Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1.3.1 CMA-ME Emitters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.4 Toy Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.4.1 Distorted Behavior Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.1.5 Hearthstone Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1.5.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.5.2 Search Parameters and Tuning: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.5.3 Distributed Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Differentiable Quality Diversity Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2.5 Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.6.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.8 Societal Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.9 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3 Soft Archives for Quality Diversity Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.4 Proposed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.5 Theoretical Properties of CMA-MAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.6.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.3.7 On the Robustness of CMA-MAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.3.8 Derivation of the Conversion Formula for the Archive Learning Rate . . . . . . . . . . . . . . 92
3.3.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.3.10 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Chapter 4: Searching Generative Models of Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3 Mario Scene Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Chapter 5: Constraining Scenarios via Mixed Integer Programming Repair . . . . . . . . . . . . . . . . . . . 111
5.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.1 Deep Convolutional GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Mixed Integer Linear Program Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 Edit Distance Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.6 End-to-End Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.7 Beyond Zelda Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.8 Generality and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Chapter 6: A General Framework for Searching Over Complex Environments . . . . . . . . . . . . . . . . . 126
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.2.1 Human-Aware Planning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.2.2 Procedural Content Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2.3 Overcooked. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3.1 Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3.2 Deep Convolutional GAN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3.3 Mixed-Integer Program Repair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3.3.1 Mixed-Integer Linear Program Formulation. . . . . . . . . . . . . . . . . . . . . . 133
6.3.3.2 Solvability Constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.3.3 Objective. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.3.3.4 MIP Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3.4 Latent Space Illumination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3.4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3.4.2 CMA-ME for Latent Space Illumination . . . . . . . . . . . . . . . . . . . . . . . 138
6.4 Planning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.4.1 Human-Aware Planning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4.1.1 Robot Planner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4.1.2 Human Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5.1 Workload Distributions with Centralized Planning. . . . . . . . . . . . . . . . . . . . . . . . 141
6.5.2 Workload Distributions with Human-Aware Planning . . . . . . . . . . . . . . . . . . . . . . 142
6.5.2.1 Maximizing Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.5.2.2 Minimizing Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.5.3 Team Fluency with Human-Aware Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.6 Robustness to Human Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.7 User Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.7.1 Procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7.2 Participants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7.3 Hypotheses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7.4 Dependent Measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.7.5 Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.8 Discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.8.1 Limitations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.8.2 Implications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Chapter 7: Sample Efficient Scenario Generation via Deep Surrogate Models . . . . . . . . . . . . . . . . . . 149
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.1 Quality diversity (QD) optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.2 QD for environment generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.1 Quality diversity (QD) optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.2 Automatic environment generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.4 Deep Surrogate Assisted Generation of Environments (DSAGE) . . . . . . . . . . . . . . . . . . . . 153
7.4.1 Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.4.2 Self-supervised prediction of ancillary agent behavior data. . . . . . . . . . . . . . . . . . . . 156
7.5 Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.5.1 Maze. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5.2 Mario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.6.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.6.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.6.3.1 Inclusion of ancillary data prediction. . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.6.3.2 Method of selecting solutions from the surrogate archive. . . . . . . . . . . . . . . 161
7.6.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.7 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Chapter 8: Quality Diversity Scenario Generation for Complex Human-Robot Interaction . . . . . . . . . . . 164
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3.1 Scenario Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3.2 QD Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4 Surrogate Assisted Scenario Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4.1 Surrogate Models for Human-Robot Interaction. . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4.2 Scenario Repair via Mixed Integer Programming. . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4.3 Objective Regularization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.4.4 DQD with Surrogate Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.4.5 Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.5 Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.5.1 Shared Control Teleoperation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.5.2 Shared Workspace Collaboration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.6.1 Real World Demo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.7.1 Limitations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.7.2 Implications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Chapter 9: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
List of Tables
2.1 Results: Percentage of cells covered (coverage), percentage of cells covered that have maximum
assessment value (timeout) and QD-Score after 10,000 evaluations, averaged over 5 trials. . . . . . . 23
3.1 Sphere Function Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Rastrigin Function Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3 Hearthstone Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Mean QD-score and coverage values after 10,000 iterations for each algorithm per domain. . . . . . . 66
3.5 Mean QD-score and coverage values after 10,000 iterations for each QD algorithm per domain. . . . . 86
3.6 Mean QD-score and coverage values after 10,000 iterations for each DQD algorithm per domain. . . . 87
3.7 Mean QD metrics after 10,000 iterations for CMA-MAE at different learning rates. . . . . . . . . . . 91
4.1 Results: Average percentage of cells with fitness 1.0 (Valid / All), percentage of cells found (Coverage),
percentage of cells found with fitness 1.0 (Valid / Found), and QD-score after 10,000 evaluations. . . 107
5.1 The visualization for the tiles in Zelda. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2 Percentage of generated playable and unique levels with each technique. . . . . . . . . . . . . . . . . 118
6.1 Spearman’s rank-order correlation coefficients between the computed BCs and the initial placement of
environments in the archive for increasing levels of noise ϵ in human inputs. . . . . . . . . . . . . . . 145
7.1 Number of evaluations required to reach a QD-Score of 10480.8 in the Maze domain and 1306.11 in
the Mario domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2 Mean absolute error of the objective and measure predictions by the surrogate models. . . . . . . . . 160
List of Figures
1.1 Example failure scenario in shared autonomy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 An example archive of solutions returned by the quality diversity algorithm MAP-Elites. The solutions
in red indicate scenarios where the robot fails to reach the desired user goal in a simulated shared
autonomy manipulation task. The scenarios vary with the environment (y-axis: distance between the
two candidate goals) and human inputs (x-axis: variation from optimal path). The axes units are in meters. 7
2.2 Single objective optimization algorithms aim to converge to an extreme point (e.g. a maximum or
minimum) for a given objective function (the arrow represents an objective function). Multi-objective
optimization algorithms search for extreme points across two or more competing objectives (each arrow
represents an objective). Multi-objective algorithms report Pareto-optimal solutions representing the
trade-off in optimizing the competing objective functions. Diversity-driven algorithms treat functions
as measures instead of objectives (each double arrow represents a measure function). The goal of
a diversity-driven algorithm is to find a solution for each combination of possible outputs from the
measure functions. QD algorithms allow for a single objective function and multiple measure functions
(the y-axis arrow represents an objective and the x-axis double arrow represents a measure). The goal is
to find a solution for each combination of outputs from the measure functions while also maximizing
the objective function for each combination. The blue shading shows the desired result for each class of
optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Example archives returned by the three algorithms for the three behavior spaces of Table 2.1. (Top)
BC1 & BC3 (Middle) BC1 & BC2, 2 goals (Bottom) BC1 & BC2, 3 goals. The colors of the cells in
the archives represent time to task completion in seconds. The axes units are in meters. . . . . . . . . 24
2.4 QD-Score over evaluations for each algorithm and behavior space of Table 2.1. . . . . . . . . . . . . 25
2.5 Coverage over evaluations for each algorithm and behavior space of Table 2.1. . . . . . . . . . . . . . 25
2.6 Distribution of cells explored for random search, MAP-Elites and CMA-ME. The cell colors represent
frequency counts. The axes units are in meters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 (Left) The robot fails to reach the user’s goal gH because of the large deviation in human inputs from
the optimal path. The waypoints of the human inputs are indicated with green color. (Center) We show
for comparison how the robot would act if human deviation was 0 (optimal human). (Right) The robot
fails to reach the user’s goal gH (bottle furthest away from the robot), even though the human provides
a near optimal input trajectory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Example failure scenario for n = 3 goals. Having two alternative goal objects instead of one increases
the probability of failure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.9 (a) Archive generated with MAP-Elites for the horizontal distance and angular difference BCs. (b)
Execution trace of scenario annotated with a black circle in the archive. While the angle difference is
small, the robot fails to reach the user’s goal since it deviates from the optimal path while rotating. (c)
Execution trace of the same scenario if angles of both target grasps were set to 0. The robot succeeds
in reaching the user’s goal. (d) Execution trace after setting the grasp angle of the wrong goal to 0,
resulting in large angle difference between the two target grasps. The large angle difference helps with
inference early on and eventually guides the robot towards the correct goal. The y-axis units are in
meters and the x-axis units in radians. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.10 Archives generated with MAP-Elites for the policy blending and hindsight optimization algorithms.
The colors of the cells in the archives represent time to task completion in seconds. The axes units are
in meters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.11 Scenarios where the policy blending algorithm results in collision with an obstacle, approximated by a
sphere. (Left) While the human and robot trajectories are each collision-free, blending the two results
in a collision when they point towards opposite sides of the obstacle. (Right) Blending with a very noisy
human input results in collision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.12 We reproduce the generated scenarios in the real world with actual joystick inputs. . . . . . . . . . . . 35
3.1 Comparing Hearthstone Archives. Sample archives for both MAP-Elites and CMA-ME from the
Hearthstone experiment. Our new method, CMA-ME, both fills more cells in behavior space and finds
higher quality policies to play Hearthstone than MAP-Elites. Each grid cell is an elite (high performing
policy) and the intensity value represent the win rate across 200 games against difficult opponents. . . 40
3.2 A Bates distribution demonstrating the narrowing property of behavior spaces formed by a linear
projection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Improvement in QD-Score over evaluations for the Sphere Function n = 100. . . . . . . . . . . . . . 51
3.4 Improvement in QD-Score over evaluations for the Rastrigin Function n = 100. . . . . . . . . . . . . 51
3.5 The distribution of elites scaled by the number of occupied cells to show the relative makeup of elites
within each archive for the Sphere Function n = 100. . . . . . . . . . . . . . . . . . . . . . 52
3.6 The distribution of elites scaled by the number of occupied cells to show the relative makeup of elites
within each archive for the Rastrigin Function n = 100. . . . . . . . . . . . . . . . . . . . 52
3.7 Hearthstone Results The distribution of elites by win rate. Each distribution is scaled by the number of
occupied cells to show the relative makeup of elites within each archive. . . . . . . . . . . . . . . . . 53
3.8 Hearthstone Results Improvement in QD-Score over evaluations. . . . . . . . . . . . . . . . . . . . . 54
3.9 Hearthstone Results Improvement in win rate over evaluations. . . . . . . . . . . . . . . . . . . . . . 54
3.10 An overview of the Covariance Matrix Adaptation MAP-Elites via a Gradient Arborescence (CMA-MEGA) algorithm. The algorithm leverages a gradient arborescence to branch in objective-measure
space, while dynamically adapting the gradient steps to maximize a QD objective (Eq. 3.5). . . . . . . 59
3.11 QD-Score plot with 95% confidence intervals and heatmaps of generated archives by CMA-MEGA
(Adam) and the strongest derivative-free competitor for the linear projection sphere (top), arm repertoire
(middle), and latent space illumination (bottom) domains. . . . . . . . . . . . . . . . . . . . . . . . 68
3.12 Result of latent space illumination for the objective prompt “Elon Musk with short hair.” and for the
measure prompts “A person with red hair.” and “A man with blue eyes.”. The axes values indicate the
score returned by the CLIP model, where lower score indicates a better match. . . . . . . . . . . . . . 69
3.13 An example of how different α values affect the function f − f_A optimized by CMA-MAE after a fixed
number of iterations. Here f is a bimodal objective where mode X is harder to optimize than mode
Y, requiring more optimization steps, and modes X and Y are separated by measure m_1. For α = 0,
the objective f is equivalent to f − f_A, as f_A remains constant. For larger values of α, CMA-MAE
discounts region Y in favor of prioritizing the optimization of region X. . . . . . . . . . . . . . . . . 74
3.14 Our proposed CMA-MAE algorithm smoothly blends between the behavior of CMA-ES and CMA-ME
via an archive learning rate α. Each heatmap visualizes an archive of solutions across a 2D measure
space, where the color of each cell represents the objective value of the solution. . . . . . . . . . . . . 77
3.15 QD-score plot with 95% confidence intervals and heatmaps of generated archives by CMA-MAE and
CMA-ME for the linear projection sphere (top), plateau (middle), and arm repertoire (bottom) domains.
Each heatmap visualizes an archive of solutions across a 2D measure space. . . . . . . . . . . . . . . 88
3.16 A latent space illumination collage for the objective “A photo of the face of Tom Cruise.” with hair
length and age measures sampled from a final CMA-MAEGA archive for the LSI (StyleGAN2) domain. 89
3.17 Final QD-score of each algorithm for 25 different archive resolutions. . . . . . . . . . . . . . . . . . 91
4.1 Mario scenes returned by the CMA-ME quality diversity algorithm, as they cover the designer-specified
space of two level mechanics: number of enemies and number of tiles above a given height. The color
shows the percentage of the level completed by an A* agent, with red indicating full completion. . . . 100
4.2 Ground truth scenes 1 (left) and 2 (right) for KL-divergence metric. . . . . . . . . . . . . . . . . . . 103
4.3 QD-Scores over time for each behavioral characteristic. . . . . . . . . . . . . . . . . . . . . . . . . 107
4.4 Archive for the KL-divergence behavioral characteristic metric. . . . . . . . . . . . . . . . . . . . . . 107
4.5 Generated scenes using CMA-ME for small and large values of sky tiles and number of enemies. . . . 108
4.6 Playable scenes with minimum (left) and maximum (right) sum value (6) of the 8 binary agent-based BCs. . . 108
4.7 Generated scenes for small and large values of KL-divergence to each of the two groundtruth scenes. . 109
5.1 An overview of our framework for generating aesthetically pleasing, playable game levels by using a
mixed-integer linear program to repair GAN-generated levels. . . . . . . . . . . . . . . . . . . . . . 112
5.2 Example generated Zelda levels of different techniques. The GAN+MIP framework repairs the
GAN-generated levels rendering them playable, while capturing the spatial relationships between tiles
exhibited in the human-authored levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3 DCGAN network for learning the distribution of Zelda game levels. . . . . . . . . . . . . . . . . . . 115
5.4 The distribution of the average Hamming (left) and edit (right) distance between levels from the same set. . . 118
5.5 (a-c) Distribution of different game tiles for the human examples, the levels generated by the GAN and
the levels generated by the GAN and edited with the MIP solver. (d) Distribution of minimum paths
from key to door. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.6 Edit distance example. (Left) An unplayable generated level by the GAN network. (Center) The output
of the MIP solver that minimizes the Hamming distance to the input level. (Right) The output of the
MIP solver that minimizes the edit distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.7 GAN-generated Pac-Man level (left) and the same level repaired by the MIP solver (right). . . . . . . 122
6.1 An overview of the framework for procedurally generating environments that are stylistically similar to
human-authored environments. Our environment generation pipeline enables the efficient exploration
of the space of possible environments to procedurally discover environments that differ based on
provided metric functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Overcooked environment, with instructions for how to cook and deliver a soup. . . . . . . . . . . . . 131
6.3 Example Overcooked environments authored by different methods. The environments generated with
the GAN+MIP approach are solvable by the human-robot team, while having design similarity to the
human-authored environments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.4 Architecture of the GAN network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.5 Human subtask state machine. The first element in the tuple is the object held by the simulated human;
the second element is the subtask the human aims to complete. . . . . . . . . . . . . . . . . . . . . . 139
6.6 Archive of environments with different workload distributions for the centralized planning agents and
four example environments corresponding to different cells in the archive. Environments (1,2) resulted
in uneven workload distributions, while environments (3,4) resulted in even workload distributions. We
annotate four environments from the archive. The bar shows the normalized value of the objective f. . 142
6.7 Archive of environments with different workload distributions of a QMDP planned robot and a
simulated myopic human. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.8 Archive of environments when attempting to minimize the performance of a QMDP robot (green agent)
and a simulated myopic human (blue agent). Lighter color indicates lower performance. (1a) and (1b)
show successive frame sequences for environment (1), and similarly (2a), (2b) for environment (2). . . 144
6.9 Archive of environments with different team fluency metrics. Environments (1) and (2) resulted in low
team fluency, while (3) and (4) resulted in high team fluency. . . . . . . . . . . . . . . . . . . . . . . 145
7.1 An overview of the Deep Surrogate Assisted Generation of Environments (DSAGE) algorithm. The
algorithm begins by generating and evaluating random environments to initialize the dataset and the
surrogate model (not shown in the figure). An archive of solutions is generated by exploiting a deep
surrogate model (blue arrows) with a QD optimizer, e.g., CMA-ME [91]. A subset of solutions from
this archive are chosen by downsampling and evaluated by generating the corresponding environment
and simulating an agent (red arrows). The surrogate model is then trained on the data from the
simulations (yellow arrows). While the images show Mario levels, the algorithm structure is similar
for mazes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.2 QD-Score and archive coverage attained by baseline QD algorithms and DSAGE in the Maze and
Mario domains over 5 trials. Tables and plots show mean and standard error of the mean. . . . . . . . 158
7.3 Archive and levels generated by DSAGE in the Maze domain. The agent’s initial position is shown as
an orange triangle, while the goal is a green square. . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.4 Archive and levels generated by DSAGE in the Mario domain. Each level shows the path Mario takes,
starting on the left of the level and finishing on the right. . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1 Example scenario in a collaborative package labeling task found by our proposed surrogate assisted
scenario generation framework. The presence of the two objects behind the robot results in its expected
cost-minimizing policy to move towards the object in the front, resulting in a conflict with the user who
is reaching the object at the same time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.2 An overview of our proposed differentiable surrogate assisted scenario generation (DSAS) algorithm
for HRI tasks. The algorithm runs in two stages: an inner loop to exploit a surrogate model of the
human and the robot behavior (red arrows) and an outer loop to evaluate candidate scenarios and add
them to a dataset (blue arrows). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3 QD-Score attained in the three domains as a function of the number of evaluations (top) and the
wall-clock time (bottom). Algorithms with surrogate models have better sample efficiency but require
more wall-clock time per evaluation compared to other algorithms due to the overhead of model
evaluations and model training. Plots show the mean and standard error of the mean. . . . . . . . . . 172
8.4 Comparison of the final archive heatmaps in the shared control teleoperation domain. . . . . . . . . . 173
8.5 Example scenarios recreated with a real robot. The purple line shows the simulated human path. Videos
of the simulated and recreated scenarios are included in the supplemental material. . . . . . . . . . . 176
Abstract
As robots become more complex and start to enter our daily lives, the interaction between humans and robots will also
become more complex. Designers of robotic systems will struggle to anticipate how a robot will act in different
environments and with different users. The de facto standard for evaluating human-robot interaction has been human
subjects experiments. However, under typical time and resource constraints, research labs can run at most hundreds of
trials when evaluating a new system, limiting the variety of interaction scenarios each experiment can cover. As a
complement to human subjects experiments, this dissertation proposes algorithmically generating scenarios to evaluate
human-robot interaction systems, where a scenario consists of an environment and a simulated human. An environment
consists of the initial object locations in a scene and the initial configuration of a robot, while we represent the simulated
human as a parameterized agent that produces actions humans may take. The HRI field evaluates algorithms as
closed-loop systems, where human behavior affects the robot's actions and vice-versa. For generality, my proposed
systems treat the robotic system as a black box and search over the scenario parameters to automatically find novel
failures. By thoroughly evaluating such algorithms in simulation, researchers can scale to more complex HRI
algorithms, and industry will be able to better understand the capabilities of proposed HRI systems.
Chapter 1
Introduction
1.1 Motivation
As robots become more complex and start to enter our daily lives, the interaction between humans and robots will also
become more complex. Designers of robotic systems will struggle to anticipate how a robot will act in different
environments and with different users. The de facto standard for evaluating human-robot interaction has been human
subjects experiments. However, under typical time and resource constraints, research labs can run at most hundreds of
trials when evaluating a new system, limiting the variety of interaction scenarios each experiment can cover. As a
complement to human subjects experiments, my work proposes algorithmically generating scenarios to evaluate
human-robot interaction systems, where a scenario consists of an environment and a simulated human. An environment
consists of the initial object locations in a scene and the initial configuration of a robot, while we represent the simulated
human as a parameterized agent that produces actions humans may take. The HRI field evaluates algorithms as
closed-loop systems, where human behavior affects the robot's actions and vice-versa. For generality, my proposed
systems treat the robotic system as a black box and search over the scenario parameters to automatically find novel
failures. By thoroughly evaluating such algorithms in simulation, researchers can scale to more complex HRI
algorithms, and industry will be able to better understand the capabilities of proposed HRI systems.
1.2 Scenario Generation in Shared Autonomy
To understand what constitutes a good scenario generation system in human-robot interaction, I first studied scenario
generation for shared autonomy systems [85, 90]. In shared autonomy, a human user desires to grasp one of several
goal objects, where a robotic system observes a human’s actions, infers the human’s intent, and assisted the human
in grasping the object. We selected the “policy blending” [69] and “hindsight optimization” [148] algorithms for
evaluation. We first considered framing the problem of scenario generation as a optimization problem of maximizing
failure, where the objective in shared autonomy was to maximize task completion time. However, the single-objective
optimization formulation has two limitations. First, the system results in poor behavioral coverage of the robotic system.
Second, the optimizer found trivial failures such as humans behaving erractically or hindering intent inference by
placing two objects close together.
Figure 1.1: Example failure scenario in shared autonomy.
To obtain better coverage, we instead framed the problem as a quality diversity (QD) problem [229, 38]. Like
single-objective optimization, QD consists of a single objective f : R^n → R that maps the scenario parameters to a
scalar to be maximized. However, QD also consists of k measure functions m_i : R^n → R, written jointly as a vector
function m : R^n → R^k. These functions form a mapping to the measure space S = m(R^n). A QD algorithm searches
for a collection of solutions that is diverse in S, where each solution also maximizes the objective f. In shared autonomy,
we specify the distance between objects and the observed human rationality as measures, and the QD algorithm
MAP-Elites [202] searches over scenario parameters to produce a diverse collection of failure scenarios. MAP-Elites
discovered expected failures, such as the robot approaching the wrong goal due to erratic human actions. However, the
system also discovered surprising failures, such as placing two objects in a column while the user acts nearly
optimally (Fig. 1.1).
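For concreteness, the QD objective can be stated over a tessellation of the measure space (this summary follows the
archive-based view used by MAP-Elites; the notation is mine): partition S into M cells S_1, ..., S_M and solve

\max_{\theta_1, \ldots, \theta_M} \; \sum_{j=1}^{M} f(\theta_j) \quad \text{subject to} \quad m(\theta_j) \in S_j, \; j = 1, \ldots, M,

where an empty cell contributes zero to the sum. This sum corresponds to the QD-score metric reported in the
experiments of later chapters.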
This early scenario generation system highlighted several limitations that needed to be addressed in future systems.
(1) Being based on genetic algorithms, early QD algorithms were not efficient optimizers. (2) Searching directly over
scenario parameters fails to produce realistic scenarios. (3) Complex HRI scenarios have complex constraints that define
a valid scenario. (4) Evaluation of shared autonomy systems takes seconds, but more complex HRI settings require
minutes of evaluation. My following work targets addressing each of these issues to improve QD scenario generation.
1.3 Addressing QD Efficiency
While the MAP-Elites algorithm solves the coverage problem for scenario generation, the algorithm is non-adaptive,
because it perturbs existing solutions with Gaussian noise. To improve the efficiency of QD optimization, I proposed
combining CMA-ES [123] and MAP-Elites into a new algorithm, CMA-ME [91]. By maintaining a full-rank multivariate
Gaussian, CMA-ES estimates the natural gradient of the objective f. We showed that CMA-ES could instead follow
the natural gradient of the QD objective by maximizing improvement to the MAP-Elites archive, while still obtaining
the benefits of adaptive optimization. Overall, CMA-ME more than doubles the performance of MAP-Elites
across standard QD metrics.
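The ranking signal that CMA-ME optimizes is the improvement a candidate contributes to the archive. A minimal
Python sketch of that computation (the data structures and names here are mine, not the dissertation's or the pyribs
implementation):

def archive_improvement(archive, cell_of, solution, objective, measures):
    # archive maps a cell index to (elite_solution, elite_objective);
    # cell_of maps a measure vector to its cell index in the tessellation.
    cell = cell_of(measures)
    if cell not in archive:
        archive[cell] = (solution, objective)
        return objective  # discovered a new cell
    _, elite_objective = archive[cell]
    if objective > elite_objective:
        archive[cell] = (solution, objective)
        return objective - elite_objective  # improved an existing elite
    return 0.0  # no improvement

CMA-ME samples a batch of candidates from its multivariate Gaussian, ranks them by this improvement value (the
full CMA-ME ranking additionally prioritizes candidates that discover new cells over those that only improve existing
ones), and feeds the ranking into the standard CMA-ES mean and covariance updates, so the search distribution adapts
toward regions of scenario space that keep adding or improving archive cells.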
Later, we noticed that all QD algorithms were based on derivative-free optimization techniques. To address this,
we formalized the differentiable quality diversity (DQD) problem [87], which assumes that f and the m_i are first-order
differentiable functions, and proposed the first DQD algorithms. We showed that DQD algorithms could efficiently
search the latent space of StyleGAN [158] for images of celebrities that vary in age or hair length, guided by OpenAI's
CLIP model [231].
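The branching step that gives DQD algorithms such as CMA-MEGA their "gradient arborescence" name combines the
objective and measure gradients into many candidate steps. A minimal numpy sketch (function name and shapes are
mine):

import numpy as np

def branch_candidates(theta, grad_f, grad_m, coeff_samples):
    # theta:         current search point, shape (n,)
    # grad_f:        objective gradient at theta, shape (n,)
    # grad_m:        measure gradients at theta, shape (k, n)
    # coeff_samples: sampled coefficient vectors, shape (batch, k + 1)
    # Each candidate steps along a different linear combination of the objective
    # and measure gradients, spreading the batch out in objective-measure space
    # rather than along a single ascent direction.
    gradients = np.vstack([grad_f[None, :], grad_m])   # (k + 1, n)
    return theta[None, :] + coeff_samples @ gradients  # (batch, n)

In CMA-MEGA, the coefficient samples come from a Gaussian whose parameters are adapted by CMA-ES according to
the archive improvement of each branched candidate.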
1.4 Addressing Realism and Constraints
To scale to more complex scenarios, generated scenarios need to satisfy complex validity constraints. Imagine an HRI
task of cooking with a robot. Kitchen environments are filled with many key objects that both the human and robot must be
able to reach for a cooking scenario to be valid. The generative space of all valid kitchen environments is also vast, and
many possible configurations do not resemble human-designed kitchens. For example, it would be odd to place the sink
on top of the refrigerator.
First, we looked at addressing the validity constraints of a scenario [291]. Prior work [251, 253] from procedural
content generation showed that game levels could be generated to satisfy formal constraints on playability and desired
characteristics. However, we need to be able to search scenario parameters directly based on feedback from the robotic
system. We proposed a generate-then-repair approach to producing game levels. Our work showed that the constraints
could be specified as a mixed integer program (MIP). We modelled reachability constraints as a network flow problem
between key objects and specified an objective, formulated as a minimum cost flow problem, that minimizes the edit
distance between an invalid scenario and the nearest valid scenario.
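As an illustration of the generate-then-repair idea, the sketch below states a toy version of the repair MIP with the
open-source PuLP modeling library; the tile names, count constraints, and Hamming-distance objective are placeholders,
and the full formulation in Chapter 5 additionally encodes reachability between key objects as network-flow constraints.

import pulp

def repair_level(broken, tile_types, num_rows, num_cols):
    # broken[i][j] is the tile the generator placed at cell (i, j).
    prob = pulp.LpProblem("level_repair", pulp.LpMinimize)
    # x[i, j, t] = 1 if cell (i, j) is assigned tile type t.
    x = {(i, j, t): pulp.LpVariable(f"x_{i}_{j}_{t}", cat="Binary")
         for i in range(num_rows) for j in range(num_cols) for t in tile_types}
    # Each cell holds exactly one tile.
    for i in range(num_rows):
        for j in range(num_cols):
            prob += pulp.lpSum(x[i, j, t] for t in tile_types) == 1
    # Example validity constraints: exactly one key and exactly one door.
    for required in ("key", "door"):
        prob += pulp.lpSum(x[i, j, required]
                           for i in range(num_rows)
                           for j in range(num_cols)) == 1
    # Objective: minimize the number of cells changed from the generated level
    # (Hamming distance to the broken level).
    prob += pulp.lpSum(1 - x[i, j, broken[i][j]]
                       for i in range(num_rows) for j in range(num_cols))
    prob.solve()
    return [[next(t for t in tile_types if pulp.value(x[i, j, t]) > 0.5)
             for j in range(num_cols)] for i in range(num_rows)]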
To address the realism problem, we investigated generating scenarios that match human data. For example, given a
dataset of kitchen layouts, we can learn a generative model that represents the distribution of realistic layouts. Our
work [89] showed that the latent space of such a generative model could be searched to produce human-like video game
levels that maximize agent performance, but vary in terms of gameplay behaviors.
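Concretely, this latent space illumination only changes what the QD algorithm searches over: each candidate solution is
a latent code that is decoded by the trained generator before evaluation. A minimal sketch (function names are
placeholders):

def evaluate_latent(z, generator, simulate):
    # z:         latent vector proposed by the QD algorithm
    # generator: trained generative model mapping a latent vector to a scenario
    # simulate:  runs the agent(s) in the scenario, returning (objective, measures)
    scenario = generator(z)
    objective, measures = simulate(scenario)
    return objective, measures  # fed back into the QD archive update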
Finally, we showed that both methods could be combined into one framework to evaluate human-robot coordination [88]. We evaluated our framework on the Overcooked game domain, a popular domain for testing AI coordination
systems.
1.5 Addressing Evaluation Time
Next, we look to address the limitation that complex HRI systems result in several minutes of interaction. For a scenario
generation system, this greatly increases the evaluation time required to assess a robotic system's performance on a
given scenario, slowing down QD optimization of scenarios.
Model-based quality diversity optimization addresses expensive evaluation through surrogate models that approximate
the functions f and m. Early work [97] modelled f as a Gaussian process, where maximizing an acquisition function
greatly improved search speed. However, Gaussian processes do not scale to high-dimensional scenario parameters.
Our work [292] proposed modelling the functions f and m with a deep neural network surrogate. We observed that
neural networks require a diverse and high-quality dataset to train, and that QD algorithms produce exactly such a
dataset. Conversely, QD algorithms require many evaluations to obtain good results, while evaluating a neural network
is cheap compared to running a simulation. Our key insight was to use QD algorithms to compensate for the weaknesses
of neural networks and vice-versa.
Our system exploits the surrogate model with MAP-Elites to produce a dataset of solutions that the model predicts
to be diverse and high-quality. We then simulate these solutions to label them with ground-truth objective and measure
values, and the labeled data allows the surrogate model to correct its errors. As the surrogate model becomes more
accurate, our deep surrogate assisted MAP-Elites produces increasingly diverse and high-quality solutions.
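A minimal sketch of this outer loop (function names are placeholders for the components described above, not the
dissertation's code):

def surrogate_assisted_qd(simulate, train_surrogate, run_map_elites, downsample,
                          initial_dataset, num_outer_iterations):
    # simulate(solution)       -> (objective, measures) from the real simulator
    # train_surrogate(dataset) -> model that predicts (objective, measures)
    # run_map_elites(model)    -> surrogate archive of predicted-good solutions
    # downsample(archive)      -> subset of archive solutions to ground-truth
    dataset = list(initial_dataset)
    for _ in range(num_outer_iterations):
        model = train_surrogate(dataset)           # fit surrogate on labeled data
        surrogate_archive = run_map_elites(model)  # inner loop exploits the model
        for solution in downsample(surrogate_archive):
            objective, measures = simulate(solution)  # expensive evaluation
            dataset.append((solution, objective, measures))
    return dataset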
Next, we investigated scaling surrogate models to generate complex environments [21]. The objective and measures
for an environment need to predict agent performance and behavior given the initial conditions of the environment.
However, predicting these functions directly from initial conditions is difficult for a neural network. Our DSAGE
system introduced self-supervised prediction of agent behavior, where we predict the occupancy grid of the agent's
movement. Each cell of the occupancy grid represents the probability that the agent occupied that cell during evaluation.
We evaluated our system on Mario, where we tested the state-of-the-art planning agent [17], and a Maze domain, where
we evaluated the state-of-the-art ACCEL agent [220]. Our work revealed that these agents frequently get stuck in infinite
loops in certain environments.
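As a sketch of this ancillary training target, the occupancy grid can be computed from a simulated rollout as follows
(the exact target and normalization used by DSAGE may differ):

import numpy as np

def occupancy_grid(trajectory, num_rows, num_cols):
    # trajectory: sequence of (row, col) cells visited by the agent during a rollout.
    grid = np.zeros((num_rows, num_cols))
    for row, col in trajectory:
        grid[row, col] += 1.0
    return grid / max(len(trajectory), 1)  # fraction of timesteps spent in each cell

Predicting this grid from the environment alone gives the surrogate model a self-supervised signal about agent behavior
that is richer than the scalar objective and measures.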
1.6 Conclusions and Future Work
My work proposes quality diversity scenario generation systems as a way to algorithmically generate diverse datasets of
failure scenarios to help evaluate HRI systems. I first studied this problem in shared autonomy on relatively simple
scenarios to help assess what would be needed in scenario generation systems. This identified several limitations,
which I address by improving the state-of-the-art in QD optimization, proposing MIP repair for scenarios, incorporating
generative models as learned scenario representations, and replacing expensive evaluation with surrogate models.
However, many of these components are separate, or have only been evaluated on simplified grid-world domains.
Future work will scale QD scenario generation to more complex HRI systems that take several minutes to simulate.
Moreover, we will also incorporate DQD algorithms into the inner loop of surrogate assisted scenario generators, since the
surrogate models used to predict agent behavior are end-to-end differentiable.
Finally, we need to show how QD scenario generation systems can effectively complement human subjects
experiments, accelerating progress in the research and development of HRI systems and resulting in more robust HRI.
Chapter 2
Shared Autonomy: A Case Study in Quality Diversity Scenario Generation
2.1 Introduction
We present a method for automatically generating human-robot interaction (HRI) scenarios in shared autonomy.
Consider as an example a manipulation task, where a user provides inputs to a robotic manipulator through a joystick,
guiding the robot towards a desired goal, e.g., grasping a bottle on the table. The robot does not know the goal of
the user in advance, but infers their desired goal in real-time by observing their inputs and assisting them by moving
autonomously towards that goal. Performance of the algorithm is assessed by how fast the robot reaches the goal.
However, different environments and human behaviors could cause the robot to fail, by picking the wrong object or
colliding with obstacles.
Typically, such algorithms are evaluated with human subject experiments [265]. While these experiments are
fundamental in exploring and evaluating human-robot interactions and can lead to exciting and unpredictable
behaviors, they are often limited in the number of environments and human actions that they can cover. Testing
an algorithm in simulation with a diverse range of scenarios can improve understanding of the system, inform the
experimental setup of real-world studies, and help avoid potentially costly failures “in the wild.”
One approach is to simulate agent behaviors by repeatedly sampling from models of human behavior and interaction
protocols [260]. While this approach will show the expected behavior of the system given the pre-specified models, it
is unlikely to reveal failure cases that are not captured by the models or are in the tails of the sampling distribution.
Exhaustive search of human actions and environments is also computationally prohibitive given the continuous,
high-dimensional space of all possible environments and human action sequences.
Figure 2.1: An example archive of solutions returned by the quality diversity algorithm MAP-Elites. The solutions in
red indicate scenarios where the robot fails to reach the desired user goal in a simulated shared autonomy manipulation
task. The scenarios vary with the environment (y-axis: distance between the two candidate goals) and human inputs
(x-axis: variation from optimal path). The axes units are in meters.
Another approach is to formulate the problem as an optimization problem, where the goal is to find adversarial
environments and human behaviors. But we are typically not interested in the maximally adversarial scenario, which is
the single, global optimum of our optimization objective, since trivial failures are easy to find and unlikely to occur in
the real world, e.g., the human moving the joystick of an assistive robotic arm consistently in the wrong direction.
Instead, we are interested in answering questions of the form, how noisy can the human input be before the algorithm
breaks? Or, in the aforementioned example task, how far apart do two candidate goals have to be for the robot to
disambiguate the human intent?
This chapter makes the following contributions:
1. We propose formulating the problem of generating human-robot interaction scenarios as a quality diversity (QD)
problem, where the goal is not to find a single, optimal solution, but a collection of high-quality solutions, in our case
failure cases of the tested algorithm, across a range of dimensions of interest, such as noise in human inputs and distance
between objects.
2. We adopt the QD algorithms CMA-ME [91] and MAP-Elites [53, 202] for the problem of scenario generation.
Focusing on the shared autonomy domain, where a robotic manipulator attempts to infer the user’s goal based on
their inputs, we show that MAP-Elites outperforms two baselines: standard Monte Carlo simulation (random search),
where we uniformly sample the scenario parameters, and CMA-ES [123], a state-of-the-art derivative-free optimization
algorithm, in finding diverse scenarios that minimize the performance of the tested algorithm. We select the algorithm “shared autonomy with hindsight optimization” [148] for testing, since it has been widely used and we have found it to
perform robustly in a range of different environments and tasks. Additionally, in hindsight optimization inference and
planning are tightly coupled, which makes testing particularly challenging; simply testing each individual component is
not sufficient to reveal how the algorithm will perform.
3. We show that Monte Carlo simulation does not perform well because of behavior space distortion: sampling
directly from the space of environments and human actions covers only a small region in the space of dimensions of
interest (behavioral characteristics). For example, uniformly sampling object locations (scenario parameters) makes the
distribution of their distances (behavioral characteristic) concentrated in a small region near the mean of behavior space.
MAP-Elites focuses on exploring the space of the behavioral characteristics by retaining an archive of high-performing
solutions in that space and perturbing existing solutions with small isotropic Gaussian noise. Therefore, MAP-Elites
performs a type of simultaneous search guided by the behavioral characteristics, where solutions in the archive are used
to generate the next candidate solutions [202]. CMA-ME performs even better in heavily distorted spaces, because it
dynamically adapts the variance of perturbations applied to the scenario parameters.
4. We analyze the failure cases and we show that they result from specific aspects of the implementation of the
tested algorithm, rather than being artifacts of the simulation environment. We use the same approach to contrast the
performance of hindsight optimization with that of linear policy blending [69] and generate a diverse range of scenarios
that confirm previous theoretical findings [272]. The generated scenarios transfer to the real world; we reproduce some of the automatically discovered scenarios on a real robot with human inputs. While some of the scenarios are expected, e.g., the robot approaches the wrong goal if the human provides very noisy inputs, others are surprising, e.g., the robot never
reaches the desired goal even for a nearly optimal user if the two objects are aligned in column formation in front of the
robot (Fig. 2.1).
QD algorithms treat the algorithm being tested as a “black box”, without any knowledge of its implementation, which
makes them applicable to multiple domains. Overall, this work shows the potential of QD to facilitate understanding of
complex HRI systems, opening up a number of scientific challenges and opportunities.
2.2 Background
2.2.1 Automatic Scenario Generation
Automatically generating scenarios is a long-standing problem in human training [134], with the core challenge being
the generation of realistic scenarios [192]. Prior work [297] has shown that optimization methods can be applied to
generate scenarios by maximizing a scenario quality metric.
Scenario generation has been applied extensively to evaluating autonomous vehicles [9, 205, 2, 237, 102]. Contrary
to model-checking and formal methods [42, 214], which require a model describing the system’s performance such
as a finite-state machine [196] or process algebras [214], black-box approaches do not require access to a model.
Most relevant are black-box falsification methods [61, 294, 155] that attempt to find an input trace that minimizes
the performance of the tested system. Rather than searching for a single global optimum [61, 62], or attempting to
maximize coverage of the space of scenario parameters [294] or of performance boundary regions [205], we propose a
quality diversity approach where we optimize an archive formed by a set of behavioral characteristics, with a focus on
evaluating closed-loop shared autonomy systems. This allows us to simultaneously search for human-robot interaction
scenarios that minimize the performance of the system over a range of measurable criteria, e.g., over a range of variation
in human inputs and distance between goal objects.
Scenario generation is closely related to the problem of generating video game levels in procedural content
generation (PCG) [132, 246]. An approach gaining popularity is procedural content generation through quality diversity
(PCG-QD) [113], which leverages QD algorithms to drive the search for interesting and diverse content. Of relevance
is previous work [201] which procedurally generates objects of varying shape complexity and grasp difficulty via
QD algorithms. We note that [191] makes a distinction between procedural modelling systems [250] and scenario
generation systems. Their approach leveraged functional Lindenmayer systems [194] to generate complex terrains
(a procedural modelling system) and incorporated these techniques into a system capable of generating full training
Figure 2.2: Single objective optimization algorithms aim to converge to an extreme point (e.g. a maximum or minimum)
for a given objective function (the arrow represents an objective function). Multi-objective optimization algorithms
search for extreme points across two or more competing objectives (each arrow represents an objective). Multi-objective
algorithms report Pareto-optimal solutions representing the trade-off in optimizing the competing objective functions.
Diversity-driven algorithms treat functions as measures instead of objectives (each double arrow represents a measure
function). The goal of a diversity-driven algorithm is to find a solution for each combination of possible outputs from
the measure functions. QD algorithms allow for a single objective function and multiple measure functions (the y-axis
arrow represents an objective and the x-axis double arrow represents a measure). The goal is to find a solution for each
combination of outputs from the measure functions while also maximizing the objective function for each combination.
The blue shading shows the desired result for each class of optimization.
scenarios (a scenario generation system). Similarly, the object generation system [201] could be incorporated into a
scenario generation system that not only generates difficult to grasp objects, but also describes both the positions and
orientations of all objects in a scene. Our work differs by focusing on HRI applications, where in addition to static
objects in the environment, scenarios also include time-dependent human input sequences.
2.2.2 Quality Diversity and MAP-Elites.
QD algorithms differ from pure optimization methods, in that they do not attempt to find a single optimal solution, but
a collection of good solutions that differ across specified dimensions of interest. For example, QD algorithms have
generated video game levels of varying number of enemies or tile distributions [164, 89] and images of different classes
that fool neural network classifiers [208]. QD algorithms are frequently confused with multi-objective optimization
algorithms.
Fig. 2.2 clarifies the differences between each class of optimizer. We examine the shared autonomy domain as an
example, where the goal of the robot is to infer the user’s goal and assist the user in reaching that goal. If the objective
is maximizing scenario difficulty, single-objective optimization algorithms will find a single, maximally adversarial
scenario, e.g., by placing the objects very close to each other. Multi-objective optimization will attempt to find a Pareto
front of scenarios. For example, if we set as a second objective to minimize the distance between objects, all scenarios
that are challenging but where the objects are far from each other will be discarded in favor of equally hard or harder scenarios where the distance between objects is small. Diversity-driven algorithms replace the objective function with measure
functions and return solutions for each combination of measure outputs. If we consider distance between objects and
scenario difficulty as measures, a diversity-driven algorithm will attempt to find scenarios of varying difficulty and
distance, resulting in a large number of easy scenarios. Quality diversity algorithms, on the other hand, have both an
objective and measures. By maximizing scenario difficulty as an objective and distance between objects as a measure,
we will find scenarios that are challenging but for varying object distance values. More generally, quality diversity
allows us to focus the search on challenging scenarios that are diverse with respect to measures of interest.
MAP-Elites [202, 53] is a popular QD algorithm that searches along a set of explicitly defined measures called
behavior characteristics (BCs), which induce a Cartesian space called a behavior space. The behavior space is
tessellated into uniformly spaced grid cells. In each cell, the algorithm maintains the highest performing solution, which
is called an elite. The collection of elites returned by the algorithm forms an archive of solutions.
MAP-Elites populates the archive by first randomly sampling a population of solutions, and then selecting the elites
– which are the top performing solutions in each cell of the behavior space – at random and perturbing them with small
variations. The objective of the algorithm is two-fold: maximize the number of filled cells (coverage) and maximize the
quality of the elite in each cell.
By retaining an archive of high-performing solutions and perturbing existing solutions with small variations,
MAP-Elites simultaneously optimizes every region of the archive, using existing solutions as “stepping stones” to
find new solutions. Previous work has shown that solving each cell simultaneously with MAP-Elites variants [203] or
surrogate assisted quality diversity algorithms [97, 81] outperform independently solving single-objective constrained
optimization problems for each cell with CMA-ES.
Recent algorithms have built upon MAP-Elites to improve search efficiency, by exploring how the behavior space is
tessellated [252, 92], or leveraging gradient information when the objective function and behavior characteristics are first-order differentiable [87]. Rather than sampling solutions from a fixed Gaussian, the Iso+LineDD operator [275] exploits
correlations between cells in the archive by sampling two archive solutions, then blending a Gaussian perturbation with
a noisy interpolation. We refer to MAP-Elites with an Iso+LineDD operator as MAP-Elites (line).
2.2.3 CMA-ES.
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a second-order derivative-free optimizer for single-objective optimization of continuous spaces [123]. The algorithm belongs to a family of algorithms named evolution
strategies (ES), which specialize in optimizing continuous spaces. CMA-ES maintains a multi-variate Gaussian which
models the optimal search direction in high-dimensional space with uncertainty. CMA-ES updates this estimate by
sampling several directions from the multi-variate Gaussian during each optimization step and updating the Gaussian
distribution to maximize the likelihood of future successful search steps [31].
2.2.4 Coverage-Driven Testing in HRI.
Prior work [7, 8] explored test generation in human-robot interaction using Coverage-Driven Verification (CDV),
emulating techniques used in functional verification of hardware designs. Human action sequences were generated in advance, either randomly or with a model-based generator that modeled the interaction with Probabilistic Timed
Automata. Instead, we focus on online scenario generation by searching over a set of scenario parameters; the generator
itself is agnostic to the underlying HRI algorithm and human model. Previous work [8] has also used Q-learning to
generate plans for an agent in order to maximize coverage. Our focus is on both the coverage and the quality of generated
scenarios, with respect to a prespecified set of behavioral characteristics that we want to cover. In contrast to previous
studies that simulate human actions, we jointly search for environments and human/agent behaviors.
2.2.5 Shared Autonomy.
Shared autonomy (also: shared control, assistive teleoperation) combines human teleoperation of a robot with intelligent
robotic assistance. It has been applied in the control of robotic arms [147, 68, 209, 204, 133, 111, 143, 149, 188], the
flight of UAVs [234, 105, 174], and robot-assisted surgery [181, 235]. It has been used with a variety of interfaces, such
as whole body motions [68], natural language [67], laser pointers [276], brain-computer interfaces [204], body-machine
interfaces [145] and eye gaze [23, 147]. Shared autonomy first predicts the human’s goal, often through machine
learning methods trained from human demonstrations [130, 169, 285], or through maximum entropy inverse optimal
control [296]. Second, shared autonomy provides assistance, which often involves blending the user’s input with the
robot assistance to achieve the predicted goal [68, 76, 168]. Assistance can also provide task-dependent guidance [1],
manipulation of objects [149], or mode switches [133].
2.2.6 Shared Autonomy via Hindsight Optimization
In shared autonomy via hindsight optimization [148] assistance blends user input and robot control based on the
confidence of the robot’s goal prediction. The problem is formulated as a Partially Observable Markov Decision
Process (POMDP), wherein the user’s goal is a latent variable. The system models the user as an approximately optimal
stochastic controller, which provides inputs so that the robot reaches the goal as fast as possible. The system uses
the user’s inputs as observations to update a distribution over the user’s goal, and assists the user by minimizing the
expected cost to go – estimated using the distance to goal – for that distribution. Since solving a POMDP exactly is
intractable, the system uses the hindsight optimization (QMDP) approximation [186]. The system was shown to achieve
significant improvements in efficiency of manipulation tasks in an object-grasping task [148] and more recently in a
feeding task [147]. We refer to this algorithm simply as hindsight optimization.
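As a rough illustration of the inference and assistance steps described above (a simplification, not the original QMDP implementation), the posterior over goals can be updated from each observed input under a noisily optimal user model, and the assistive action can point toward the belief-weighted goals. The names, gain, and Boltzmann temperature below are assumptions for illustration.

import numpy as np

def update_goal_belief(belief, x_robot, u_human, goals, beta=5.0):
    # One Bayesian update of P(goal) from an observed user input (simplified).
    costs = np.array([np.linalg.norm(u_human) + np.linalg.norm(x_robot + u_human - g)
                      for g in goals])
    likelihoods = np.exp(-beta * (costs - costs.min()))  # shift for numerical stability
    posterior = np.asarray(belief) * likelihoods
    return posterior / posterior.sum()

def assist_velocity(belief, x_robot, goals, speed=0.05):
    # Autonomous action: move toward the goals weighted by the current belief.
    direction = sum(p * (g - x_robot) for p, g in zip(belief, goals))
    norm = np.linalg.norm(direction)
    return speed * direction / norm if norm > 1e-9 else np.zeros_like(x_robot)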
2.3 Problem Statement
Given a shared autonomy system where a robot interacts with a human, our goal is to generate scenarios that minimize
the performance of the system, while ensuring that the generated scenarios cover a range of prespecified measures.
We let R be a single robot interacting with a single human H. We assume a function GH that generates human
inputs, a function GE that generates an environment and an HRI algorithm GR that generates actions for the robot. The
human input generator is parameterized by θ ∈ R^{nθ}, where nθ is the dimensionality of the parameter space, while the environment generator is parameterized by ϕ ∈ R^{nϕ}. We define a scenario as the tuple (θ, ϕ).
In shared autonomy, GE(ϕ) generates an initial environment (and robot) state xE = GE(ϕ). The human observes
xE and provides inputs to the system uH = GH(xE, θ) through some type of interface. The robot observes xE and the
human input uH and takes an action uR = GR(xE, uH). The state changes with dynamics ẋE = h(xE, uR). H and R
interact for a time horizon T, or until they reach a final state xf ∈ XE.
To evaluate a scenario, we assume a function f(xE^{0..T}, uR^{0..T}, uH^{0..T}) → R that maps the state and action history to a real number. We call this an assessment function, which measures the performance of the robotic system. We also assume M user-defined functions, bi(xE^{0..T}, uR^{0..T}, uH^{0..T}) → R, i ∈ [M]. These functions measure aspects of generated scenarios that should vary, e.g., noise in human inputs or distance between obstacles. We call these functions behavior characteristics (BCs), which induce a Cartesian space called a behavior space.
Given the parameterization of the environment and human input generators, we can map a value assignment of the parameters (θ, ϕ) to a state and action history (xE^{0..T}, uR^{0..T}, uH^{0..T}) and therefore to an assessment f(θ, ϕ) and a set of BCs b(θ, ϕ). We assume that the behavior space is partitioned into N cells, which form an archive of scenarios, and we let (θi, ϕi) be the parameters of the scenario occupying cell i ∈ [N].
The objective of our scenario generator is to fill in as many cells of the archive as possible with scenarios of high
assessment f:*

M(θ1, ϕ1, ..., θN, ϕN) = max Σ_{i=1}^{N} f(θi, ϕi)    (2.1)
2.4 Scenario Generation with MAP-Elites
Algorithm 1 shows the MAP-Elites algorithm from [202, 53], adapted to the problem of scenario generation. The
algorithm takes as input a function GH parameterized by θ that generates human inputs, a function GE parameterized
*We note that the assessment function could be any performance metric of interest, such as time to completion or the robot's minimum distance to
obstacles. Additionally, while in this work we focus on minimizing performance, we could instead search for scenarios that maximize performance,
or that achieve performance that matches a desired value. We leave this for future work.
Algorithm 1 Scenario Generation with MAP-Elites
Input: Human input generator GH, environment generator GE, HRI algorithm GR, variations σθ, σϕ
Initialize: Scenarios in archive X ← ∅, assessments F ← ∅
for t = 1, . . . , N do
    if t < Ninit then
        Generate scenario (θ, ϕ) ← random_generation()
    else
        Select elite (θ′, ϕ′) ← random_selection(X)
        Sample θ ∼ N(θ′, σθ)
        Sample ϕ ∼ N(ϕ′, σϕ)
    end if
    Instantiate GH^θ ← GH(θ)
    Instantiate GE^ϕ ← GE(ϕ)
    Simulate GH^θ and GR on environment GE^ϕ
    Compute f ← assessment(GH^θ, GE^ϕ, GR)
    Compute b ← behaviors(GH^θ, GE^ϕ, GR)
    if F[b] = ∅ or F[b] < f then
        Update archive X[b] ← (θ, ϕ), F[b] ← f
    end if
end for
by ϕ that generates environments, and an HRI algorithm GR that generates actions for the robot. The algorithm searches
for scenarios (θ, ϕ) of high assessment values f that fill in the archive X.
For each scenario (θ, ϕ), MAP-Elites instantiates the generator functions GH and GE. For instance, ϕ could be a
vector of object positions, and θ could be a vector of waypoints representing a trajectory of human inputs, or parameters
of a human policy.
After a scenario is generated, MAP-Elites executes the scenario in a simulation environment and computes the
assessment function f and the BCs b. MAP-Elites then updates the archive if (1) the cell corresponding to the BCs
X[b] is empty, or (2) the existing scenario (elite) in X[b] has a smaller assessment function (is of lower quality) than
the new candidate scenario. The archive update populates the archive to maximize coverage as well as to improve the
quality of existing scenarios. For the first Ninit iterations, MAP-Elites generates scenarios (θ, ϕ) by randomly sampling from the parameter space. This random distribution of scenario parameters seeds the initial archive with scenarios. After the first Ninit iterations, MAP-Elites selects uniformly at random a scenario from the archive and perturbs it with a small variation. The random variation operator better explores the behavior space, compared to random search, as we show in Section 2.7.
We note that, while our experiments focus on the shared autonomy domain, the proposed scenario generation
method is general and can be applied to multiple HRI domains.
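To make the loop concrete, the following is a minimal Python sketch of Algorithm 1 with illustrative names (it is not our implementation). It assumes a black-box simulate(params) function that runs a scenario from the concatenated parameters (θ, ϕ) and returns the assessment f and the BC values.

import numpy as np

def map_elites(simulate, lower, upper, bc_ranges, resolution,
               n_evals=10_000, n_init=100, sigma=0.01, seed=0):
    # Archive of elite scenarios and their assessments, keyed by discretized BC cell.
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    archive, fitness = {}, {}

    def to_cell(bcs):
        cell = []
        for b, (lo, hi), res in zip(bcs, bc_ranges, resolution):
            frac = np.clip((b - lo) / (hi - lo), 0.0, 1.0 - 1e-9)
            cell.append(int(frac * res))
        return tuple(cell)

    for t in range(n_evals):
        if t < n_init or not archive:
            params = rng.uniform(lower, upper)              # seed the archive with random scenarios
        else:
            keys = list(archive)
            elite = archive[keys[rng.integers(len(keys))]]  # uniform-random elite selection
            params = elite + rng.normal(0.0, sigma, size=lower.size)
            params = np.clip(params, lower, upper)          # (the experiments resample instead of clipping)
        f, bcs = simulate(params)                           # black-box evaluation of the scenario
        cell = to_cell(bcs)
        if cell not in fitness or fitness[cell] < f:        # archive update rule of Algorithm 1
            archive[cell], fitness[cell] = params, f
    return archive, fitness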
Algorithm 2 Scenario Generation with CMA-ME
Input: Human input generator GH, environment generator GE, HRI algorithm GR
Initialize: Archive of scenarios X ← ∅, assessments F ← ∅
Initialize population of emitters E with initial covariances Ce and means µe
for t = 1, . . . , N do
    Select emitter e from E
    Generate scenario (θ, ϕ) ← generate(µe, Ce)
    Instantiate GH^θ ← GH(θ)
    Instantiate GE^ϕ ← GE(ϕ)
    Simulate GH^θ and GR on environment GE^ϕ
    f ← assessment(GH^θ, GE^ϕ, GR)
    b ← behaviors(GH^θ, GE^ϕ, GR)
    emitter_update(e, θ, ϕ, f, b)
    if F[b] = ∅ or F[b] < f then
        Update archive X[b] ← (θ, ϕ), F[b] ← f
    end if
end for
2.5 Scenario Generation with CMA-ME
Covariance Matrix Adaptation MAP-Elites (CMA-ME) [91] is a recently proposed hybrid algorithm, which incorporates
the self-adaptation techniques of CMA-ES into MAP-Elites. CMA-ME maintains a set of individual CMA-ES-like
instances, named emitters. Each emitter focuses on exploring a different area of the behavior space, by maintaining a
multivariate Gaussian which estimates profitable search directions. Here we use a particular type of emitter, called an
improvement emitter which was shown to outperform MAP-Elites in distorted behavior spaces by finding strategies for
the game Hearthstone [91] and by searching the latent space of a generative adversarial network [89]. Improvement
emitters estimate the search direction that maximizes the expected local improvement to the MAP-Elites archive. By
adopting the adaptation mechanisms of CMA-ES, CMA-ME finds areas of behavior space that are hard to discover
without adjusting the sampling distribution.
Algorithms 2, 3 show the CMA-ME algorithm from [91], adapted to the problem of scenario generation. Algorithm 2
shows the high-level scheduling mechanism of CMA-ME. Emitters are selected in a round-robin fashion. Each emitter
generates a scenario by sampling from a Gaussian distribution N (µe, Ce). The generator functions are then instantiated
in the same way as in MAP-Elites. The human and robot are simulated in the environment, and we evaluate the
simulation to compute the assessment f and BCs b. These values, together with the scenario, are passed to the emitter update function, which updates the parameters of the emitter.
Algorithm 3 emitter_update
Input: emitter e, scenario (θ, ϕ), assessment f, behaviors b
Unpack the parents, sampling mean µe, covariance matrix Ce, and parameter set Pe from e
if X[b] = ∅ then
    ∆ ← f
    Flag that (θ, ϕ) discovered a new cell
    Add (θ, ϕ) to parents
else
    if F[b] < f then
        ∆ ← f − F[b]
        Add (θ, ϕ) to parents
    end if
end if
if sampled population is size λ then
    Sort parents by their ∆ values (prioritizing parents that discovered new cells)
    Update µe, Ce, Pe by parents
    parents ← ∅
end if
Algorithm 3 shows the implementation of the emitter update function for an improvement emitter. The function first
collects all scenarios that either discover new cells in the archive, or improve existing cells, in a set of scenarios named
parents. Once the number of sampled solutions reaches λ, the improvement emitter sorts the parents by prioritizing
parents that have discovered new cells, thus encouraging exploration of the archive. The emitter then updates its mean
and covariance, as well as a parameter set Pe that contains additional CMA-ES related parameters, such as evolution
path (see [91] for more details) by the ranking of parents.
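The ranking rule of Algorithm 3 can be sketched as follows; the emitter is a plain dictionary, the full CMA-ES covariance and evolution-path update is omitted (only the mean is recentered), and the names are illustrative rather than taken from the reference CMA-ME implementation.

import numpy as np

def emitter_update(emitter, params, f, cell, fitness, lam=37):
    # emitter: dict with keys "mean", "parents", "sampled".
    if cell not in fitness:                       # scenario discovered a new cell
        emitter["parents"].append((1, f, params))
    elif fitness[cell] < f:                       # scenario improved an existing elite
        emitter["parents"].append((0, f - fitness[cell], params))

    emitter["sampled"] += 1
    if emitter["sampled"] == lam:
        # Parents that discovered new cells rank ahead of parents that only
        # improved existing cells; ties are broken by the improvement delta.
        ranked = sorted(emitter["parents"], key=lambda p: (p[0], p[1]), reverse=True)
        if ranked:
            emitter["mean"] = np.mean([p[2] for p in ranked], axis=0)
        emitter["parents"], emitter["sampled"] = [], 0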
2.6 Generating Scenarios in Shared Autonomy
We focus on a shared autonomy manipulation task, where a human user teleoperates a robotic manipulator through a
joystick interface. The robot runs a shared autonomy algorithm, which uses the user’s input to infer the object the user
wants the robot to grasp, and assists the user by moving autonomously towards that goal.
2.6.1 Scenario Parameters
Following the specification of Sec. 2.3, we define a human input generator GH parameterized by θ and an environment
generator GE parameterized by ϕ.
Environment Generator: The environment generator GE takes as input the 2D positions gi of n goal objects (bottles),
so that ϕ = (g1, ..., gn), and places them on top of a table. We specify the range of the coordinates gx ∈ [0, 0.25]
(in meters), gy ∈ [0, 0.2] so that the goals are always reachable by the robot’s end-effector. We position the robotic
manipulator to face the objects (Fig. 2.7).
Human Input Generator: Our pilot studies with the shared autonomy system have shown that user inputs are typically
not of constant magnitude. Instead, inputs spike when users wish to “correct” the robot’s path, and decrease in magnitude
afterwards when the robot takes over.
Therefore, we specify the human input generator GH, so that it generates a set of equidistant waypoints in Cartesian
space forming a straight line that starts from the initial position of the robot’s end-effector and ends at the desired goal
of the user. At each timestep, the generator takes as input the current state of the robot (and the environment) xE, and
provides a translational velocity command uH for the robot’s end-effector towards the next waypoint, proportional to
the distance to that waypoint.
We allow for noise in the waypoints by adding a disturbance d ∈ [−0.05, 0.05] for each of the intermediate
waypoints in the horizontal direction (x-axis in Fig. 2.7). We selected m = 5 intermediate waypoints, and specified the
human input parameter θ as a vector of disturbances, so that θ = (d1, ..., d5). We note that this is only one way, out of
many, of simulating the human inputs.
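A minimal sketch of this generator follows, assuming 3D NumPy arrays for positions, a unit proportional gain, and m = 5 horizontal disturbances; names and constants are illustrative.

import numpy as np

def make_waypoints(start, goal, disturbances):
    # Equidistant waypoints from start to goal, perturbed along the horizontal x-axis.
    m = len(disturbances)                                   # number of intermediate waypoints (m = 5)
    waypoints = [start + (goal - start) * (i + 1) / (m + 1) for i in range(m)]
    for wp, d in zip(waypoints, disturbances):
        wp[0] += d                                          # disturbance d in [-0.05, 0.05] meters
    return waypoints + [goal]

def human_input(x_ee, waypoints, idx, gain=1.0, reach_tol=0.01):
    # Translational velocity command toward the current waypoint, proportional to distance.
    error = waypoints[idx] - x_ee
    if np.linalg.norm(error) < reach_tol and idx + 1 < len(waypoints):
        idx += 1                                            # advance to the next waypoint
        error = waypoints[idx] - x_ee
    return gain * error, idx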
HRI Algorithm and Simulation Environment: We use the publicly available implementation of the hindsight
optimization algorithm [172, 171], which runs on the OpenRAVE [65] simulation environment. Experiments were
conducted with a Gen2 Lightweight manipulator. For each goal object we assume one target grasp location, on the side
of the object that is facing the robot.
2.6.2 Assessment Function.
The assessment function f represents the quality of a scenario. We evaluate a scenario by simulating it until the robot’s
end-effector reaches the user’s goal, or when the maximum time (10 s) has elapsed. We use as assessment function time
to completion, where longer times represent higher scenario quality, since we wish to discover scenarios that minimize
performance.
2.6.3 Behavior Characteristics.
We wish to generate scenarios that show the limits of the shared autonomy system: how noisy can the human be
without the system failing to reach the desired goal? How does distance between candidate goals affect the system’s
performance? Intuitively, noisier human inputs and smaller distances between goals would make the inference of the
user’s goal harder and thus make the system more likely to fail.
These measures are the behavioral characteristics (BC) b: attributes that we wish to obtain coverage for. We explore
the following BCs:
Distance Between Goals: How far apart the human goal is from other candidate goals in a scenario plays an important
role in disambiguating the human intent when the robot runs the hindsight optimization algorithm. The reason is
that the implementation of the algorithm models the human user as minimizing a cost function proportional to the
distance between the robot and the desired goal. The framework then infers the user’s goal by using the user inputs
as observations; the more unambiguous the user input, the more accurate the inference of the system. Therefore, we
expect that the further away the human goal object gH is from the nearest goal gN , the better the system will perform.
We define this BC as:
BC1 = ||gH − gN ||2 (2.2)
Given the range of the goal coordinates, the range of this BC is [0, 0.32]. In practice, there will always be a minimum
distance between two goal objects because of collisions, but this does not affect our search, since we can ignore cases
where the objects collide. We partitioned this behavior space into 25 uniformly spaced intervals.
Human Variation: We expect noise in the human inputs to affect the robot’s inference of the user’s goal and thus the
system’s performance. We capture variation from the optimal path using the root sum of the squares of the disturbances
di applied to the m intermediate waypoints.
BC2 = √( Σ_{i=1}^{m} di^2 )    (2.3)
A value of 0 indicates a straight line to the goal. Since we have di ∈ [−0.05, 0.05] (Sec. 2.6.1), the range of this BC is
[0, 0.11]. We partitioned this behavior space into 100 uniformly spaced intervals.
Human Rationality: If we interpret the user’s actions using a bounded rationality model [13, 83], we can explain
deviations from the optimal trajectory of human inputs as a result of how “rational” or “irrational” the user is.†
Formally, we let xR be the 3D position of the robot’s end-effector and uH be the velocity controlled by the user in
Cartesian space. We model the user as following Boltzmann policy πH 7→ P(uH|xR, gH, β), where β is the rationality
coefficient – also interpreted as the expertise [149]– of the user and QgH is the value function from xR to the goal gH.
P(uH|xR, gH, β) ∝ e^{−β QgH(xR, uH)}    (2.4)
Let QgH = −||uH||2 − ||xR + uH − gH||2 [83], so that the user minimizes the distance to the goal. Observe that if
β → ∞, the human is rational, providing velocities exactly in the direction to the goal. If β → 0, the human is random,
choosing actions uniformly.
We can estimate the user’s rationality, given their inputs, with Bayesian inference [83]:
P(β|xR, gH, uH) ∝ P(uH|xR, gH, β)P(β) (2.5)
Since the human inputs change at each waypoint (Section 2.6.1), we perform m + 1 updates, at the starting position
and at each intermediate waypoint, over a finite set of discrete values of β. Following previous work [149], we set the
rationality range β ∈ [0, 1000]. We then choose as behavioral characteristic the value with the maximum a posteriori
probability at the end of the task:
BC3 = argmax_β P(β | xR^{0..T}, gH, uH^{0..T})    (2.6)
We partitioned the space into 101 uniformly spaced intervals.‡
†We note that we use the human rationality model as one way, out of many, to interpret human actions and not to generate actions. Human
actions can be generated with any generator model. In this paper, we use the deterministic model described in Sec. 2.6.1. We discuss extensions to
stochastic human models in Section 2.10.
‡Both BC2 and BC3 measure how noisy the human is. However, BC3 interprets the noise in human inputs under the assumption of a bounded
rationality model.
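Both human-input BCs can be computed from the logged scenario data. The sketch below is a simplified version of the inference in Eqs. (2.4)-(2.6): it writes the Boltzmann model with a distance-based cost (so that larger β means more goal-directed inputs) and normalizes the likelihood over a user-supplied discrete set of candidate inputs; the names and the β grid are illustrative.

import numpy as np

def bc_variation(disturbances):
    # BC2: root sum of the squared waypoint disturbances (Eq. 2.3).
    return float(np.sqrt(np.sum(np.square(disturbances))))

def bc_rationality(x_robot_hist, u_human_hist, goal, candidate_inputs,
                   betas=np.linspace(0.0, 1000.0, 101)):
    # BC3: MAP estimate of the rationality coefficient beta (Eqs. 2.4-2.6).
    def cost(x, u):
        return np.linalg.norm(u) + np.linalg.norm(x + u - goal)

    log_post = np.zeros_like(betas)                         # uniform prior over beta
    for x, u in zip(x_robot_hist, u_human_hist):
        c_obs = cost(x, u)
        c_all = np.array([cost(x, uc) for uc in candidate_inputs])
        scaled = -np.outer(betas, c_all)                    # log-likelihood of each candidate input
        m = scaled.max(axis=1, keepdims=True)
        log_z = (m + np.log(np.exp(scaled - m).sum(axis=1, keepdims=True))).ravel()
        log_post += -betas * c_obs - log_z                  # Bayes update in log space
    return float(betas[np.argmax(log_post)])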
2.7 Experiments
We compare different search algorithms in their ability to find diverse and high-quality scenarios in different behavior
spaces.
2.7.1 Independent Variables
The experiment has two independent variables, the behavior space and the search algorithm.
Behavior Space: (1) Distance between n = 2 goal objects (BC1) and human rationality (BC3), (2) Distance between
n = 2 goal objects (BC1) and human variation (BC2), and (3) Distance between human goal and nearest goal for n = 3
goals (BC1) and human variation (BC2).§
Search Algorithm: We evaluate five different search methods: CMA-ME, MAP-Elites, MAP-Elites (line), CMA-ES
and random search. In random search we use Monte Carlo simulation where scenario parameters are sampled uniformly
within their prespecified ranges. We implemented a multi-processing system on an AMD Ryzen Threadripper 64-core
(128 threads) processor, with a master search node and multiple worker nodes running separate OpenRAVE processes in
parallel, which enables simultaneous evaluation of many scenarios. Random search, MAP-Elites and MAP-Elites (line)
run asynchronously on the master search node, while CMA-ES and CMA-ME synchronize before each covariance
matrix update. We generated 10,000 scenarios per trial, and ran 60 trials, 5 for each algorithm and behavior space. One
trial parallelized into 100 threads lasted approximately 20 minutes.
2.7.2 Algorithm Tuning
MAP-Elites first samples uniformly the space of scenario parameters θ, ϕ within their prespecified ranges (Sec. 2.6.1)
for an initial population of 100 scenarios. The algorithm then randomly perturbs the elites (scenarios from the archive)
with Gaussian noise scaled by a σ parameter. The two scenario parameters, position of goal objects ϕ and human
waypoints θ, are on different scales, thus we specified different σ for each: σϕ = 0.01, σθ = 0.005.
MAP-Elites (line) has two parameters, σ1 for the isotropic distribution and σ2 for the directional distribution. We set σ1_ϕ = 0.01, σ1_θ = 0.005, σ2_ϕ = 0.1, and σ2_θ = 0.05. We let the size of the initial population be 100 scenarios, same as MAP-Elites.
§We note that the behavior spaces can be more than two-dimensional, e.g., we could specify a space with all three BCs. We include only 2D
spaces since they are easier to visualize and inspect.
To generate the scenarios for random search, we uniformly sample scenario parameters within their prespecified
ranges, a method identical to generating the initial population of MAP-Elites.
For CMA-ES, we selected a population of λ = 12 following the recommended setting from [123]. To encourage
exploration, we used the bi-population variant of CMA-ES with restart rules [11, 122], where the population doubles
after each restart, and we selected a large step size, σ = 0.05. Since the two search parameters are in different
scales, we initialized the diagonal elements of the covariance matrix C, so that c_ii = 1.0, i ∈ [2n] and c_ii = 0.5, i ∈
{2n + 1, ..., 2n + m}, with 2n and m the dimensionality of the goal object and human input parameter spaces
respectively.
A single run of CMA-ME deploys 5 improvement emitters, with a population λ = 37. We set σ = 0.01 identical
to MAP-Elites. Because the two search parameters are in different scales, we initialized the diagonal elements of the
covariance matrix identically to CMA-ES.
CMA-ES, MAP-Elites, and CMA-ME may all sample scenario parameters that do not fall inside their prespecified
ranges. Following recent empirical results on bound constraint handling [22], we adopted a resampling strategy, where
new scenarios are resampled until they fall within the prespecified range.
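A minimal sketch of this resampling strategy, assuming a sample_fn that draws one candidate from the current search distribution; the clipping fallback after many failed tries is an extra safeguard added here, not part of the original setup.

import numpy as np

def sample_in_bounds(sample_fn, lower, upper, max_tries=100):
    # Resample until the candidate scenario lies inside the parameter bounds.
    for _ in range(max_tries):
        candidate = sample_fn()
        if np.all(candidate >= lower) and np.all(candidate <= upper):
            return candidate
    return np.clip(candidate, lower, upper)  # safeguard fallback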
2.7.3 Performance Metrics
We wish to measure both the diversity and the quality of scenarios returned by each algorithm. These are combined
by the QD-Score metric [229], which is defined as the sum of f values of all elites in the archive (Eq. 2.1 in Sec. 2.3).
Similarly to previous work [91], we compute the QD-Score of CMA-ES and random search for comparison purposes
by calculating the behavioral characteristics for each scenario and populating a pseudo-archive. As an additional metric
of diversity we compute the coverage, that is the number of occupied cells in the archive divided by the total number of
cells. We also include a “timeout” metric, indicating the number of occupied cells that have the maximum assessment
(maximum time allowed to reach the target object, equal to 10 s) divided by the total number of cells.
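Given a populated archive (or a pseudo-archive built from the CMA-ES and random-search samples), the three metrics reduce to a few lines; here fitness is assumed to map each occupied cell to the assessment value of its elite.

def archive_metrics(fitness, total_cells, timeout_value=10.0):
    # QD-Score: sum of elite assessments; coverage: fraction of occupied cells;
    # timeout: fraction of cells whose elite reached the maximum assessment value.
    qd_score = sum(fitness.values())
    coverage = len(fitness) / total_cells
    timeout = sum(f >= timeout_value for f in fitness.values()) / total_cells
    return qd_score, coverage, timeout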
2.7.4 Hypothesis
H1. We hypothesize that the QD algorithms CMA-ME, MAP-Elites and MAP-Elites (line) will result in larger QD-Score
and coverage than both CMA-ES and random search.
Algorithm          | BC1 & BC3, 2 goals           | BC1 & BC2, 2 goals           | BC1 & BC2, 3 goals
                   | Coverage  Timeout  QD-Score  | Coverage  Timeout  QD-Score  | Coverage  Timeout  QD-Score
Random             | 22.3%     5.8%     3464      | 48.4%     14.9%    7782      | 41.9%     19.7%    7586
CMA-ES             | 24.8%     12.0%    4540      | 38.9%     20.1%    7422      | 34.5%     24.1%    7265
MAP-Elites         | 62.8%     18.5%    10128     | 63.0%     26.2%    11216     | 57.4%     32.6%    11204
MAP-Elites (line)  | 59.2%     17.8%    9540      | 64.0%     22.8%    10772     | 56.4%     32.4%    11048
CMA-ME             | 75.5%     18.3%    11768     | 70.0%     21.3%    11454     | 64.5%     32.3%    12141
Table 2.1: Results: Percentage of cells covered (coverage), percentage of cells covered that have maximum assessment
value (timeout) and QD-Score after 10,000 evaluations, averaged over 5 trials.
Previous work [92, 91] has shown that behavior spaces are typically distorted: uniformly sampling the search
parameter space results in samples concentrated in small areas of the behavior space. Therefore, we expect random
search to have small coverage of the behavior space. Additionally, since random search ignores the assessment function
f, we expect the quality of the found scenarios in the archive to be low.
CMA-ES moves with a single large population that has global optimization properties. Therefore, we expect it
to concentrate in regions of high-quality scenarios, rather than explore the archive. On the other hand, MAP-Elites,
MAP-Elites (line), and CMA-ME all expand the archive and maximize the quality of the scenarios within each cell.
H2. We hypothesize that CMA-ME will result in larger QD-Score and coverage than MAP-Elites and MAP-Elites
(line). Previous work has shown that CMA-ME achieved significantly better performance than both MAP-Elites and
MAP-Elites (line) in standard quality diversity benchmarks and the strategic game Hearthstone [91], as well as in
searching the latent space of a Generative Adversarial Network to automatically generate video game levels [91]. While
MAP-Elites perturbs solutions by sampling from a Gaussian with a fixed variance, CMA-ME adopts the dynamic
adaptation properties of CMA-ES and dynamically changes the variance of each scenario parameter by ranking a
population of previously sampled scenarios. We expect that this will result in CMA-ME exploring areas of the behavior
space that are harder to reach when sampling from a fixed Gaussian as in MAP-Elites, or when sampling with the
variation operator of MAP-Elites (line).
2.7.5 Analysis
H1. Table 2.1 summarizes the performance of the five algorithms, for each of the three behavior spaces. We conducted a
two-way ANOVA to examine the effect of the behavior space and the search algorithm on the QD-Score and coverage.
There was a statistically significant interaction between the search algorithm and the behavior space for both QD-Score
(F(8, 60) = 26.52, p < 0.001) and coverage (F(8, 60) = 46.06, p < 0.001). Simple main effects analysis with
[Figure 2.3 panels: archive heatmaps for CMA-ME, MAP-Elites, Random Search, and CMA-ES. y-axis: Distance (0 to 0.32 m); x-axis: Rationality (0 to 1000) in the top row and Variation (0 to 0.11 m) in the middle and bottom rows; color scale: time to completion (0 to 10 s).]
Figure 2.3: Example archives returned by four of the algorithms for the three behavior spaces of Table 2.1. (Top) BC1 &
BC3 (Middle) BC1 & BC2, 2 goals (Bottom) BC1 & BC2, 3 goals. The colors of the cells in the archives represent
time to task completion in seconds. The axes units are in meters.
Bonferroni correction showed that MAP-Elites, MAP-Elites (line) and CMA-ME outperformed CMA-ES and random
search in both QD-Score and coverage (p < 0.001 in all comparisons). This result supports hypothesis H1.
Fig. 2.3 shows example archives from each algorithm, while Fig. 2.4 and Fig. 2.5 show the evolution of QD-Score
and coverage for each algorithm and behavior space. MAP-Elites, MAP-Elites (line), and CMA-ME visibly find more
cells and of higher quality (red color), illustrating their ability to cover larger areas of the archive with high-quality
scenarios. As expected, CMA-ES concentrates in regions of high-quality scenarios but has small coverage. We observe
that MAP-Elites (line) performs similarly to MAP-Elites in this domain.
Random search covers a smaller area of the map, compared to MAP-Elites and CMA-ME, because of the behavior
space distortion (highly nonuniform scenario distribution), shown in Fig. 2.6. Even though the search parameters
are sampled uniformly, scenarios are concentrated on the left side of the archive specified by the human rationality
and distance between goals BCs (Fig. 2.6-top). This occurs because if any of the sampled waypoints deviates from
the optimal path, low values of rationality become more likely. In the human variation and distance
Figure 2.4: QD-Score over evaluations for each algorithm and behavior space of Table 2.1.
Figure 2.5: Coverage over evaluations for each algorithm and behavior space of Table 2.1.
between goals BCs, the distribution of scenarios generated by random search is concentrated in a small region near
the center (Fig. 2.6-bottom). This is expected, since the two BCs are Euclidean norms of random vectors [277]. On
the other hand, MAP-Elites selects elite scenarios from the archive and perturbs them with Gaussian noise, instead of
uniformly sampling the scenario parameters, resulting in larger coverage.
Fig. 2.3 shows that all the algorithms explore a smaller part of the archive for the case of n = 3 goals, compared to
n = 2 goals. This is because we specify as BC for n = 3 the distance between the user’s goal and the nearest goal
(Eq. 2.2). When n = 3, larger distance values require both alternative goals to be sampled far from the user’s goal,
making the corresponding regions in the archive harder to explore.
H2. We compare the performance of CMA-ME against MAP-Elites and MAP-Elites (line): simple main effects
analysis with Bonferroni corrections showed that CMA-ME achieved better QD-Score than MAP-Elites (p = 0.001)
and MAP-Elites (line) (p = 0.01) in the first behavior space (BC1 & BC3). CMA-ME achieved better coverage than both algorithms (p < 0.001) in the first behavior space. The differences between CMA-ME and the two MAP-Elites variants were not significant for QD-Score and coverage in the second and third behavior spaces.
Figure 2.6: Distribution of cells explored for random search, MAP-Elites and CMA-ME. The cell colors represent
frequency counts. The axes units are in meters.
We observe that CMA-ME is particularly effective in the first behavior space, which is heavily distorted. As
CMA-ME adapts the sampling distributions of its emitters, the algorithm can shrink the variance of perturbations of
individual scenario parameters to efficiently find and optimize scenarios across the spectrum of the rationality measure.
As MAP-Elites uses fixed isotropic Gaussian noise, it can neither shrink the variance of the trajectory disturbances to reach near-maximum rationality nor expand the variance to find human behaviors of intermediate rationality (Fig. 2.6).
Improvement emitters form the estimate of profitable search directions based on how the archive changes during
updates. However, the archive quickly fills with scenarios that "timeout" by achieving the maximum assessment value (red cells in Fig. 2.3) with low rationality scores. The lack of change in these regions leads improvement emitters to direct their search to the edges of the behavior space, where they succeed in finding scenarios with longer execution
times. This results in spending less compute searching in the red regions, compared to MAP-Elites, which explains the
difference in the timeout values between the two algorithms. Running CMA-ME with a combination of improvement
emitters and optimizing emitters (see [91]) could balance the trade-off between covering the archive and improving
existing solutions.
Figure 2.7: (Left) The robot fails to reach the user’s goal gH because of the large deviation in human inputs from the
optimal path. The waypoints of the human inputs are indicated with green color. (Center) We show for comparison
how the robot would act if human deviation was 0 (optimal human). (Right) The robot fails to reach the user’s goal gH
(bottle furthest away from the robot), even though the human provides a near optimal input trajectory.
Observe that, while the differences between the algorithms in these metrics were not significant for the second and third behavior spaces at the 10,000-evaluation mark, CMA-ME makes faster progress than MAP-Elites and MAP-Elites (line) (Fig. 2.4, 2.5). Running the algorithms for more iterations could increase the difference in performance, and we plan to explore this further in the future.
Overall, our analysis shows that CMA-ME is particularly effective in finding hard to reach cells in the archive, and
achieves better performance than MAP-Elites and MAP-Elites (line) in the first behavior space.
2.7.6 Interpreting the Archives
In the generated archives (Fig. 2.3), each cell contains an elite, which is the scenario that achieved the
maximum assessment value (time to complete the task) for that cell. We can therefore replay the elites in different
regions of the archives to explain the system’s performance. We focus on the first two behavior spaces using the archives
generated with MAP-Elites in Fig. 2.3.
We observe that if the distance between goals is large and the human is nearly optimal, the robot performs the task
efficiently. This is shown by the blue color in the top-right of the first behavior space (distance and human rationality
β). We observe the same for large distance and small variation in the second behavior space.
We then explore different types of scenarios where the robot fails to reach the user’s goal by the maximum time
(10s), indicated by the red cells in the archives. When human variation is large (or equivalently rationality is low),
the human may provide inputs that guide the robot towards the wrong goal. Since the robot updates a probability
distribution over goals based on the user’s input [148] and the robot assumes that the user minimizes their distance to
their desired goal, noisy inputs may result in assigning a higher probability to the wrong goal and the robot will move
Figure 2.8: Example failure scenario for n = 3 goals. Having two alternative goal objects instead of one increases the
probability of failure.
towards that goal instead. Fig. 2.7(left) shows the execution trace of one elite where this occurs. Fig. 2.1 shows the
position of this elite in the archive. Fig. 2.7(center) shows how the robot would reach the desired goal if the human had
behaved optimally, instead.
What is surprising, however, is that the robot does not reach the user’s goal even in parts of the behavior space
where human variation is nearly 0 (or equivalently rationality is very high), that is when the human provides a nearly
optimal input trajectory! Fig. 2.7(right) reveals a case, where the two goal objects are aligned one closely behind the
other. The robot approaches the first object, on the way towards the second object, and stops there.
This is more prevalent in the case of three goals. This is expected, since when there are two alternative goal objects
sampled uniformly, there is a higher probability of at least one of the objects being close to the robot’s path (Fig. 2.8).
This explains the larger ratio of failure scenarios in the archives for 3 goals, compared to the archives for 2 goals.
What is interesting in both scenarios is that the robot gets “stuck” at the wrong goal, even when the simulated
user continues providing inputs to their desired goal! Inspection of the publicly available implementation [172] of the
algorithm shows that this results from the combination of two factors: the robot’s cost function and the human inputs.
Cost Function. The cost function that the robot minimizes is specified as a constant cost when the robot is far away
from the goal and as a smaller linear cost when the robot is near the target [147] (distance to target is smaller than a
threshold). This makes the cost of the goal object near the robot significantly lower than the cost of the other goal
objects, which results in the probability mass of the goal prediction concentrating on that goal. While this can help
the user align the end-effector with the object [147], it can also lead to incorrect inference, if the robot approaches the
wrong goal on its way towards the correct goal or because of noisy human input. We confirmed that removing the linear
term from the cost function results in the robot reaching the right goal in both examples.
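Schematically, the per-goal cost described above has the shape sketched below; the threshold, constant, and slope are placeholders rather than the values used in the implementation.

def goal_cost(dist_to_goal, threshold=0.1, far_cost=1.0, slope=0.5):
    # Constant cost when the end-effector is far from the goal; a smaller,
    # linear cost once it is within the threshold. The cheaper near-goal branch
    # is what lets a goal close to the end-effector dominate the goal posterior.
    if dist_to_goal >= threshold:
        return far_cost
    return slope * dist_to_goal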
Human Inputs. The hindsight optimization implementation minimizes a cost function specified as the sum of two
quadratic terms, the expected cost-to-go to assist for a distribution over goals, and a term penalizing disagreement with
the user’s input. The first-order approximation of the value function leads to an interpretation of the final robot action
uR = uR^A + uR^u as the sum of two independent velocity commands, an "autonomous" action uR^A towards the distribution over goals and an action uR^u that follows the user input, as if the user was directly teleoperating the robot [147].
We have simulated the human inputs, so that they provide a translational velocity command towards the next
waypoint, proportional to the distance of the robot’s end-effector to that waypoint (Sec. 2.6.1). This results in a term
uR^u of small magnitude when the end-effector is close to one of the waypoints. If at the same time the robot has high confidence on one of the goals, uR^A will point in the direction of that goal and it will cancel out any term uR^u that attempts to move the robot in the opposite direction.
We confirmed that, if the user instead applied the maximum possible input towards their desired goal, the robot
would get “unstuck,” so a real user would always be able to eventually reach their desired goal. However, this requires
effort from the user who would need to “fight” the robot.
Overall, the archive reveals limitations that depend on how the goal objects are aligned in the environment, the
direction and magnitude of user inputs, and the cost function used by the implementation of the hindsight optimization
algorithm.
2.8 Varying Grasp Poses of Goal Objects
In Sec. 2.6, we have assumed that all goal objects have a target (grasp) pose with the same orientation, facing the robot.
However, in the general case, objects have different grasp poses, which are typically precomputed using the shapes of
the object and the robot’s end-effector. For example, the robot would need to align its end-effector with the handle of a
coffee mug in order to grasp it successfully, and different mugs in the environment may have different rotations. We
extend our search of failure scenarios to account for different grasp rotations.
Figure 2.9: (a) Archive generated with MAP-Elites for the horizontal distance and angular difference BCs. (b) Execution
trace of scenario annotated with a black circle in the archive. While the angle difference is small, the robot fails to reach
the user’s goal since it deviates from the optimal path while rotating. (c) Execution trace of the same scenario if angles
of both target grasps were set to 0. The robot succeeds in reaching the user’s goal. (d) Execution trace after setting the
grasp angle of the wrong goal to 0, resulting in large angle difference between the two target grasps. The large angle
difference helps with inference early on and eventually guides the robot towards the correct goal. The y-axis units are in
meters and the x-axis units in radians.
Experiment. We search for two target poses, one for each goal object. We parameterize a target pose in SE(2) using
two variables, gx, gy for translation and a variable gω for rotation about the z-axis (Fig. 2.9b), i.e., a pose is a vector in
R^3. We set the range of the scenario parameters ϕ = (gx, gy, gω), gx ∈ [0, 0.15], gy ∈ [0, 0.1], gω ∈ [−π/3, 0], so that
all target poses within the range are always reachable by the robot.
When the goal objects have target poses with different rotations, the shared autonomy implementation uses both the
user’s translational and rotational inputs to infer their goal [147]. When implemented on a real robot, rotational inputs
are applied in the form of angular velocity commands, e.g., by switching modes in a joystick interface [133, 147], or
through motion game controllers [74].
Therefore, in addition to the translational velocity commands between the human waypoints provided by the
simulated user, we add an angular velocity command about the z-axis proportional to the angular distance between
the pose of the robot’s end-effector and the grasp pose of the user’s goal. To isolate the effect of the angular velocity
commands, we did not apply any disturbances to the waypoints, so that the simulated user was always optimal. This
resulted in a (2n + 1)-dimensional scenario parameter space.
Sec. 2.6 showed that the horizontal distance between the two objects affects the system’s performance, since the
robot may get “stuck” if it gets close to the wrong goal. We wish to explore whether the angular difference between the
target poses of the goal objects has a similar effect: if the difference is large, we expect the angular velocity commands
provided by the user to disambiguate the two goals and facilitate the system’s inference. Therefore, we specified as BCs
both the horizontal distance (x-axis, with range [0, 0.15]) and the angular difference between the targets of the two goal
objects, with range [0, π/3]. The resolution of the archive was 15 × 60.
Analysis. Fig. 2.9a shows the archive from one run of MAP-Elites for 10,000 evaluations. We observe that while
horizontal distance plays an important role in performance as we anticipated, illustrated by the blue color of the cells
when distance is large, there is not a similar effect with the angular difference between the two grasps. Replaying the
scenarios in the archive shows that both the angular difference between the two grasp poses and the absolute value of
the angles played a role. This is illustrated in the execution trace in Fig. 2.9b, representing the annotated scenario in the
archive of Fig. 2.9a.
In this scenario the angular difference is very small, equal to 0.05 (rad), while the absolute angle value of the target
pose is -0.87 (rad), requiring the robot’s end-effector to rotate significantly from its starting angle of 0. However, we
observe that rotating the end-effector results in the robot deviating from the optimal path; this is caused by approximation
errors in computing the joint velocities from the angular velocities, where the computation is done by formulating the
problem as a quadratic optimization problem with box constraints [173]. This results in the angular velocity commands
having a similar effect to noise in the human inputs, and the system makes incorrect inference of the user’s goal. We
verified that if the target rotations matched the initial rotation of the robot’s end-effector, so that the simulated user
provided only linear velocity commands, the robot successfully reached the user’s goal (Fig. 2.9c). Finally, we set the
grasp angle of the wrong goal to 0, making the angle difference between the two objects equal to 0.87. The large angle
difference enables the robot to infer the user’s goal early on from the human rotational inputs and stay close to the
optimal path, thus reaching the desired goal (Fig. 2.9d). (We note that the scenarios in Fig. 2.9c, 2.9d are not from the archive. Instead, we created each scenario manually to show the effect of grasp orientation on the performance of the shared autonomy system.)
Overall, we observe that if the grasp targets have a large angular difference, rotational inputs by the user towards
one of the targets may facilitate the robot’s inference. However, this effect is mitigated by errors in approximating the
joint velocities from the angular velocity commands, resulting in displacements from the optimal path.
2.9 Comparing Algorithms
Given the effectiveness of quality diversity algorithms in automatically generating a diverse range of test scenarios, we
propose QD as a method to contrast the performance of different algorithms. We continue using the shared autonomy domain as an example, and we compare hindsight optimization [148] against policy blending [68].
Policy Blending. Policy blending is a widely used alternative to POMDP-based methods in shared autonomy, since it provides an intuitive interface for combining robotic assistance with user inputs [34, 182, 111, 204,
68, 143]. In this approach, the robot’s and user’s actions are treated as two independent sources. The final robot motion
is determined by an arbitration function that regulates the two inputs, based on the robot’s confidence about the user’s
goal. A special case of policy blending is linear blending. Theoretical analysis of linear blending [272] has shown that
it can lead to unsafe behavior in the presence of obstacles.
We want to empirically confirm these theoretical findings and assess how hindsight optimization and linear policy
blending perform in the presence of obstacles. We thus created a new environment with one goal object and a second
object that acts as an obstacle. Since there is only one unique goal object, there will be no failure cases arising from
incorrect inference as in Sec. 2.6.
In linear blending, robot and human inputs are weighted by the arbitration coefficients α and 1 − α respectively.
Previous work [68] considers two types: an “aggressive” blending, where the robot takes over (α = 1.0) if confidence is
above a threshold (conf ≥ 0.4), and a “timid” blending where α linearly increases from 0 to 0.6 for 0.4 ≤ conf ≤ 0.8.
When we have only one goal, confidence is always 1.0 and “aggressive” blending and hindsight optimization are
identical. Therefore, we consider only the “timid” mode which sets α to 0.6.
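For illustration, the sketch below shows one way "timid" linear blending could be implemented; the helper names are ours, but the arbitration schedule follows the description above (α ramping from 0 to 0.6 for 0.4 ≤ conf ≤ 0.8).

```python
import numpy as np

def timid_alpha(confidence):
    """'Timid' arbitration: alpha increases linearly from 0 to 0.6
    as confidence goes from 0.4 to 0.8, and saturates outside that range."""
    return 0.6 * np.clip((confidence - 0.4) / 0.4, 0.0, 1.0)

def blend(u_human, u_robot, confidence):
    """Linear policy blending of human and robot velocity commands."""
    alpha = timid_alpha(confidence)
    return alpha * np.asarray(u_robot) + (1.0 - alpha) * np.asarray(u_human)

# With only one goal, confidence is always 1.0, so alpha = 0.6.
print(blend(u_human=[0.1, 0.0, 0.0], u_robot=[0.0, 0.1, 0.0], confidence=1.0))
```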
Obstacle Modeling. While the hindsight optimization implementation [172] models the user as minimizing the
Euclidean distance between the robot’s end-effector and their goal, this is no longer applicable in the presence of
obstacles, since a direct path to the goal may collide with the obstacle. Exact computation of the value function is
also infeasible, given that the state and action spaces are continuous and the robot acts in real-time. We simplify the
(a) Policy Blending (b) Hindsight Optimization
Figure 2.10: Archives generated with MAP-Elites for the policy blending and hindsight optimization algorithms. The
colors of the cells in the archives represent time to task completion in seconds. The axes units are in meters.
computation by modeling the obstacle as a sphere and, if a direct path to the goal intersects with the sphere, we find the
intersecting points and approximate the value function as the length of the shortest path to the goal that wraps around
the sphere. We do the same for the human waypoints (before adding any disturbances), so that an optimal human
will follow the shortest collision-free path to the goal. For robustness, if the end-effector touches the surface of the
sphere, it receives an additional velocity command in the direction away from the center of the sphere, similarly to
a potential field [48]. We found the sphere approximation simplification adequate for detecting collisions with the
hindsight optimization and policy blending algorithms.
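The sketch below illustrates one way to compute the sphere-wrapping distance described above; it assumes both endpoints lie outside the sphere and works in the plane containing the endpoints and the sphere center. Function names are illustrative, not from our implementation.

```python
import numpy as np

def approx_dist_around_sphere(p, g, center, radius):
    """Approximate shortest collision-free distance from p to g around a sphere.

    If the straight segment p->g does not intersect the sphere, return the
    Euclidean distance. Otherwise, the shortest path consists of two tangent
    segments plus an arc that wraps around the sphere (both endpoints are
    assumed to lie outside the sphere).
    """
    p, g, center = map(np.asarray, (p, g, center))
    d = g - p
    seg_len = np.linalg.norm(d)
    # Distance from the sphere center to the segment p->g.
    t = np.clip(np.dot(center - p, d) / np.dot(d, d), 0.0, 1.0)
    closest = p + t * d
    if np.linalg.norm(center - closest) >= radius:
        return seg_len  # direct path is collision free

    dp = np.linalg.norm(p - center)
    dg = np.linalg.norm(g - center)
    # Tangent segment lengths from each endpoint to the sphere.
    tp = np.sqrt(max(dp**2 - radius**2, 0.0))
    tg = np.sqrt(max(dg**2 - radius**2, 0.0))
    # Central angle between the endpoints, minus the angles spanned by the tangents.
    total = np.arccos(np.clip(np.dot(p - center, g - center) / (dp * dg), -1.0, 1.0))
    arc = max(total - np.arccos(radius / dp) - np.arccos(radius / dg), 0.0)
    return tp + tg + radius * arc

# Example: goal directly behind a sphere of radius 0.05 m.
print(approx_dist_around_sphere([0, 0, 0], [0.4, 0, 0], [0.2, 0, 0], 0.05))
```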
Experiment. We fix the position of the goal object and the obstacle in the y-axis, so that the obstacle o is always
between the robot and the goal, and we search for the coordinates (ϕ = (gx, ox)) in the x-axis (Fig. 2.11). Identically to
Sec. 2.6, we search for the disturbances to the human inputs (θ = (d1, ..., d5)). We specify three BCs: the horizontal
distance between the two objects in the x-axis, the human variation – identically to Sec. 2.6 – and a binary BC,
BCcollision ∈ {True, False} indicating whether the robot’s end-effector has collided with the obstacle. As before,
we set the scenario assessment function f to the time to completion, with a maximum time limit of 15s. The task
terminates if the robot reaches the goal or if it collides with the obstacle. We generate 20,000 scenarios with one run of
MAP-Elites to test the hindsight optimization and policy blending algorithms.
Analysis. Fig. 2.10 shows the archives generated for the policy blending and the hindsight optimization algorithms. For
each algorithm, we present two 2D archives, one with BCcollision = True and one with BCcollision = False. We
compare the archives of the two algorithms in terms of coverage and quality of the scenarios.
Coverage. The coverage was 47% for hindsight optimization, compared to 69% for policy blending. We observe
that the archive with BCcollision = True is heavily populated in policy blending (Fig. 2.10a). The archive shows
collisions even for a nearly optimal human; while both the human and robot inputs were collision free, sometimes they
Figure 2.11: Scenarios where the policy blending algorithm results in collision with an obstacle, approximated by a
sphere. (Left) While the human and robot trajectories are each collision-free, blending the two results in collision when
they point towards opposite sides of the obstacle. (Right) Blending with a very noisy human input results in collision.
pointed towards opposite sides of the obstacle, and blending the two resulted in collision (Fig. 2.11(left)). This result
matches the theoretical predictions [272] of unsafe trajectories in linear blending.
The BCcollision = True archive in Fig. 2.10a also shows that when the horizontal distance exceeds a threshold,
the robot can reach the goal with a nearly straight path, and collision occurs only when the human inputs are very noisy
(Fig. 2.11(right)).
On the other hand, hindsight optimization resulted in a nearly empty archive when BCcollision = True. The reason
is that it uses the human inputs as observations, and the robot’s motion is determined only by the robot’s policy; the few
sparse collisions are artifacts of the OpenRAVE environment, occurring because of infrequent lags in sending velocity
commands to the robot.
Scenario Assessment. When BCcollision = False, the average time to completion was 7.95s for policy blending
and 6.70s for hindsight optimization, indicating that policy blending took more time to finish the task. This result
matches previous human subjects experiments [147], where policy blending took longer than hindsight optimization,
and this was caused by two factors: (1) Often the human and robot inputs pointed in opposite directions, resulting in
velocities of small magnitudes when blended together. (2) The human inputs were proportional to the distance to the
next waypoint, spiking at the waypoints and decreasing until the next waypoint (Sec. 2.6.1). This resulted in parts of
the trajectory where the human inputs were of smaller magnitude than the robot’s, so blending the two gave smaller
velocity commands than treating the human inputs as observations. The red bars in hindsight optimization (Fig. 2.10b)
indicate timeouts; when the goal object was exactly behind the obstacle, the robot sometimes oscillated between taking
a grasping position and moving away from the sphere.
Figure 2.12: We reproduce the generated scenarios in the real world with actual joystick inputs.
2.10 Discussion
2.10.1 Experimental Findings.
We found that failure scenarios for hindsight optimization occur when the two goals are close to each other and the
human inputs are noisy, or when one goal is in front of the other. In the latter case, failure occurs even if the human
input is nearly optimal in minimizing the distance to the desired goal. The orientation of the grasp pose of the goal
objects also plays a role: if the robot’s end-effector needs to rotate significantly, this can cause deviations from
the optimal path, even when the human angular velocity commands are optimal. In all these cases, the robot becomes
over-confident about the wrong goal and gets “stuck” there.
An important factor is the linear decrease of the cost in the vicinity of the goal objects. When specifying the cost
function, it would be prudent to make the distance threshold for the linear decrease proportional to the distance between
the goal objects, rather than setting it to an absolute value.
Other potential measures to avoid the system’s overconfidence towards the wrong goal are: (1) including the
Shannon entropy with respect to all the goals in the cost function [149] to penalize actions that result in very high
confidence to one specific goal; (2) assigning a non-zero probability that the user changes their mind throughout the
task and switches goals [209, 144]. It would be interesting to investigate the effect of more “conservative” assistance on
subjective and objective metrics of the robot’s performance.
Finally, while linear policy blending naturally gives more control to the user and is preferred by users in simple
tasks [147], we empirically verified that it can generate unsafe trajectories, even if the individual human and robot
inputs are safe.
To show that the presented scenarios can occur in deployed systems, we reproduce them in the real world with actual inputs through a joystick interface (Fig. 2.12; video: https://youtu.be/2-JCO3dUHsA). We reproduced the scenarios from simulation by visually inspecting a replay of each simulated scenario and providing similar commands manually through a joystick interface. This shows that the scenarios are not artifacts of the simulation environment, as they can easily be replicated in the real world.
2.10.2 Stochasticity in Scenarios.
In our experiments the generated scenarios are deterministic. One may wish to simulate scenarios where there is
stochasticity in the robot’s actions, e.g., when the robot’s policy is stochastic, or in the environment dynamics, e.g.,
when there is uncertainty in action outcomes. A designer may also wish to test the system’s performance under a
stochastic human model, e.g., when human inputs are generated by a stochastic noisily rational human.
The most common approach in evolutionary optimization of noisy domains is explicit averaging, where we run
multiple trials of the same scenario and then retain an aggregate measure of the assessment estimate, e.g., we compute
the average to estimate the expected assessment E[f(θ, ϕ)] [233, 152]. We can follow the same process to estimate the
behavior characteristics [154]. To improve the efficiency of the estimation, previous work has also employed implicit
averaging methods, e.g., where the assessment of a scenario (θ, ϕ) is estimated by taking the assessments of previously
evaluated scenarios in the neighborhood of θ, ϕ into account, as well as adaptive sampling techniques, where the number
of trials increases over time as the quality of the solutions in the archive improves [154]. A recent variant of MAP-Elites
(Deep-Grid MAP-Elites) which updates a subpopulation of solutions for each cell in the behavior space has shown
significant benefits in sample efficiency [84]. We leave these exciting directions for future work.
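As a minimal illustration of explicit averaging, a stochastic scenario could be evaluated as follows; the simulate interface and trial count are placeholders, not our implementation.

```python
import numpy as np

def evaluate_stochastic(simulate, theta, phi, num_trials=10, seed=0):
    """Explicit averaging: run several trials of the same scenario (theta, phi)
    and return the mean assessment and mean behavior characteristics.

    `simulate` is assumed to return (assessment, bc_vector) for one stochastic
    roll-out of the scenario.
    """
    rng = np.random.default_rng(seed)
    assessments, bcs = [], []
    for _ in range(num_trials):
        f, bc = simulate(theta, phi, rng)
        assessments.append(f)
        bcs.append(bc)
    return float(np.mean(assessments)), np.mean(bcs, axis=0)

# Toy stand-in for a stochastic scenario evaluation.
def simulate(theta, phi, rng):
    noise = rng.normal(scale=0.5)
    return 10.0 + noise, np.array([abs(noise), 1.0])

print(evaluate_stochastic(simulate, theta=None, phi=None))
```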
2.10.3 Limitations.
An important challenge is how to effectively characterize the behavior spaces. The behavior characteristics (BCs) are
the measures along which we want diversity. We note that this is something that we may not know in advance. For
example, we selected distance between objects, and distance indeed has an effect, since there were no robot failures for
large distance values (see the blue regions in Fig. 2.3). The reason is that the hindsight optimization algorithm uses a
distance-based cost function to estimate the user’s goal, thus it is easier to disambiguate objects if they are further apart
(and vice versa).
But our experiment showed the unexpected and surprising edge case where the two objects are in column formation
and the human is nearly optimal. An interesting follow-up experiment would be to specify as a BC some metric of
object alignment in column formation and investigate further the effect of this variable. In general, a practitioner can
test the system with an initial design of BCs, observe the failure cases, create new BCs from newly observed insights
and test the system further.
While we have assumed bounded behavior spaces, the rationality coefficient does not meet this assumption, which
resulted in elites accumulating at the upper bound of the rationality dimension in the archive (Fig. 2.3). Adapting the boundaries of
the space dynamically based on the distribution of generated scenarios [92] could improve coverage in this case.
A related challenge is the selected resolution of the space; a very coarse behavior space may hide interesting
scenarios, while a very fine space is harder to cover and inspect after the search. Here we selected the resolution so that
the total size of the behavior space is a fraction of the total number of evaluations. For instance, the size of the space
for BC2 and BC3 in section 2.7 is 2500, while the number of evaluations is 10,000. This would allow an optimal QD
algorithm sufficient time to explore the entire space. While determining the resolution of the behavior space is an open
problem in quality diversity [113], integrating quality diversity with representation learning techniques holds much
promise [263].
We focused on how to effectively search the generative space of scenarios, but not on the scenario generation
methods themselves. Realism is an important future consideration; to generate complex, realistic environments, we
need to find compact representations of environments and search these representations. For instance, we can learn a
generative model of environments and then search the latent space of the model [88]. For the human actions, although
we generate human actions with certain rationality measures, there is no guarantee that real users would behave in a
similar manner. In human training, realism can be measured through a modified Turing test designed to require humans
to distinguish generated scenarios from human authored ones [191]. Alternatively, we could do a user study where we
place objects in the same locations as our failure scenarios and observe whether participants perform similar actions that
cause failures. Future work should also include more complex human models [176] that account for users’ perception
of robot capabilities and their trust in the robot.
2.10.4 Implications.
Finding failure scenarios of HRI algorithms will play a critical role in the adoption of these systems. We proposed
quality diversity as an approach for automatically generating scenarios that assess the performance of HRI algorithms
in the shared autonomy domain and we illustrated a path for future work. While real-world studies are essential in
evaluating complex HRI systems, thoroughly evaluating a human-robot interaction system in a very large number
of different environments and with thousands of different users is often financially and practically infeasible. As a
complement to real-world user studies, automatic scenario generation can facilitate understanding and tuning of the
tested algorithms, as well as provide insights on the experimental design of real-world studies, whose findings can in
turn inform the designer for testing the system further. Quality diversity algorithms can also be applied as test oracles
in verification systems [170], as well as in other domains where deployed robotic systems face a diverse range of
closed-loop interaction scenarios.
Chapter 3
Quality Diversity Algorithms as Optimization
3.1 Covariance Matrix Adaptation MAP-Elites
3.1.1 Introduction
We focus on the challenge of finding a diverse collection of quality solutions on complex continuous domains. Consider
the example application of generating strategies for a turn-based strategy game (e.g., chess, Go, Hearthstone). What
makes these games appealing to human players is not the presence of an optimal strategy, but the variety of fundamentally
different viable strategies. For example, “aggressive” strategies aim to end the game early through high-risk high-reward
play, while “controlling” strategies delay the end of the game by postponing the targeting of the game objectives [198].
These example strategies vary in one aspect of their behavior, measured by the number of turns the game lasts.
Finding fundamentally different strategies requires navigating a continuous domain, such as the parameter space
(weights) of a neural network that maps states to actions, while simultaneously exploring a diverse range of behaviors.
Quality Diversity algorithms, such as Novelty Search with Local Competition (NSLC) [179] and Multi-dimensional
Archive of Phenotypic Elites (MAP-Elites) [53], drive a divergent search for multiple good solutions, rather than a
convergent search towards a single optimum, through sophisticated archiving and mapping techniques. Solutions are
binned in an archive based on their behavior and compete only with others exhibiting similar behaviors. Such stratified
competition results in the discovery of potentially sub-optimal solutions called stepping stones, which have been shown
in some domains to be critical for escaping local optima [178, 96]. However, these algorithms typically require a large
number of evaluations to achieve good results.
Figure 3.1: Comparing Hearthstone Archives. Sample archives for both MAP-Elites and CMA-ME from the Hearthstone
experiment. Our new method, CMA-ME, both fills more cells in behavior space and finds higher quality policies to
play Hearthstone than MAP-Elites. Each grid cell is an elite (high-performing policy) and the intensity value represents
the win rate across 200 games against difficult opponents.
Meanwhile, when seeking a single global optimum, variants of the Covariance Matrix Adaptation Evolution Strategy
(CMA-ES) [127, 123] are among the best-performing derivative-free optimizers, with the capability to rapidly navigate
continuous spaces. However, the self-adaptation and cumulation techniques driving CMA-ES have yet to successfully
power a QD algorithm.
We propose a new hybrid algorithm called Covariance Matrix Adaptation MAP-Elites (CMA-ME), which rapidly
navigates and optimizes a continuous space with CMA-ES seeded by solutions stored in the MAP-Elites archive. The
hybrid algorithm employs CMA-ES’s ability to efficiently navigate continuous search spaces by maintaining a mixture
of normal distributions (candidate solutions) dynamically augmented by objective function feedback. The key insight of
CMA-ME is to leverage the selection and adaptation rules of CMA-ES to optimize good solutions, while also efficiently
exploring new areas of the search space.
Building on the underlying structure of MAP-Elites, CMA-ME maintains a population of modified CMA-ES
instances called emitters, which like CMA-ES operate based on a sampling mean, covariance matrix, and adaptation rules
that specify how the covariance matrix and mean are updated for each underlying normal distribution. The population
of emitters can therefore be thought of as a Gaussian mixture where each normal distribution focuses on improving
a different area of behavior space. This paper explores three types of candidate emitters with different selection and
adaptation rules for balancing quality and diversity, called the random direction, improvement, and optimizing emitters.
Solutions generated by the emitters are saved in a single unified archive based on their corresponding behaviors.
We evaluate CMA-ME through two experiments: a toy domain designed to highlight current limitations of QD in
continuous spaces and a practical turn-based strategy game domain, Hearthstone, which mirrors a common application
of QD: finding diverse agent policies. Hearthstone is an unsolved, partially observable game that poses significant
challenges to current AI methods [136]. Overall, the results of both experiments suggest CMA-ME is a competitive
alternative to MAP-Elites for exploring continuous domains (Fig. 3.1). The potential for improving QD’s growing
number of applications is significant as our approach greatly reduces the computation time required to generate a diverse
collection of high-quality solutions.
3.1.2 Background
This section outlines previous advancements in quality diversity (QD) including one of the first QD algorithms,
MAP-Elites, and background in CMA-ES to provide context for the CMA-ME algorithm proposed in this paper.
3.1.2.1 Quality Diversity (QD)
QD algorithms are often applied in domains where a diversity of good but meaningfully different solutions is valued.
For example QD algorithms can build large repertoires of robot behaviors [55, 56, 51] or a diversity of locomotive
gaits to help robots quickly respond to damage [53]. By interacting with AI agents, QD can also produce a diversity of
generated video game levels [165, 5, 113].
While traditional evolutionary algorithms speciate based on encoding and fitness, a key feature of the precursor
to QD (e.g., Novelty Search (NS) [180]) is speciation through behavioral diversity [180]. Rather than optimizing
for performance relative to a fitness function, searching directly for behavioral diversity promotes the discovery of
sub-optimal solutions relative to the objective function. Called stepping stones, these solutions mitigate premature
convergence to local optima. To promote intra-niche competition [229] objectives were reintroduced in the QD
algorithms Novelty Search with Local Competition (NSLC) [179] and MAP-Elites [53].
While NSLC and MAP-Elites share many key features necessary for maintaining a diversity of quality solutions,
a core difference between the two is whether the archive is dynamically or statically generated. NSLC dynamically
creates behavior niches by growing an archive of sufficiently novel solutions while MAP-Elites (detailed in the next
section) maintains a static mapping of behavior. For CMA-ME we choose MAP-Elites as our diversity mechanism to
directly compare benefits inherited from CMA-ES. Though we make this design choice for the CMA-ME algorithm
solely for comparability, the same principles can be applied to the NSLC archive.
3.1.2.2 MAP-Elites
While one core difference between two early QD algorithms NSLC [179] and MAP-Elites [53] is whether behaviors are
dynamically or statically mapped, another is the number of behavioral dimensions among which solutions typically vary.
Rather than defining a single distance measure to characterize and differentiate behaviors, MAP-Elites often searches
along at least two measures called behavior characteristics (BCs) that induce a Cartesian space (called a behavior
space). This behavior space is then tessellated into uniformly spaced grid cells, where the goal of the algorithm is to 1)
maximize the number of grid cells containing solutions and 2) maximize the quality of the best solution within each grid
cell. Modifications and improvements to MAP-Elites often focus on the tessellation of behavior space [274, 252, 92].
However, this paper proposes improvements to MAP-Elites based on the generation of solutions rather than the
tessellation of behavior space. At the start of the MAP-Elites algorithm, the archive (map) is initialized randomly
by solutions sampled uniformly from the search space. Each cell of the map contains at most one solution (i.e., an
elite), which is the highest performing solution in that behavioral niche. New solutions are generated by taking an elite
(selected uniformly at random) and perturbing it with Gaussian noise. MAP-Elites computes a behavior vector for each
new solution and assigns the new solution to a cell in the map. The solution replaces the elite in its respective cell if the
new solution has higher fitness, or the new solution simply fills the cell if the cell is empty.
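The following is a minimal sketch of the MAP-Elites loop described above, with a uniform grid over a two-dimensional behavior space; the objective, measures, and grid settings are placeholders rather than any of the domains used in this dissertation.

```python
import numpy as np

def map_elites(evaluate, dim, bounds, grid_shape, iterations, num_init=100, sigma=0.1, seed=0):
    """Minimal MAP-Elites: `evaluate(x)` returns (fitness, bc) with bc in [0, 1]^2."""
    rng = np.random.default_rng(seed)
    archive = {}  # grid cell -> (fitness, solution), i.e., one elite per cell

    def insert(x):
        fitness, bc = evaluate(x)
        cell = tuple((np.clip(bc, 0.0, 0.999999) * grid_shape).astype(int))
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, x)

    # Initialize the archive with solutions sampled uniformly from the search space.
    for _ in range(num_init):
        insert(rng.uniform(bounds[0], bounds[1], dim))
    # Main loop: perturb a uniformly chosen elite with isotropic Gaussian noise.
    for _ in range(iterations):
        keys = list(archive)
        _, parent = archive[keys[rng.integers(len(keys))]]
        insert(parent + rng.normal(scale=sigma, size=dim))
    return archive

# Example: maximize a (negated) sphere objective while diversifying over the first two coordinates.
def evaluate(x):
    return -np.sum(x**2), (np.array(x[:2]) + 1) / 2  # behavior mapped to [0, 1]^2

archive = map_elites(evaluate, dim=10, bounds=(-1, 1), grid_shape=np.array([20, 20]), iterations=5000)
print(len(archive), "cells occupied")
```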
3.1.2.3 CMA-ES
Evolution strategies (ES) are a family of evolutionary algorithms that specialize in optimizing continuous spaces by
sampling a population of solutions, called a generation, and gradually moving the population toward areas of highest
fitness. One canonical type of ES is the (µ/µ, λ)-ES, where a population of λ sample solutions is generated, then the
fittest µ solutions are selected to generate new samples in the next generation. The (µ/µ, λ)-ES recombines the µ best
samples through a weighted average into one mean that represents the center of the population distribution of the next
generation. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a particular type of this canonical
ES, which is one of the most competitive derivative-free optimizers for single-objective optimization of continuous
spaces [124].
CMA-ES models the sampling distribution of the population as a multivariate normal distribution N (m, C) where
m is the distribution mean and C is its covariance matrix. The main mechanisms steering CMA-ES are the selection and
ranking of the µ fittest solutions, which update the next generation’s sampling distribution, N (m, C). CMA-ES
maintains a history of aggregate changes to m called an evolution path, which provides benefits to search that are
similar to momentum in stochastic gradient descent.
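To make the selection and recombination mechanics concrete, the sketch below implements a simplified (µ/µ, λ)-ES generation with CMA-ES-style logarithmic recombination weights; the covariance matrix, step size, and evolution paths are held fixed here, so this is an illustration of ranking and recombination rather than full CMA-ES.

```python
import numpy as np

def es_generation(f, m, C, sigma, lam=20, rng=np.random.default_rng(0)):
    """One simplified (mu/mu, lambda)-ES generation (no covariance or step-size adaptation)."""
    mu = lam // 2
    # Log-weighted recombination weights, as commonly used in CMA-ES.
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    # Sample lambda candidates from N(m, sigma^2 * C).
    X = rng.multivariate_normal(m, sigma**2 * C, size=lam)
    # Rank by fitness (here: maximize f) and recombine the mu best into the new mean.
    best = X[np.argsort([-f(x) for x in X])[:mu]]
    return w @ best

m = np.zeros(5)
for _ in range(50):
    m = es_generation(lambda x: -np.sum((x - 2.0)**2), m, np.eye(5), sigma=0.3)
print(m)  # should approach the optimum at 2.0 in each coordinate
```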
3.1.2.4 Related Work
It is important to note that QD methods differ both from diversity maintenance methods and from multi-objective
optimization algorithms, in that QD methods search for solutions that exhibit different behaviors, which in our strategy
game example would be number of turns, rather than searching for diversity in parameter space (neural network weights
that induce a strategy). For example, consider the case of two different sets of network weights exhibiting similar play
styles. A diversity maintenance method, such as niching or speciation would consider them different species, while QD
would treat them as similar with respect to the exhibited behavior, forcing intra-niche competition. Several versions of
CMA-ES exist that incorporate niching or speciation [247, 228].
Multi-objective search could also be applied to our strategy game. By treating a behavior characteristic (game
length) as an additional objective, we aim to maximize or minimize the average game length in addition to maximizing
win rate. However, without any insight into our strategy game, it is unclear whether we should maximize or minimize
game length. Quality diversity algorithms differ from multi-objective search by seeking solutions across the whole
spectrum of this measure, rather than only at the extremes.
Several previous works explored incorporating ideas from a simplified ES. For example, prior work [47] introduced
novelty seeking to a (µ/µ, λ)-ES. However, their ES does not leverage adaptation and perturbs solutions through
static multivariate Gaussian noise. Prior work [213] dynamically mutates solutions in MAP-Elites, but globally adapts σ
(mutation power) for all search space variables rather than adapting the covariances between search space variables.
Prior work [275] exploited correlations between elites in MAP-Elites, proposing a variation operator that accelerates the
MAP-Elites algorithm. Their approach exploits covariances between elites rather than drawing insights from CMA-ES
to model successful evolution steps as a covariance matrix.
3.1.3 Approach: The CMA-ME Algorithm
Through advanced mechanisms like step-size adaptation and evolution paths, CMA-ES can quickly converge to a
single optimum. The potential for CMA-ES to refine the best solutions discovered by MAP-Elites is clear. However,
the key insight making CMA-ME possible is that by repurposing these mechanisms from CMA-ES we can improve
the exploration capabilities of MAP-Elites. Notably, CMA-ES can efficiently both expand and contract the search
distribution to explore larger regions of the search space and stretch the normal distribution within the search space
to find hard to reach nooks and crannies within the behavior space. In CMA-ME we create a population of modified
CMA-ES instances called emitters that perform search with feedback gained from interacting with the archive.
At a high-level, CMA-ME is a scheduling algorithm for the population of emitters. Solutions are generated in search
space in a round-robin fashion, where each emitter generates the same number of solutions (see Alg. 4). The solutions are generated in the same way for all emitters by sampling from the distribution N (m, C) (see generate solution in Alg. 4). The procedure return solution is specific to each type of emitter used by CMA-ME and is responsible for
adapting the sampling distribution and maintaining the sampled population.
Algorithm 4 Covariance Matrix Adaptation MAP-Elites
CMA-ME (evaluate, n)
input: An evaluation function evaluate which computes a behavior characterization and fitness, and a desired number of solutions n.
result: Generate n solutions, storing elites in a map M.
1  Initialize population of emitters E
2  for i ← 1 to n do
3      Select emitter e from E which has generated the least solutions out of all emitters in E
4      xi ← generate_solution(e)
5      βi, fitness ← evaluate(xi)
6      return_solution(e, xi, βi, fitness)
7  end
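In Python, the scheduler of Alg. 4 amounts to a short loop; the emitter interface below (generate_solution, return_solution) is a hypothetical sketch of the structure, with the emitter internals handled per emitter type as described in the next section.

```python
def cma_me(evaluate, emitters, num_solutions):
    """Round-robin scheduler over a population of emitters (cf. Alg. 4).

    Each emitter is assumed to expose two methods (names are illustrative):
      generate_solution() -> x           # sample from the emitter's N(m, C)
      return_solution(x, bc, fitness)    # update the shared archive and adapt
    `evaluate(x)` returns (bc, fitness) for a candidate solution x.
    """
    counts = {id(e): 0 for e in emitters}
    for _ in range(num_solutions):
        # Pick the emitter that has generated the fewest solutions so far.
        emitter = min(emitters, key=lambda e: counts[id(e)])
        x = emitter.generate_solution()
        bc, fitness = evaluate(x)
        emitter.return_solution(x, bc, fitness)
        counts[id(emitter)] += 1
```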
3.1.3.1 CMA-ME Emitters
To understand how emitters differ from CMA-ES instances, consider that the covariance matrix in CMA-ES models a
distribution of possible search directions within the search space. The distribution captures the most likely direction of
the next evolution step where fitness increase will be observed. Unlike estimation of multivariate normal algorithms
(EMNAs) that increase the likelihood of reobserving the previous best individuals (see Figure 3 of prior work [123]),
CMA-ES increases the likelihood of successful future evolution steps (steps that increase fitness). Emitters differ from
CMA-ES by adjusting the ranking rules that form the covariance matrix update to maximize the likelihood that future
steps in a given direction result in archive improvements. However, there are many ways to rank, leading to many
possible types of emitters.
We propose three types of emitters: optimizing, random direction, and improvement. Like CMA-ES, described in
Section 3.1.2.3, each emitter maintains a sampling mean m, a covariance matrix C, and a parameter set P that contains
additional CMA-ES related parameters (e.g., evolution path). However, while CMA-ES restarts its search based on the
best current solution, emitters are differentiated by their rules for restarting and adapting the sampling distribution, as
well as for selecting and ranking solutions.
We explore optimizing emitters to answer the question: are restarts alone enough to promote good exploration in
CMA-ME as they are in multi-modal methods? An optimizing emitter is almost identical to CMA-ES, differing only in
that restart means are chosen from the location of an elite rather than the fittest solution discovered so far. The random
direction and improvement emitters are described below in more detail.
To intuitively understand the random direction emitters, imagine trying to solve the classic QD maze domain in
the dark [180, 178]. Starting from an initial position in the behavior space, random direction emitters travel in a given
direction until hitting a wall, at which point they restart search and move in a new direction. While good solutions to the
maze and other low-dimensional behavior spaces can be found by random walks, the black-box nature of the forward
mapping from search space to behavior makes the inverse mapping equally opaque. Random direction emitters are
designed to estimate the inverse mapping of this correspondence problem.
When a random direction emitter restarts, it emulates a step in a random walk by selecting a random direction or
bias vector vβ to move toward in behavior space. To build a covariance matrix such that it biases in direction vβ at each
generation, solutions in search space are mapped to behavior space (βi). The mean (mβ) of all λ solutions is calculated in behavior space, along with each solution’s direction with respect to that mean. Only solutions that improve the archive are
then ranked by their projection value against the line mβ + vβt. If none of these solutions improve the archive, the
emitter restarts from a randomly chosen elite with a new bias vector vβ.
Interestingly, because random emitters choose the bias vector or direction in behavior space to move toward at the
beginning of a restart, they necessarily ignore areas of high fitness that they find along the way. Instead, while exploring,
improvement emitters exploit the areas of high fitness by ranking solutions based on the improvement or change in
fitness within each niche. When determining the amount of improvement, solutions filling empty cells are prioritized
over those replacing existing solutions in their niche by ranking them higher in the covariance matrix update. Rather
than exploring a fixed direction for the duration of the run, the advantage of improvement emitters is that they fluidly
adjust their goals based on where progress is currently being made.
Algorithm 5 shows the implementation of return_solution from Algorithm 4 for an improvement emitter. Each solution xi that has been generated by the emitter maps to a behavior βi and a cell M[βi] in the map. If the cell is empty (line 2), or if xi has higher fitness than the existing solution in the cell (line 6), xi is added to the new generation’s parents and the map is updated. The process repeats until the generated population reaches size λ (line 9), where we adapt the emitter: If we have found parents that improved the map, we rank them (line 11) by concatenating two groups: first the group of parents that discovered new cells in the map, sorted by their fitness, and second the group of parents that improved existing cells, sorted by the increase in fitness over the previous solution that occupied that cell. If we have not found any solutions that improve the map, we restart the emitter (line 15).
Algorithm 5 An improvement emitter’s return_solution.
return_solution (e, xi, βi, fitness)
input: An improvement emitter e, evaluated solution xi, behavior vector βi, and fitness.
result: The shared archive M is updated by solution xi. If λ individuals have been generated since the last adaptation, adapt the sampling distribution N (m, C) of e towards the behavioral regions of largest improvement.
1   Unpack the parents, sampling mean m, covariance matrix C, and parameter set P from e
2   if M[βi] is empty then
3       ∆i ← fitness
4       Flag that xi discovered a new cell
5       Add xi to parents
6   else if xi improves M[βi] then
7       ∆i ← fitness − M[βi].fitness
8       Add xi to parents
9   if sampled population is size λ then
10      if parents ≠ ∅ then
11          Sort parents by (newCell, ∆i)
12          Update m, C, and P by parents
13          parents ← ∅
14      else
15          Restart from a random elite in M
16      end
17  end
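A Python sketch of the bookkeeping in Alg. 5 is shown below. The CMA-ES distribution update and the grid settings are reduced to placeholders (update_distribution simply recenters the mean), so the sketch illustrates the archive updates and the ranking of parents rather than the full adaptation mechanics.

```python
import numpy as np

class ImprovementEmitter:
    """Sketch of an improvement emitter's return_solution (cf. Alg. 5)."""

    def __init__(self, archive, x0, lam=37, grid_shape=(100, 100)):
        self.archive = archive          # shared dict: cell -> (fitness, solution)
        self.m = np.array(x0, float)    # sampling mean
        self.lam = lam
        self.grid_shape = np.array(grid_shape)
        self.parents = []
        self.num_sampled = 0

    def cell_of(self, bc):
        return tuple((np.clip(bc, 0.0, 0.999999) * self.grid_shape).astype(int))

    def update_distribution(self, ranked_parents):
        # Placeholder for the CMA-ES update of m, C, and the evolution paths.
        self.m = np.mean(ranked_parents, axis=0)

    def restart_from_random_elite(self):
        elites = list(self.archive.values())
        self.m = np.array(elites[np.random.randint(len(elites))][1], float)

    def return_solution(self, x, bc, fitness):
        cell = self.cell_of(bc)
        occupant = self.archive.get(cell)
        if occupant is None:
            self.parents.append((True, fitness, x))                 # discovered a new cell
            self.archive[cell] = (fitness, x)
        elif fitness > occupant[0]:
            self.parents.append((False, fitness - occupant[0], x))  # improved an existing cell
            self.archive[cell] = (fitness, x)

        self.num_sampled += 1
        if self.num_sampled >= self.lam:                 # one full generation evaluated
            if self.parents:
                # New-cell parents rank above improvements; ties broken by Delta_i.
                self.parents.sort(key=lambda p: (p[0], p[1]), reverse=True)
                self.update_distribution([p[2] for p in self.parents])
                self.parents = []
            else:
                self.restart_from_random_elite()
            self.num_sampled = 0
```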
3.1.4 Toy Domain
Previous work on applying QD to Hearthstone deck search [92] observed highly distorted behavior spaces. However, to
our knowledge, no previous work involving standard benchmark domains observes distortions in behavior space. The
goal of the toy domain is to create the simplest domain with high degrees of behavior space distortions. Surprisingly,
a linear projection from a high-dimensional search space to a low-dimensional behavior space highly distorts the
distribution of solutions in behavior space. This section details experiments in our toy domain designed to measure the
effects of distortions on MAP-Elites. We hypothesize the most likely benefit of CMA-ME is the ability to overcome
distortions in behavior space and this experiment measures the benefits of an adaptable QD algorithm over using a fixed
mutation power.
3.1.4.1 Distorted Behavior Spaces
Previous work in QD measures the effects of different types of behavior characteristics (BCs) on QD performance,
such as BC alignment with the objective [230, 229, 82]. We highlight a different limiting effect on performance: the
distortion caused by dimensionality reduction from search space to behavior space. While any dimensionality reduction
can increase the difficulty of covering behavior space, we demonstrate that exploring the behavior space formed from a
simple linear projection from a high-dimensional search space results in significant difficulty in standard MAP-Elites.
Figure 3.2: A Bates distribution (shown for n = 1, 2, 5, and 20) demonstrating the narrowing property of behavior spaces formed by a linear projection.
Specifically, in the case of linear projection, each BC depends on every parameter of the search space. In the case
when all projection coefficients are equal, each parameter contributes equally to the corresponding BC. By being equally
dependent on each parameter, a QD algorithm needs to navigate every parameter to reach extremes in the behavior
space instead of only a subset of the parameters.
This is shown by uniformly sampling from the search space and projecting the samples to behavior vectors, where
each component is the sum of n uniform random variables. When divided by n (to normalize the BC to the range
[0, 1]), the sampling results in the Bates distribution shown in Fig. 3.2. As the dimensions of the search space grow, the
distribution of the behavior space narrows, making it harder to find behaviors in the tails of the distribution.
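This narrowing is easy to reproduce empirically; the snippet below samples the normalized sum of n uniform variables and reports how quickly the mass in the tails vanishes as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (1, 2, 5, 20, 100):
    # Each behavior characteristic is the mean of n uniform variables (a Bates distribution).
    bc = rng.uniform(0.0, 1.0, size=(100_000, n)).mean(axis=1)
    # As n grows the distribution concentrates around 0.5, so the tails of the
    # behavior space become exponentially harder to reach by uniform sampling.
    print(f"n={n:4d}  std={bc.std():.3f}  frac(|bc-0.5|>0.4)={np.mean(np.abs(bc - 0.5) > 0.4):.5f}")
```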
We hypothesize that the adaptation mechanisms of CMA-ME will better cover this behavior space when compared
to MAP-Elites, since CMA-ME can adapt each parameter with a separate variance, rather than with a fixed global
mutation rate. Additionally, the final goal of this experiment is to explore the performance of these algorithms in a
distributed setting, therefore we choose parameters that allow for parallelization of the evaluation.
3.1.4.2 Experiments
To show the benefit of covariance matrix adaptation in CMA-ME, we compare the performance of MAP-Elites, CMA-ES, and CMA-ME when performing dimensionality reduction from the search space to behavior space. We additionally compare against the recently proposed line-mutation operator for MAP-Elites [275], which we call ME (line). As QD
Table 3.1: Sphere Function Results

|              | n = 20      |                |            | n = 100     |                |            |
| Algorithm    | Max Fitness | Cells Occupied | QD-Score   | Max Fitness | Cells Occupied | QD-Score   |
| CMA-ES       | 100         | 3.46 %         | 731,613    | 100         | 3.74 %         | 725,013    |
| MAP-Elites   | 99.596      | 56.22 %        | 11,386,641 | 96.153      | 26.97 %        | 5,578,919  |
| ME (line)    | 99.920      | 68.99 %        | 13,714,405 | 98.021      | 31.75 %        | 6,691,582  |
| CMA-ME (opt) | 100         | 12.53 %        | 2,573,157  | 100         | 2.70 %         | 654,649    |
| CMA-ME (rd)  | 98.092      | 90.32 %        | 13,651,537 | 96.731      | 77.12 %        | 13,465,879 |
| CMA-ME (imp) | 99.932      | 87.75 %        | 16,875,583 | 99.597      | 61.98 %        | 12,542,848 |
Table 3.2: Rastrigin Function Results

|              | n = 20      |                |            | n = 100     |                |            |
| Algorithm    | Max Fitness | Cells Occupied | QD-Score   | Max Fitness | Cells Occupied | QD-Score   |
| CMA-ES       | 99.982      | 4.17 %         | 818,090    | 99.886      | 3.64 %         | 660,037    |
| MAP-Elites   | 90.673      | 55.70 %        | 9,340,327  | 81.089      | 26.51 %        | 4,388,839  |
| ME (line)    | 92.700      | 58.25 %        | 9,891,199  | 84.855      | 27.72 %        | 4,835,294  |
| CMA-ME (opt) | 99.559      | 8.63 %         | 1,865,910  | 98.159      | 3.23 %         | 676,999    |
| CMA-ME (rd)  | 91.084      | 87.74 %        | 10,229,537 | 90.801      | 74.13 %        | 10,130,091 |
| CMA-ME (imp) | 96.358      | 83.42 %        | 14,156,185 | 86.876      | 60.72 %        | 9,804,991  |
algorithms require a solution quality measure, we include two functions from the continuous black-box optimization set
of benchmarks [123, 124] as objectives.
Our two objective functions are of the form f : R^n → R: a sphere shown in Eq. 3.1 and the Rastrigin function shown in Eq. 3.2. The optimal fitness of 0 in these functions is obtained by xi = 0. To avoid having the optimal value at the center of the search space, we offset the fitness function so that, without loss of generality, the optimal location is xi = 5.12 · 0.4 = 2.048 (note that [−5.12, 5.12] is the typical valid domain of the Rastrigin function).

sphere(x) = \sum_{i=1}^{n} x_i^2          (3.1)

rastrigin(x) = 10n + \sum_{i=1}^{n} \left[ x_i^2 − 10 \cos(2\pi x_i) \right]          (3.2)
Behavior characteristics are formed by a linear projection from R^n to R^2, and the behavior space is bounded through a clip function (Eq. 3.3) that restricts the contribution of each component x_i to the range [−5.12, 5.12] (the typical domain of the constrained Rastrigin function). To ensure that the behavior space is equally dependent on each component of the search space (i.e., R^n), we assign equal weights to each component. The function p : R^n → R^2 formalizes the projection from the search space R^n to the behavior space R^2 (see Eq. 3.4), by computing the sum of the first half of components from R^n and the sum of the second half of components from R^n.

clip(x_i) = \begin{cases} x_i & \text{if } −5.12 ≤ x_i ≤ 5.12 \\ 5.12 / x_i & \text{otherwise} \end{cases}          (3.3)

p(x) = \left( \sum_{i=1}^{\lfloor n/2 \rfloor} clip(x_i), \; \sum_{i=\lfloor n/2 \rfloor + 1}^{n} clip(x_i) \right)          (3.4)
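For reference, Eqs. 3.1–3.4 translate directly into code; the sketch below omits the fitness offset and the normalization to [0, 100] described later in this section.

```python
import numpy as np

def sphere(x):                      # Eq. 3.1
    return np.sum(np.asarray(x, dtype=float) ** 2)

def rastrigin(x):                   # Eq. 3.2
    x = np.asarray(x, dtype=float)
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def clip(x):                        # Eq. 3.3, applied componentwise
    x = np.asarray(x, dtype=float)
    out = x.copy()
    mask = np.abs(x) > 5.12
    out[mask] = 5.12 / x[mask]
    return out

def projection(x):                  # Eq. 3.4: the two behavior characteristics
    c = clip(x)
    half = len(c) // 2
    return np.array([np.sum(c[:half]), np.sum(c[half:])])

x = np.full(20, 2.048)  # the offset optimum used in the experiments
print(sphere(x), rastrigin(x), projection(x))
```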
We compare MAP-Elites, ME (line), CMA-ES, and CMA-ME on the toy domain, running CMA-ME three times,
once for each emitter type. We run each algorithm for 2.5M evaluations. We set λ = 500 for CMA-ES, whereas a
single run of CMA-ME deploys 15 emitters of the same type with λ = 37 *. The map resolution is 500 × 500.
Algorithms are compared by the QD-score metric proposed by previous work [230], which in MAP-Elites and
CMA-ME is the sum of fitness values of all elites in the map. To compute the QD-score of CMA-ES, solutions are
assigned a grid location based on what their BCs would have been and are used to populate a pseudo-archive. Since QD-score assumes
maximizing test functions with non-negative values, fitness is normalized to the range [0, 100], where 100 is optimal.
Since we can analytically compute the boundaries of the space from the linear projection, we also show the percentage
of possible cells occupied by each algorithm.
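A sketch of these metrics is shown below, assuming the archive is a mapping from grid cells to normalized elite fitness values; the normalization helper is one possible choice and is not taken verbatim from our setup.

```python
def qd_score(archive):
    """QD-score: the sum of the (normalized, non-negative) fitness values of all elites."""
    return sum(archive.values())

def coverage(archive, num_reachable_cells):
    """Percentage of reachable cells occupied by an elite."""
    return 100.0 * len(archive) / num_reachable_cells

def normalize(raw_value, worst):
    """One possible normalization of a minimization objective to [0, 100], 100 being optimal."""
    return 100.0 * (1.0 - raw_value / worst)

# Example: a tiny archive mapping grid cells to normalized fitness values.
archive = {(0, 0): 90.0, (3, 7): 55.5, (10, 2): 99.9}
print(qd_score(archive), coverage(archive, num_reachable_cells=250_000))
```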
MAP-Elites perturbs solutions with Gaussian noise scaled by a factor σ named mutation power. Previous work
[212, 72] shows that varying σ can greatly affect both the precision and coverage of MAP-Elites. To account for
this and obtain the best performance of MAP-Elites on the toy domain, we ran a grid search to measure MAP-Elites
performance across 101 values of σ uniformly distributed across [0.1, 1.1], and we selected the value with the best
coverage, σ = 0.5, for all the experiments. Since CMA-ES and CMA-ME adapt their sampling distribution, we did not
tune any parameters, but we set the initial value of their mutation power also to σ = 0.5.
3.1.4.3 Results
We label CMA-ME (opt), CMA-ME (rd), and CMA-ME (imp) for the optimizing, random direction, and improvement
emitters, respectively. Tables 3.1 and 3.2 show the results of the sphere and Rastrigin function experiments. CMA-ES
*Both the Toy and Hearthstone domains use the same λ and emitter count.
outperforms all other algorithms in obtaining the optimal fitness. As predicted, covering the behavior space becomes
harder as the dimensions of the search space grow. CMA-ME (rd) and CMA-ME (imp) obtain the highest QD-Scores
and fill the largest number of unique cells in the behavior space. This is because of the ability of the CMA-ME
emitters to efficiently discover new cells. Notably, CMA-ME (opt) fails to keep up with CMA-ES, performing worse in
maximum fitness. We find this result consistent with the literature on multi-modal CMA-ES [125], since CMA-ES
moves with a single large population that has global optimization properties, while CMA-ME (opt) has a collection of
smaller populations with similar behavior.
Figure 3.3: Improvement in QD-Score over evaluations for the Sphere Function n = 100.
Figure 3.4: Improvement in QD-Score over evaluations for the Rastrigin Function n = 100.
Fig. 3.5 and Fig. 3.6 show the distribution of elites by their fitness. CMA-ME (rd) achieves a high QD-score by
retaining a large number of relatively low-quality solutions, while CMA-ME (imp) fills slightly fewer cells but of higher
Figure 3.5: The distribution of elites scaled by the number of occupied cells to show the relative makeup of elites within
each archives for the Sphere Function n = 100.
Figure 3.6: The distribution of elites scaled by the number of occupied cells to show the relative makeup of elites within
each archives for the Rastrigin Function n = 100.
quality. In the Hearthstone experiment (Section 3.1.5) we use improvement emitters, since we are more interested in
elites of high quality than the number of cells filled in the resulting collection.
3.1.5 Hearthstone Domain
We aim to discover a collection of high quality strategies for playing the Hearthstone game against a set of fixed
opponents. We wish to explore how differently the game can be played rather than just how well the game can be played.
We selected this domain, since its behavior space has distortions similar to those described in section 3.1.4 and observed
in previous work [92], which make exploration of the behavior space particularly challenging.
Figure 3.7: Hearthstone Results The distribution of elites by win rate. Each distribution is scaled by the number of
occupied cells to show the relative makeup of elites within each archive.
Hearthstone [73] is a two-player, turn-taking adversarial online collectable card game that is an increasingly popular
domain for evaluating both classical AI techniques and modern deep reinforcement learning approaches due to the
many unique challenges it poses (e.g., large branching factor, partial observability, stochastic actions, and difficulty with
planning under uncertainty) [136]. In Hearthstone, players construct a deck of exactly thirty cards, which they place on a board shared with their opponent. The game’s objective is to reduce your opponent’s health to zero by attacking your
opponent directly using cards in play. Decks are constrained by one of nine possible hero classes, where each hero class
can access different cards and abilities.
Rather than manipulating the reward function of individual agents in a QD system [207, 10] (like the QD approach
in AlphaStar [279]), generating the best gameplay strategy or deck [103, 264, 20], or searching for different decks with
a fixed strategy [92], our experiments search for a diversity of strategies for playing an expert-level human-constructed
deck.
3.1.5.1 Experiments
This section details the Hearthstone simulator, agent policy and deck, and our opponents’ policies and decks.
SabberStone Simulator: SabberStone [58] is a Hearthstone simulator that replicates the rules of Hearthstone and
uses the card definitions publicly provided by Blizzard. In addition to simulating the game, SabberStone includes a
Figure 3.8: Hearthstone Results Improvement in QD-Score over evaluations.
Figure 3.9: Hearthstone Results Improvement in win rate over evaluations.
turn-local game tree search that searches possible action sequences that can be taken at a given turn. To implement
different strategies, users can create a heuristic scoring function that evaluates the state of the game. Included with the
simulator are standard “aggro and control” card-game strategies which we use in our opponent agents.
Our Deck: In Hearthstone there are subsets of cards that can only be used by specific “classes” of players. We selected
the class Rogue, where “cheap” cards playable in the beginning of the game can be valuable later on. Successful play
with the Rogue class requires long-term planning with sparse rewards. To our knowledge, our work is the first to create
a policy to play the Rogue class. We selected the Tempo Rogue archetype from the Hearthstone expansion Rise of
Shadows, which is a hard deck preferred by advanced players. While many variants of the Tempo Rogue archetype
exist, we decided to use the decklist from Hearthstone grandmaster Fei “ETC” Liang, who reached the number 1 ranking
with his list in May 2019 [183].
Opponents: Prior work [92] used game tree search to find decks and associated policies. They found six high
performing decks for the Paladin, Warlock, and Hunter classes playing aggro and control strategies; we use these as our
opponent suite.
Neural Network: We search for the parameters of a neural network that scores an observable game state based
on the cards played on the board and card-specific features. The network maps 15 evaluation features defined by
Sabberstone [58] to a scalar score. Prior work [50] shows that a six-neuron neural network trained with natural evolution strategies (NES) [106] can obtain competitive and sometimes state-of-the-art solutions on the Atari Learning
Environment (ALE) [18]. They separate feature extraction from decision-making, using vector quantization for feature
extraction and the six-neuron network for the decision making. Motivated by this work, we use a 26 node fully
connected feed-forward neural network with layer sizes [15, 5, 4, 1] (109 parameters).
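As an illustration, a flat 109-dimensional parameter vector (the search space for the QD algorithms) can be unpacked into this [15, 5, 4, 1] network as follows; the tanh activation is an assumption, since the activation function is not specified here.

```python
import numpy as np

LAYERS = [15, 5, 4, 1]  # 15*5+5 + 5*4+4 + 4*1+1 = 109 parameters

def unpack(params, layers=LAYERS):
    """Unpack a flat parameter vector into (weight, bias) pairs for each layer."""
    mats, i = [], 0
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        W = params[i:i + n_in * n_out].reshape(n_in, n_out); i += n_in * n_out
        b = params[i:i + n_out]; i += n_out
        mats.append((W, b))
    return mats

def score_state(features, params):
    """Score a 15-dimensional game-state feature vector with the feed-forward network.
    The tanh activation is an assumption; the thesis does not specify it here."""
    h = np.asarray(features, dtype=float)
    for W, b in unpack(np.asarray(params, dtype=float)):
        h = np.tanh(h @ W + b)
    return float(h[0])

rng = np.random.default_rng(0)
print(score_state(rng.normal(size=15), rng.normal(size=109)))
```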
3.1.5.2 Search Parameters and Tuning
We use the fitness function proposed in prior work [20, 92]: the average health difference between players at the end of
the game, as a smooth approximation of win rate. For MAP-Elites and CMA-ME, we characterize behavior by the
average hand size per turn and the average number of turns the game lasts. We choose these behavioral characteristics
to capture a spectrum of strategies between aggro decks, which try to end the game quickly, and control decks that
attempt to extend the game. The hand size dimension measures the ability of a strategy to generate new cards.
To tune MAP-Elites, we ran three experiments with σ values of 0.05, 0.3, and 0.8. MAP-Elites achieved the best
coverage and maximum win rate performance with σ = 0.05, and we used that as our mutation power. Our archive had a resolution of 100 × 100, where we set the range of behavior values using data from the Hearthstone player data
corpus [138]. As with the Toy Domain in section 3.1.4, CMA-ES and CMA-ME used the same hyperparameters as
MAP-Elites. For CMA-ME we ran all experiments using improvement emitters, since they had the most
desirable performance attributes in the toy domain.
3.1.5.3 Distributed Evaluation
We ran our experiments on a high-performance cluster with 500 (8 core) CPUs in a distributed setting, with a master
search node and 499 worker nodes. Each worker node is responsible for evaluating a single policy at a time and plays 200
games against our opponent suite. A single experimental trial evaluating 50,000 policies takes 12 hours. MAP-Elites
and ME (line) are run asynchronously on the master search node, while CMA-ME and CMA-ES synchronize after each
generation. We ran each algorithm for 5 trials and generated 50,000 candidate solutions per trial.
3.1.6 Results
Table 3.3 shows that CMA-ME outperforms both MAP-Elites and CMA-ES in maximum fitness, maximum win rate, the number of cells filled, and QD-Score. The distribution of elites for each algorithm shows that CMA-ME
finds elites in higher performing parts of the behavior space than both CMA-ES and MAP-Elites (see Fig. 3.7). The
sample archive for CMA-ME and MAP-Elites in Fig. 3.1 similarly illustrates that CMA-ME better covers the behavior
space and finds higher quality policies than MAP-Elites and ME (line).
Fig. 3.8 shows the increase in quality diversity over time, with CMA-ME more than doubling the QD-Score of
MAP-Elites. Fig. 3.9 shows the increase in win rate over time. CMA-ME maintains a higher win rate than CMA-ES,
MAP-Elites and ME (line) at all stages of evaluation. CMA-ES quickly converges to a single solution but is surpassed
by MAP-Elites later in the evaluation.
Table 3.3: Hearthstone Results

| Algorithm  | Maximum Fitness | Overall Win Rate | Cells Filled (All Solutions) | QD-Score (All Solutions) |
| CMA-ES     | -2.471          | 44.8 %           | 17.02 %                      | 16,024.8                 |
| MAP-Elites | -2.859          | 45.2 %           | 21.72 %                      | 25,936.0                 |
| ME (line)  | -0.252          | 50.5 %           | 22.30 %                      | 28,132.7                 |
| CMA-ME     | 5.417           | 65.9 %           | 29.17 %                      | 63,295.6                 |
3.1.7 Discussion
Both the toy domain and the Hearthstone domain share a property of the behavior space that challenges exploration
with standard MAP-Elites: some areas of the behavior space are hard to reach with random sampling even without the
presence of deception. Particularly in the toy domain, discovering solutions in the tails of the behavior space formed by
the Bates distribution described in Section 3.1.4.1 requires MAP-Elites to adapt its search distribution. While CMA-ES
finds individual solutions that outperform both QD algorithms, in all of the chosen domains it covers significantly less
area of the behavior space than any QD method.
Because CMA-ME covers (i.e., explores) more of the behavior space than MAP-Elites in the Hearthstone domain,
even though the mapping is likely non-linear and non-separable, effects similar to those shown with the Bates distribution
may be preventing MAP-Elites from discovering policies at the tails of behavior space, like the ends of the “average
number of turns” dimension (horizontal axis in Fig. 3.1). However, some behavioral characteristics like the “average
hand size” are fairly equally covered by the two algorithms. Therefore, when selecting an algorithm, if maximal
coverage is desired we recommend CMA-ME in part for its ability to efficiently explore a distorted behavior space.
While CMA-ES is often the algorithm of choice for solving problems in continuous optimization, Fig. 3.9 shows in
Hearthstone that CMA-ME and MAP-Elites better optimize for fitness (i.e., win rate). Although interestingly in the first
half of the search, Fig. 3.8 shows that CMA-ES has a higher QD-score than MAP-Elites, demonstrating its potential
for exploration. While in the second half of search this exploration is not sustained as CMA-ES begins to converge, it
is possible that in this domain the objective function leads CMA-ES into a deceptive trap [178]. The best strategies
discovered early are likely aggro, and by discovering these strategies early in the search, CMA-ES consistently
converges to the best aggro strategy instead of learning more complex control strategies.
While there are many possible emitter types, the CMA-ME algorithm proposes and explores three: optimizing,
random direction, and improvement emitters. Because of their limited map interaction and ranking purely by fitness,
optimizing emitters tend to explore a smaller area of behavior space. Random direction emitters alone can cover more of
the behavior space, but solution quality is higher when searching with improvement emitters (Fig. 3.1). Random direction emitters
additionally require a direction in behavior space, which may be impossible to define for certain domains [164]. If diversity is
more important than quality, we recommend random direction emitters, but alternatively recommend improvement
emitters if defining a random direction vector is challenging or if the application prioritizes quality over diversity. We
have only explored homogeneous emitter populations; we leave experimenting with heterogeneous populations of
emitters as an exciting topic for future work.
3.1.8 Conclusions
We presented a new algorithm called CMA-ME that combines the strengths of CMA-ES and MAP-Elites. Results from
both a toy domain and a policy search in Hearthstone show that leveraging strengths from CMA-ES can improve both
the coverage and quality of MAP-Elites solutions. Results from Hearthstone additionally show that when compared
to standard CMA-ES, the diversity components from MAP-Elites can improve the quality of solutions in continuous
search spaces. Overall, CMA-ME improves MAP-Elites by bringing modern optimization methods to quality-diversity
problems for the first time. By reconceptualizing optimization problems as quality-diversity problems, results from
CMA-ME suggest new opportunities for research and deployment, including any approach that learns policies encoded
as neural networks; complementing the objective function with behavioral characteristics can yield not only a useful
diversity of behavior, but also better performance on the original objective.
3.2 Differentiable Quality Diversity Optimization
3.2.1 Introduction
We introduce the problem of differentiable quality diversity (DQD) and propose the MAP-Elites via a Gradient
Arborescence (MEGA) algorithm as the first DQD algorithm.
Unlike single-objective optimization, quality diversity (QD) is the problem of finding a range of high quality
solutions that are diverse with respect to prespecified metrics. For example, consider the problem of generating realistic
images that match as closely as possible a target text prompt “Elon Musk”, but vary with respect to hair and eye color.
We can formulate the problem of searching the latent space of a generative adversarial network (GAN) [110] as a QD
problem of discovering latent codes that generate images maximizing a matching score for the prompt “Elon Musk”,
while achieving a diverse range of measures of hair and eye color, assessed by visual classification models [231]. More
generally, the QD objective is to maximize an objective f for each output combination of the measure functions mi. A QD
algorithm produces an archive of solutions, where the algorithm attempts to discover a representative for each measure
output combination whose f value is as large as possible.
While our example problem can be formulated as a QD problem, all current QD algorithms treat the objective f
and measure functions mi as a black box. This means, in our example problem, current QD algorithms fail to take
advantage of the fact that both f and mi are end-to-end differentiable neural networks. Our proposed differentiable
quality diversity (DQD) algorithms leverage first-order derivative information to significantly improve the computational
efficiency of solving a variety of QD problems where f and mi are differentiable.

Figure 3.10: An overview of the Covariance Matrix Adaptation MAP-Elites via a Gradient Arborescence (CMA-MEGA)
algorithm. The algorithm leverages a gradient arborescence to branch in objective-measure space, while dynamically
adapting the gradient steps to maximize a QD objective (Eq. 3.5).
To solve DQD, we introduce the concept of a gradient arborescence. Like gradient ascent, a gradient arborescence
makes greedy ascending steps based on the objective f. Unlike gradient ascent, a gradient arborescence encourages
exploration by branching via the measures mi. We adopt the term arborescence from the minimum arborescence
problem [44] in graph theory, a directed counterpart to the minimum spanning tree problem, to emphasize the
directedness of the branching search.
Our work makes four main contributions. 1) We introduce and formalize the problem of differentiable quality
diversity (DQD). 2) We propose two DQD algorithms: Objective and Measure Gradient MAP-Elites via a Gradient
Arborescence (OMG-MEGA), an algorithm based on MAP-Elites [53], which branches based on the measures mi
but ascends based on the objective function f; and Covariance Matrix Adaptation MEGA (CMA-MEGA), which is
based on the CMA-ME [91] algorithm and branches in objective-measure space but ascends based on
maximizing the QD objective (Fig. 3.10). Both algorithms search directly in measure space and leverage the gradients of
f and mi to form efficient parameter space steps in θ. 3) We show in three different QD domains (the linear projection,
the arm repertoire, and the latent space illumination (LSI) domains), that DQD algorithms significantly outperform
state-of-the-art QD algorithms that treat the objective and measure functions as a black box. 4) We demonstrate how
searching the latent space of a StyleGAN [158] in the LSI domain with CMA-MEGA results in a diverse range of
high-quality images.
3.2.2 Problem Definition
Quality Diversity. The quality diversity (QD) problem assumes an objective f : R^n → R in an n-dimensional
continuous space R^n and k measures mi : R^n → R or, as a joint measure, m : R^n → R^k. Let S = m(R^n) be the
measure space formed by the range of m. For each s ∈ S, the QD objective is to find a solution θ ∈ R^n such that
m(θ) = s and f(θ) is maximized.
However, we note that R^k is continuous, and an algorithm solving the quality diversity problem would require
infinite memory to store all solutions. Thus, QD algorithms in the MAP-Elites [202, 53] family approximate the
problem by discretizing S via a tessellation method. Let T be the tessellation of S into M cells. We relax the QD
objective to find a set of solutions θi, i ∈ {1, . . . , M}, such that each θi occupies one unique cell in T. The occupants
θi of all M cells form an archive of solutions. Each solution θi has a position in the archive m(θi), corresponding to
one out of M cells, and an objective value f(θi).
The objective of QD can be rewritten as follows, where the goal is to maximize the objective value for each cell in
the archive:

max Σ_{i=1}^{M} f(θi)    (3.5)
Differentiable Quality Diversity. We define the differentiable quality diversity (DQD) problem as a QD problem
where both the objective f and measures mi are first-order differentiable.
3.2.3 Preliminaries
We present several state-of-the-art derivative-free QD algorithms. Our proposed DQD algorithm MEGA builds upon
ideas from these works, while introducing measure and objective gradients into the optimization process.
MAP-Elites and MAP-Elites (line). MAP-Elites [53, 202] first tessellates the measure space S into evenly-spaced
grid cells. The upper and lower bounds for m are given as input to constrain S to a finite region. MAP-Elites first
samples solutions from a fixed distribution θ ∼ N (0, I), and populates an initial archive after computing f(θ) and
m(θ). Each iteration of MAP-Elites selects λ cells uniformly at random from the archive and perturbs each occupant θi
with fixed-variance σ isotropic Gaussian noise: θ′ = θi + σN (0, I). Each new candidate solution θ′ is then evaluated
and added to the archive if θ′ discovers a new cell or improves an existing cell. The algorithm continues to generate
solutions for a specified number of iterations.
Later work introduced the Iso+LineDD operator [275]. The Iso+LineDD operator samples two archive solutions θi and θj, then blends a Gaussian perturbation with a noisy interpolation given hyperparameters σ1 and σ2:
θ′ = θi + σ1N (0, I) + σ2N (0, 1)(θi − θj). In this paper we refer to MAP-Elites with an Iso+LineDD operator as
MAP-Elites (line).
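For illustration, the two variation operators above can be written in a few lines of NumPy; this is a minimal sketch, and the hyperparameter values are placeholders rather than the settings used in our experiments.

import numpy as np

def gaussian_operator(theta_i, sigma=0.1):
    # MAP-Elites: perturb an archive solution with fixed-variance isotropic Gaussian noise.
    return theta_i + sigma * np.random.standard_normal(theta_i.shape)

def iso_line_dd_operator(theta_i, theta_j, sigma1=0.1, sigma2=0.2):
    # MAP-Elites (line): blend isotropic noise with a noisy interpolation
    # between two archive solutions (the Iso+LineDD operator).
    iso = sigma1 * np.random.standard_normal(theta_i.shape)
    line = sigma2 * np.random.standard_normal() * (theta_i - theta_j)
    return theta_i + iso + line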
CMA-ME. Covariance Matrix Adaptation MAP-Elites (CMA-ME) [91] combines the archiving mechanisms of MAP-Elites with the adaptation mechanisms of CMA-ES [123]. While MAP-Elites creates new solutions by perturbing
existing solutions with fixed-variance Gaussian noise, CMA-ME maintains a full-rank Gaussian distribution N (µ, Σ)
in parameter space R^n. Each iteration of CMA-ME samples λ candidate solutions θi ∼ N (µ, Σ), evaluates each
solution, and updates the archive based on the following rule: if there is a previous occupant θp at the same cell, we
compute ∆i = f(θi) − f(θp); otherwise, if the cell is empty, we compute ∆i = f(θi). We then rank the sampled
solutions by their improvement ∆i, with the additional criterion that candidates discovering new cells are ranked higher
than candidates that improve existing cells. We then update N (µ, Σ) with the standard CMA-ES update rules based on
the improvement ranking. CMA-ME restarts when all λ solutions fail to change the archive. On a restart, we reset the
Gaussian to N (θi, I), where θi is an archive solution chosen uniformly at random, and reset all internal CMA-ES parameters.
In the supplemental material, we derive, for the first time, a natural gradient interpretation of CMA-ME’s improvement
ranking mechanism, based on previous theoretical work on CMA-ES [3].
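The improvement computation and two-stage ranking can be summarized with the following minimal sketch; the dictionary-based archive is a simplification of the tessellated archive and is only meant to illustrate the ranking key, not the full CMA-ME update.

def improvement_and_key(archive, cell, f_new):
    # archive maps a cell index to the objective value of its incumbent elite.
    if cell not in archive:
        # Candidates that discover a new cell rank above all improving candidates.
        return f_new, (1, f_new)
    delta = f_new - archive[cell]
    return delta, (0, delta)

def rank_candidates(archive, cells, objectives):
    # Return candidate indices ordered best first for the CMA-ES update.
    keys = [improvement_and_key(archive, c, f)[1] for c, f in zip(cells, objectives)]
    return sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)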
3.2.4 Algorithms
We present two variants of our proposed MEGA algorithm: OMG-MEGA and CMA-MEGA. We form each variant by
adapting the concept of a gradient arborescence to the MAP-Elites and CMA-ME algorithms, respectively. Finally, we
introduce two additional baseline algorithms, OG-MAP-Elites and OG-MAP-Elites (line), which operate only on the
gradients of the objective.
OMG-MEGA. We first derive the Objective and Measure Gradient MAP-Elites via Gradient Arborescence (OMG-MEGA) algorithm from MAP-Elites.
First, we observe how gradient information could benefit a QD algorithm. Note that the QD objective is to explore
the measure space, while maximizing the objective function f. We observe that maximizing a linear combination of
measures Σ_{j=1}^{k} cj mj(θ), where c is a k-dimensional vector of coefficients, enables movement in a k-dimensional
measure space. Adding the objective function f to the linear sum enables movement in an objective-measure space.
Maximizing g with a positive coefficient of f results in steps that increase f, while the direction of movement in the
measure space is determined by the sign and magnitude of the coefficients cj.
g(θ) = |c0| f(θ) + Σ_{j=1}^{k} cj mj(θ)    (3.6)
We can then derive a direction function that perturbs a given solution θ based on the gradient of our linear
combination g: ∇g(θ) = |c0| ∇f(θ) + Σ_{j=1}^{k} cj ∇mj(θ).
We incorporate the direction function ∇g to derive a gradient-based MAP-Elites variation operator. We observe
that MAP-Elites samples a cell θi and perturbs the occupant with Gaussian noise: θ′ = θi + σN (0, I). Instead, we
sample coefficients c ∼ N (0, σgI) and step:

θ′ = θi + |c0| ∇f(θi) + Σ_{j=1}^{k} cj ∇mj(θi)    (3.7)
The value σg acts as a learning rate for the gradient step, because it controls the scale of the coefficients c ∼
N (0, σgI). To balance the contribution of each function, we normalize all gradients. Other than our new gradient-based
operator, OMG-MEGA is identical to MAP-Elites.
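A minimal NumPy sketch of this gradient-based variation operator follows; the small constant added to the norms is an assumption to guard against division by zero and is not part of the formal definition.

import numpy as np

def omg_mega_step(theta_i, grad_f, grad_m, sigma_g=1.0):
    # grad_f: objective gradient at theta_i; grad_m: (k, n) matrix of measure gradients.
    # Normalize all gradients so no single function dominates the step.
    grad_f = grad_f / (np.linalg.norm(grad_f) + 1e-12)
    grad_m = grad_m / (np.linalg.norm(grad_m, axis=1, keepdims=True) + 1e-12)
    # Sample coefficients c ~ N(0, sigma_g * I); the objective coefficient is |c0|.
    c = np.sqrt(sigma_g) * np.random.standard_normal(grad_m.shape[0] + 1)
    return theta_i + abs(c[0]) * grad_f + grad_m.T @ c[1:]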
CMA-MEGA. Next, we derive the Covariance Matrix Adaptation MAP-Elites via a Gradient Arborescence (CMA-MEGA) algorithm from CMA-ME. Fig. 3.10 shows an overview of the algorithm.
First, we note that we sample c in OMG-MEGA from a fixed-variance Gaussian. However, it would be beneficial to
select c based on how c, and the subsequent gradient step on θ, improve the QD objective defined in equation 3.5.
We frame the selection of c as an optimization problem with the objective of maximizing the QD objective (Eq. 3.5).
We model a distribution of coefficients c as a (k + 1)-dimensional Gaussian N (µ, Σ). Given a θ, we can sample
c ∼ N (µ, Σ), compute θ′ via Eq. 3.7, and adapt N (µ, Σ) towards the direction of maximum increase of the QD
objective (see Eq. 3.5).
We follow an evolution strategy approach to model and dynamically adapt the sampling distribution of coefficients
N (µ, Σ). We sample a population of λ coefficients from ci ∼ N (µ, Σ) and generate λ solutions θi. We then compute
∆i from CMA-ME’s improvement ranking for each candidate solution θi. By updating N (µ, Σ) with CMA-ES update
rules for the ranking ∆i, we dynamically adapt the distribution of coefficients c to maximize the QD objective.
Algorithm 6 shows the pseudocode for CMA-MEGA. In line 37 we evaluate the current solution and compute an
objective value f, a vector of measure values m, and gradient vectors. As we dynamically adapt the coefficients c,
we normalize the objective and measure gradients (line 38) for stability. Because the measure space is tessellated, the
measures m place solution θ into one of the M cells in the archive. We then add the solution to the archive (line 39), if
the solution discovers an empty cell in the archive, or if it improves an existing cell, identically to MAP-Elites.
We then use the gradient information to compute a step that maximizes improvement of the archive. In lines 40-46,
we sample a population of λ coefficients from a multi-variate Gaussian retained by CMA-ES, and take a gradient step
for each sample. We evaluate each sampled solution θ′i, and compute the improvement ∆i (line 45). As in CMA-ME,
we specify ∆i as the difference in the objective value between the sampled solution θ′i and the existing solution, if one
exists, or as the absolute objective value of the sampled solution if θ′i belongs to an empty cell.
In line 47, we rank the sampled gradients ∇i based on their respective improvements. As in CMA-ME, we prioritize
exploration of the archive by first ranking, by objective value, all samples that discover new cells, and subsequently
ranking, by improvement, all samples that improve existing cells. We then compute an ascending gradient step
as a linear combination of gradients (line 48), following the recombination weights wi from CMA-ES [123] based on
the computed improvement ranking. These weights correspond to the log-likelihood probabilities of the samples in the
natural gradient interpretation of CMA-ES [3].
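For reference, one common choice of CMA-ES recombination weights is the log-linear scheme over the top half of the ranked samples; the sketch below follows that convention and should be read as an illustrative assumption rather than the exact weights used in our implementation.

import numpy as np

def recombination_weights(lam):
    # Log-linear weights over the top half of the ranked samples,
    # normalized to sum to one (a standard CMA-ES weighting scheme).
    mu = lam // 2
    ranks = np.arange(1, mu + 1)
    raw = np.log((lam + 1) / 2) - np.log(ranks)
    return raw / raw.sum()

# Example: weights w_1..w_5 for a branching population of lambda = 10 (best first).
print(recombination_weights(10))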
In line 50, CMA-ES adapts the multi-variate Gaussian N (µ, Σ), as well as internal search parameters p, from the
improvement ranking of the coefficients. In the supplemental material, we provide a natural gradient interpretation
of the improvement ranking rules of CMA-MEGA, where we show that the coefficient distribution of CMA-MEGA
approximates natural gradient steps of maximizing a modified QD objective.
CMA-MEGA (Adam). We add an Adam-based variant of CMA-MEGA, where we replace line 49 with an Adam
gradient optimization step [167].
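For reference, a standard Adam update applied to the ascending step ∇step might look like the following sketch; the hyperparameter values are the common Adam defaults and are not necessarily those used in our experiments.

import numpy as np

class AdamStep:
    # Standard Adam optimizer state for updating the search point theta.
    def __init__(self, dim, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.m = np.zeros(dim)   # first-moment estimate
        self.v = np.zeros(dim)   # second-moment estimate
        self.t = 0
        self.eta, self.beta1, self.beta2, self.eps = eta, beta1, beta2, eps

    def step(self, theta, grad_step):
        # grad_step is an ascending direction, so the update is added to theta.
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad_step
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad_step**2
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        return theta + self.eta * m_hat / (np.sqrt(v_hat) + self.eps)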
OG-MAP-Elites. To show the importance of taking gradient steps in the measure space, as opposed to only taking
gradient steps in objective space and directly perturbing the parameters, we derive two variants of MAP-Elites as a
baseline that draw insights from the recently proposed Policy Gradient Assisted MAP-Elites (PGA-ME) algorithm [211].
PGA-ME combines the Iso+LineDD operator [275] with a policy gradient operator only on the objective. Similarly,
our proposed Objective-Gradient MAP-Elites (OG-MAP-Elites) algorithm combines an objective gradient step with a
MAP-Elites style perturbation operator. Each iteration of OG-MAP-Elites samples λ solutions θi from the archive.
Each θi is perturbed with Gaussian noise to form a new candidate solution θ
′
i = θi + σN (0, I). OG-MAP-Elites
evaluates the solution and updates the archive, exactly as in MAP-Elites. However, OG-MAP-Elites takes one additional
step: for each θ
′
i
, the algorithm computes ∇f(θ
′
i
), forms a new solution θ
′′
i = θ
′
i +η∇f(θ
′
i
) with an objective gradient
step, and evaluates θ
′′
i
. Finally, we update the archive with all solutions θ
′
i
and θ
′′
i
.
OG-MAP-Elites (line). Our second baseline, OG-MAP-Elites (line) replaces the Gaussian operator with the
Iso+LineDD operator [275]: θ′ = θi + σ1N (0, I) + σ2N (0, 1)(θi − θj). We consider OG-MAP-Elites (line) a
DQD variant of PGA-ME. However, PGA-ME was designed as a reinforcement learning (RL) algorithm, thus many of
the advantages gained in RL settings are lost in OG-MAP-Elites (line). We provide a detailed discussion and ablations
in the supplemental material.
3.2.5 Domains
DQD requires differentiable objective and measures, thus we select benchmark domains from previous work in the QD
literature where we can compute the gradients of the objective and measure functions.
Linear Projection. To show the importance of adaptation mechanisms in QD, the CMA-ME paper [91] introduced
a simple domain, where reaching the extremes of the measures is challenging for non-adaptive QD algorithms. The
domain forms each measure mi by a linear projection from R^n to R, while bounding the contribution of each component
θi to the range [−5.12, 5.12].
We note that uniformly sampling from a hypercube in R^n results in a narrow distribution of the linear projection in
R [91, 153]. Increasing the number of parameters n makes the problem of covering the measure space more challenging,
because to reach an extremum mi(θ) = ±5.12n, all components must equal the extremum: θ[i] = ±5.12.
Algorithm 6 Covariance Matrix Adaptation MAP-Elites via a Gradient Arborescence (CMA-MEGA)
CMA-MEGA (evaluate, θ0, N, λ, η, σg)
input: An evaluation function evaluate which computes the objective, the measures, gradients of the objective and
measures, an initial solution θ0, a desired number of iterations N, a branching population size λ, a learning
rate η, and an initial step size for CMA-ES σg.
result: Generate N(λ + 1) solutions storing elites in an archive A.
35  Initialize solution parameters θ to θ0, CMA-ES parameters µ = 0, Σ = σgI, and p, where we let p be the CMA-ES internal parameters.
36  for iter ← 1 to N do
37      f, ∇f, m, ∇m ← evaluate(θ)
38      ∇f ← normalize(∇f), ∇m ← normalize(∇m)
39      update archive(θ, f, m)
40      for i ← 1 to λ do
41          c ∼ N (µ, Σ)
42          ∇i ← c0∇f + Σ_{j=1}^{k} cj∇mj
43          θ′i ← θ + ∇i
44          f′, ∗, m′, ∗ ← evaluate(θ′i)
45          ∆i ← update archive(θ′i, f′, m′)
46      end
47      rank ∇i by ∆i
48      ∇step ← Σ_{i=1}^{λ} wi∇rank[i]
49      θ ← θ + η∇step
50      Adapt CMA-ES parameters µ, Σ, p based on improvement ranking ∆i
51      if there is no change in the archive then
52          Restart CMA-ES with µ = 0, Σ = σgI.
53          Set θ to a randomly selected existing cell θi from the archive
54      end
55  end
We select this domain as a benchmark to highlight the need for adaptive gradient coefficients for CMA-MEGA
as opposed to constant coefficients for OMG-MEGA, because reaching the edges of the measure space requires
dynamically shrinking the gradient steps.
As a QD domain, the domain must provide an objective. The CMA-ME study [91] introduces two variants of the
linear projection domain with an objective based on the sphere and Rastrigin functions from the continuous black-box
optimization set of benchmarks [123, 124]. We optimize over an unbounded parameter space R^n with n = 1000. We provide
more detail in the supplemental material.
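A simplified sketch of the sphere variant of this domain is given below; the exact clipping, offsets, and normalization of the benchmark follow prior work [91], so the hard clip and half-split of the parameter vector here are illustrative assumptions.

import numpy as np

def lp_sphere(theta):
    # Sphere objective (negated so that higher values are better).
    f = -np.sum(theta**2)
    # Each measure is a linear projection of clipped components over one half
    # of the parameter vector, bounding each contribution to [-5.12, 5.12].
    clipped = np.clip(theta, -5.12, 5.12)
    half = len(theta) // 2
    m = np.array([np.sum(clipped[:half]), np.sum(clipped[half:])])
    return f, m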
Arm Repertoire. We select the robotic arm repertoire domain from previous work [53, 275]. The goal in this domain
is to find an inverse kinematics (IK) solution for each reachable position of the end-effector of a planar robotic arm
with revolute joints. The objective f of each solution is to minimize the variance of the joint angles, while the measure
functions are the positions of the end effector in the x and y-axis, computed with the forward kinematics of the planar
arm [206]. We selected a 1000-DOF robotic arm.
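The objective and measures of this domain follow directly from planar forward kinematics, as in the sketch below; the unit link lengths are an assumption for illustration.

import numpy as np

def arm_repertoire(theta, link_lengths=None):
    # theta holds the joint angles of a planar arm with revolute joints.
    if link_lengths is None:
        link_lengths = np.ones_like(theta)  # assume unit-length links
    # Objective: minimize the joint-angle variance (negated so higher is better).
    f = -np.var(theta)
    # Measures: end-effector position from planar forward kinematics.
    cumulative = np.cumsum(theta)
    x = np.sum(link_lengths * np.cos(cumulative))
    y = np.sum(link_lengths * np.sin(cumulative))
    return f, np.array([x, y])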
                        LP (sphere)          LP (Rastrigin)       Arm Repertoire       LSI
Algorithm               QD-score  Coverage   QD-score  Coverage   QD-score  Coverage   QD-score  Coverage
MAP-Elites              1.04      1.17%      1.18      1.72%      1.97      8.06%      13.88     23.15%
MAP-Elites (line)       12.21     14.32%     8.12      11.79%     33.51     35.79%     16.54     25.73%
CMA-ME                  1.08      1.21%      1.21      1.76%      55.98     56.95%     18.96     26.18%
OG-MAP-Elites           1.52      1.67%      0.83      1.26%      57.17     58.08%     N/A       N/A
OG-MAP-Elites (line)    15.01     17.41%     6.10      8.85%      59.66     60.28%     N/A       N/A
OMG-MEGA                71.58     92.09%     55.90     77.00%     44.12     44.13%     N/A       N/A
CMA-MEGA                75.29     100.00%    62.54     100.00%    74.18     74.18%     5.36      8.61%
CMA-MEGA (Adam)         75.30     100.00%    62.58     100.00%    73.82     73.82%     21.82     30.73%
Table 3.4: Mean QD-score and coverage values after 10,000 iterations for each algorithm per domain.
Latent Space Illumination. Previous work [89] introduced the problem of exploring the latent space of a generative
model directly with a QD algorithm. The authors named the problem latent space illumination (LSI). As the original
LSI work evaluated non-differentiable objectives and measures, we create a new benchmark for the differentiable LSI
problem by generating images with StyleGAN [158] and leveraging CLIP [231] to create differentiable objective and
measure functions. We adopt the StyleGAN+CLIP [223] pipeline, where StyleGAN-generated images are passed to
CLIP, which in turn evaluates how well the generated image matches a given text prompt. We form the prompt “Elon
Musk with short hair.” as the objective and for the measures we form the prompts “A person with red hair.” and “A man
with blue eyes.”. The goal of DQD becomes generating faces similar to Elon Musk with short hair, but varying with
respect to hair and eye color.
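At a high level, the differentiable LSI evaluation composes the generator with CLIP; the sketch below assumes pre-loaded generator and clip_score handles (both hypothetical placeholders) and illustrates the structure of the objective, measures, and their gradients rather than the exact pipeline.

import torch

def lsi_evaluate(latent, generator, clip_score):
    # latent is assumed to be a leaf tensor with requires_grad=True.
    image = generator(latent)                                  # differentiable image generation
    f = clip_score(image, "Elon Musk with short hair.")        # objective score
    m_hair = clip_score(image, "A person with red hair.")      # measure 1
    m_eyes = clip_score(image, "A man with blue eyes.")        # measure 2
    # Gradients of the objective and measures with respect to the latent code.
    grad_f = torch.autograd.grad(f, latent, retain_graph=True)[0]
    grad_hair = torch.autograd.grad(m_hair, latent, retain_graph=True)[0]
    grad_eyes = torch.autograd.grad(m_eyes, latent)[0]
    return f, (m_hair, m_eyes), grad_f, (grad_hair, grad_eyes)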
3.2.6 Experiments
We conduct experiments to assess the performance of the MEGA variants. In addition to our OG-MAP-Elites baselines,
which we propose in section 3.2.4, we compare the MEGA variants with the state-of-the-art QD algorithms presented
in section 3.2.3. We implement the MEGA and OG-MAP-Elites variants in the Pyribs [266] QD library and compare
against the existing Pyribs implementations of MAP-Elites, MAP-Elites (line), and CMA-ME.
3.2.6.1 Experiment Design
Independent Variables. We follow a between-groups design, where the independent variables are the algorithm
and the domain (linear projection, arm repertoire, and LSI). We did not run OMG-MEGA and OG-MAP-Elites in
the LSI domain; while CMA-MEGA computes the f and mi gradients once per iteration (line 37 in Algorithm 6),
OMG-MEGA and OG-MAP-Elites compute the f and mi gradients for every sampled solution, making their execution
cost-prohibitive for the LSI domain.
Dependent Variables. We measure both the diversity and the quality of the solutions returned by each algorithm. These
are combined by the QD-score metric [230], which is defined as the sum of f values of all cells in the archive (Eq. 3.5).
To make the QD-score invariant with respect to the resolution of the archive, we normalize QD-score by the archive size
(the total number of cells from the tessellation of the measure space). As an additional metric of diversity we compute
the coverage as the number of occupied cells in the archive divided by the total number of cells. We run each algorithm
for 20 trials in the linear projection and arm repertoire domains, and for 5 trials in the LSI domain, resulting in a total of
445 trials.
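Both metrics are straightforward to compute from an archive; a minimal sketch, assuming the archive maps each occupied cell to the objective value of its elite and that objective values are non-negative:

def qd_metrics(archive, total_cells):
    # archive: dict mapping an occupied cell index -> objective value of its elite.
    qd_score = sum(archive.values()) / total_cells   # normalized by archive size
    coverage = len(archive) / total_cells            # fraction of occupied cells
    return qd_score, coverage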
3.2.6.2 Analysis
Table 3.4 shows the metrics of all the algorithms, averaged over 20 trials for the benchmark domains and over 5 trials
for the LSI domain. We conducted a two-way ANOVA to examine the effect of algorithm and domain (linear projection
(sphere), linear projection (Rastrigin), arm repertoire) on the QD-Score. There was a statistically significant interaction
between the search algorithm and the domain (F(14, 456) = 7328.18, p < 0.001). Simple main effects analysis
with Bonferroni corrections showed that CMA-MEGA and OMG-MEGA performed significantly better than each
of the baselines in the sphere and Rastrigin domains, with CMA-MEGA significantly outperforming OMG-MEGA.
CMA-MEGA also outperformed all the other algorithms in the arm repertoire domain.
We additionally conducted a one-way ANOVA to examine the effect of algorithm on the LSI domain. There was a
statistically significant difference between groups (F(4, 20) = 260.64, p < 0.001). Post-hoc pairwise comparisons
with Bonferroni corrections showed that CMA-MEGA (Adam) significantly outperformed all other algorithms, while
CMA-MEGA without the Adam implementation had the worst performance.
Both OMG-MEGA and CMA-MEGA variants perform well in the linear projection domain, where the objective
and measure functions are additively separable, and the partial derivatives with respect to each parameter independently
capture the steepest change of each function. We observe that OG-MAP-Elites performs poorly in this domain. Analysis
shows that the algorithm finds a nearly perfect best solution for the sphere objective, but it interleaves following the
gradient of the objective with exploring the archive as in standard MAP-Elites, resulting in smaller coverage of the
archive.

Figure 3.11: QD-Score plot with 95% confidence intervals and heatmaps of generated archives by CMA-MEGA (Adam)
and the strongest derivative-free competitor for the linear projection sphere (top), arm repertoire (middle), and latent
space illumination (bottom) domains.
In the arm domain, OMG-MEGA manages to reach the extremes of the measure space, but the algorithm fails
to fill in nearby cells. The OG-MAP-Elites variants perform significantly better than OMG-MEGA, because the
top-performing solutions in this domain tend to be concentrated in an “elite hypervolume” [275]; moving towards
the gradient of the objective finds top-performing cells, while applying isotropic perturbations to these cells fills in
nearby regions in the archive. CMA-MEGA variants retain the best performance in this domain. Fig. 3.11 shows a
high-precision view of the CMA-MEGA (Adam) archive for the arm repertoire domain.

Figure 3.12: Result of latent space illumination for the objective prompt “Elon Musk with short hair.” and for the
measure prompts “A person with red hair.” and “A man with blue eyes.”. The axes values indicate the score returned by
the CLIP model, where a lower score indicates a better match.
We did not observe a large difference between CMA-MEGA (Adam) and our gradient descent implementation
in the first two benchmark domains, where the curvature of the search space is well-conditioned. On the other hand, in
the LSI domain CMA-MEGA without the Adam implementation performed poorly. We conjecture that this is caused
by the conditioning of the mapping from the latent space of the StyleGAN to the CLIP score.
Fig. 3.11 shows the QD-score values for increasing number of iterations for each of the tested algorithms, with
95% confidence intervals. The figure also presents heatmaps of the CMA-MEGA (Adam) and the generated archive
of the strongest QD competitor for each of the three domains. We provide generated archives of all algorithms in the
supplemental material.
We visualize the top performing solutions in the LSI domain by uniformly sampling solutions from the archive of
CMA-MEGA (Adam) and showing the generated faces in Fig. 3.12. We observe that as we move from the top right to
the bottom left, the features matching the captions “a man with blue eyes” and “a person with red hair” become more
prevalent. We note that these solutions were generated from a single run of CMA-MEGA (Adam) for 10,000 iterations.
Overall, these results show that using the gradient information in quality diversity optimization results in significant
benefits to search efficiency, but adapting the gradient coefficients with CMA-ES is critical in achieving top performance.
3.2.7 Related Work
Quality Diversity. The precursor to QD algorithms [229] originated with diversity-driven algorithms as a branch of
evolutionary computation. Novelty search [178], which maintains an archive of diverse solutions, ensures diversity
through a provided metric function and was the first diversity-driven algorithm. Later, objectives were introduced
as a quality metric resulting in the first QD algorithms: Novelty Search with Local Competition (NSLC) [179] and
MAP-Elites [53, 202]. Since their inception, many works have improved the archives [92, 274, 252], the variation
operators [275, 91, 47, 213], and the selection mechanisms [54, 245] of both NSLC and MAP-Elites. While the
original QD algorithms were based on genetic algorithms, algorithms based on other derivative-free approaches such as
evolution strategies [91, 46, 213, 47] and Bayesian optimization [163] have recently emerged.
Being stochastic derivative-free optimizers [38], QD algorithms are frequently applied to reinforcement learning
(RL) problems [221, 10, 71] as derivative information must be estimated in RL. Naturally, approaches combining QD
and RL have started to emerge [211, 226]. Unlike DQD, these approaches estimate the gradient of the reward function,
and in the case of QD-RL a novelty function, in action space and backpropagate this gradient through a neural network.
Our proposed DQD problem differs by leveraging provided – rather than approximated – gradients of the objective and
measure functions.
Several works have proposed model-based QD algorithms. For example, the DDE-Elites algorithm [98] dynamically
trains a variational autoencoder (VAE) on the MAP-Elites archive, then leverages the latent space of this VAE by
interpolating between archive solutions in latent space as a type of crossover operator. DDE-Elites learns a manifold of
the archive data as a representation, regularized by the VAE’s loss function, to solve downstream optimization tasks
efficiently in this learned representation. The PoMS algorithm [232] builds upon DDE-Elites by learning an explicit
70
manifold of the archive data via an autoencoder. To overcome distortions introduced by an explicit manifold mapping,
the authors introduce a covariance perturbation operator based on the Jacobian of the decoder network. These works
differ from DQD by dynamically constructing a learned representation of the search space instead of leveraging the
objective and measure gradients directly.
Latent Space Exploration. Several works have proposed a variety of methods for directly exploring the latent space of
generative models. Methods on GANs include interpolation [273], gradient descent [24], importance sampling [286],
and latent space walks [141]. Derivative-free optimization methods mostly consist of latent variable evolution (LVE) [25,
101], the method of optimizing latent space with an evolutionary algorithm. LVE was later applied to generating Mario
levels [280] with targeted gameplay characteristics. Later work [89] proposed latent space illumination (LSI), the
problem of exploring the latent space of a generative model with a QD algorithm. The method has only been applied to
procedurally generating video game levels [89, 259, 244, 88] and generating MNIST digits [293]. Follow-up work
explored LSI on VAEs [243]. Our work improves LSI on domains where gradient information on the objective and
measures is available with respect to model output.
3.2.8 Societal Impacts
By proposing gradient-based analogs to derivative-free QD methods, we hope to expand the potential applications of
QD research and bring the ideas of the growing QD community to a wider machine learning audience. We are excited
about future mixing of ideas between QD, generative modeling, and other machine learning subfields.
In the same way that gradient descent is used to synthesize super-resolution images [197], our method could be applied
in that context, which would raise ethical considerations due to potential biases present in the trained model [33].
On the other hand, we hypothesize that thoughtful selection of the measure functions may help counterbalance this
issue, since we can explicitly specify the measures that ensure diversity over the collection of generated outputs. For
example, a model may be capable of generating a certain type of face, but the latent space may be organized in a way
which biases a gradient descent on the latent space away from a specific distribution of faces. If the kind of diversity
required is differentiably measurable, then DQD could potentially help resolve which aspect of the generative model,
i.e., the structure of the latent space or the representational capabilities of the model, is contributing to the bias.
Finally, we recognize the possibility of using this technology for malicious purposes, including generation of fake
images (“DeepFakes”), and we highlight the utility of studies that help identify DeepFake models [116].
3.2.9 Limitations and Future Work
Quality diversity (QD) is a rapidly emerging field [38] with applications including procedural content generation [113],
damage recovery in robotics [53, 202], efficient aerodynamic shape design [97], and scenario generation in human-robot
interaction [85, 88]. We have introduced differentiable quality diversity (DQD), a special case of QD, where measure and
objective functions are differentiable, and have shown how a gradient arborescence results in significant improvements
in search efficiency.
As both MEGA variants are only first order differentiable optimizers, we expect them to have difficulty on highly
ill-conditioned optimization problems. CMA-ES, as an approximate second order optimizer, retains a full-rank
covariance matrix that approximates curvature information and is known to outperform quasi-Newton methods on
highly ill-conditioned problems [106]. CMA-ME likely inherits these properties by leveraging the CMA-ES adaptation
mechanisms and we expect it to have an advantage on ill-conditioned objective and measure functions.
While we found CMA-MEGA to be fairly robust to hyperparameter changes in the first two benchmark domains
(linear projection, arm repertoire), small changes of the hyperparameters in the LSI domain led CMA-MEGA, as well as
all the QD baselines, to stray too far from the mean of the latent space, which resulted in many artifacts and unrealistic
images. One way to address this limitation is to constrain the search region to a hypersphere of radius √d, where d is
the dimensionality of the latent space, as done in previous work [197].
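A minimal sketch of such a projection, assuming the latent code is a flat NumPy array:

import numpy as np

def clip_to_hypersphere(latent):
    # Project a latent code back onto a hypersphere of radius sqrt(d),
    # where d is the latent dimensionality (a sketch of the constraint in [197]).
    radius = np.sqrt(latent.size)
    norm = np.linalg.norm(latent)
    if norm > radius:
        latent = latent * (radius / norm)
    return latent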
While CLIP achieves state-of-the-art performance in classifying images based on visual concepts, the model does
not measure abstract concepts. Ideally, we would like to specify “age” as a measure function and obtain quantitative
estimates of age given an image of a person. We believe that the proposed work on the LSI domain will encourage
future research on this topic, which we would in turn be able to integrate with DQD implementations to generate diverse,
high quality content.
Many problems, currently modelled as optimization problems, may be fruitfully redefined as QD problems, including
the training of deep neural networks. Our belief stems from recent works [238, 187], which reformulated deep learning
as a multi-objective optimization problem. However, QD algorithms struggle with high-variance stochastic objectives
and measures [154, 84], which naturally conflicts with minibatch training in stochastic gradient descent [27]. These
challenges need to be addressed before DQD training of deep neural networks becomes tractable.
3.3 Soft Archives for Quality Diversity Optimization
3.3.1 Introduction
Consider an example problem of searching for celebrity faces in the latent space of a generative model. As a single-objective optimization problem, we specify an objective f that targets a celebrity such as Tom Cruise. A single-objective
optimizer, such as CMA-ES [123], will converge to a single solution of high objective value, an image that looks like
Tom Cruise as much as possible.
However, this objective has ambiguity. How old was Tom Cruise in the photo? Did we want the person in the image
to have short or long hair? By instead framing the problem as a quality diversity optimization problem, we additionally
specify a measure function m1 that quantifies age and a measure function m2 that quantifies hair length. A quality
diversity algorithm [229, 38], such as CMA-ME [91], can then optimize for a collection of images that are diverse with
respect to age and hair length, but all look like Tom Cruise.
While prior work [91, 88, 89, 70] has shown that CMA-ME solves such QD problems efficiently, three important
limitations of the algorithm have been discovered. First, on difficult to optimize objectives, variants of CMA-ME will
abandon the objective too soon [267], and instead favor exploring the measure space, the vector space defined by the
measure function outputs. Second, the CMA-ME algorithm struggles to explore flat objective functions [217]. Third,
CMA-ME works well on high-resolution archives, but struggles to explore low-resolution archives [52, 87]. †
We propose a new algorithm, CMA-MAE, that addresses these three limitations. To address the first limitation,
we derive an algorithm that smoothly blends between CMA-ES and CMA-ME. First, consider how CMA-ES and
CMA-ME differ. At each step CMA-ES’s objective ranking maximizes the objective function f by approximating
the natural gradient of f at the current solution point [3]. In contrast, CMA-ME’s improvement ranking moves in
the direction of the natural gradient of f − fA at the current solution point, where fA is a discount function equal to
the objective of the best solution so far that has the same measure values as the current solution point. The function
f − fA quantifies the gap between a candidate solution and the best solution so far at the candidate solution’s position
in measure space.

†We note that archive resolution affects the performance of all current QD algorithms.

Figure 3.13: An example of how different α values affect the function f − fA optimized by CMA-MAE after a fixed
number of iterations. Here f is a bimodal objective where mode X is harder to optimize than mode Y, requiring more
optimization steps, and modes X and Y are separated by measure m1. For α = 0, the objective f is equivalent to
f − fA, as fA remains constant. For larger values of α, CMA-MAE discounts region Y in favor of prioritizing the
optimization of region X.
Our key insight is to anneal the function fA by a learning rate α. We observe that when α = 0, then our discount
function fA never increases and our algorithm behaves like CMA-ES. However, when α = 1, then our discount function
always maintains the best solution for each region in measure space and our algorithm behaves like CMA-ME. For
0 < α < 1, CMA-MAE smoothly blends between the two algorithms’ behavior, allowing for an algorithm that spends
more time on the optimization of f before transitioning to exploration. Figure 3.13 is an illustrative example of varying
the learning rate α.
Our proposed annealing method naturally addresses the flat objective limitation. Observe that both CMA-ES and
CMA-ME struggle on flat objectives f as the natural gradient becomes 0 in this case and each algorithm will restart.
However, we show that, when CMA-MAE optimizes f − fA for 0 < α < 1, the algorithm becomes a descent method
on the density histogram defined by the archive.
Finally, CMA-ME’s poor performance on low resolution archives is likely caused by the non-stationary objective
f − fA changing too quickly for the adaptation mechanism to keep up. Our archive learning rate α controls how
quickly f − fA changes. We derive a conversion formula for α that allows us to compute equivalent values of α for different
archive resolutions. Our conversion formula guarantees that CMA-MAE is the first QD algorithm invariant to archive
resolution.
While our new annealing method benefits CMA-ME, our approach is also compatible with CMA-ME’s differentiable
quality diversity (DQD) counterpart, CMA-MEGA [87]. We apply the same modification in the DQD setting to form
CMA-MAEGA. To evaluate CMA-MAEGA, we improve the latent space illumination (LSI) [87] domain by introducing
a higher-dimensional domain based on StyleGAN2, capable of producing higher quality images. This new domain
highlights the advantages of DQD in high-dimensional spaces and demonstrates the performance benefits of our
annealing method.
Overall, our work shows how a simple algorithmic change to CMA-ME addresses all three major limitations
affecting CMA-ME’s performance and robustness. Our theoretical findings justify the aforementioned properties and
inform our experiments, which show that CMA-MAE outperforms state-of-the-art QD algorithms and maintains robust
performance across different archive resolutions.
3.3.2 Problem Definition
Quality Diversity. We adopt the quality diversity (QD) problem definition from prior work [87]. A QD problem consists
of an objective function f : R^n → R that maps n-dimensional solution parameters to the scalar value representing the
quality of the solution and k measure functions mi : R^n → R or, as a vector function, m : R^n → R^k, that quantify the
behavior or attributes of each solution.‡ The range of m forms a measure space S = m(R^n). The QD objective is to
find a set of solutions θ ∈ R^n, such that m(θ) = s for each s in S and f(θ) is maximized.
The measure space S is continuous, but solving algorithms need to produce a finite collection of solutions. Therefore,
QD algorithms in the MAP-Elites [202, 53] family relax the QD objective by discretizing the space S. Given T as the
tessellation of S into M cells, the QD objective becomes to find a solution θi for each of the i ∈ {1, . . . , M} cells, such
that each θi maps to the cell corresponding to m(θi) in the tessellation T. The QD objective then becomes maximizing
the objective value f(θi) of all cells:

max Σ_{i=1}^{M} f(θi)    (3.8)
The differentiable quality diversity (DQD) problem [87] is a special case of the QD problem where both the objective
f and measures mi are first-order differentiable.
‡In agent-based settings, such as reinforcement learning, the measure functions are sometimes called behavior functions and the outputs of each
measure function are called behavioral characteristics or behavior descriptors.
3.3.3 Preliminaries
We present several QD algorithms that solve derivative-free QD problems to provide context for our proposed
CMA-MAE algorithm, and its differentiable quality diversity (DQD) counterpart CMA-MAEGA, which solves problems where the gradients of the objective and measure functions are available.
MAP-Elites and MAP-Elites (line). The MAP-Elites QD algorithm produces an archive of solutions, where each cell
in the archive corresponds to the provided tessellation T in the QD problem definition. The algorithm initializes the
archive by sampling solutions from the solution space R^n from a fixed distribution. After initialization, MAP-Elites
produces new solutions by selecting occupied cells uniformly at random and perturbing them with isotropic Gaussian
noise: θ′ = θi + σN (0, I). For each new candidate solution θ′, the algorithm computes an objective f(θ′) and
measures m(θ′). MAP-Elites places θ′ into the archive if the cell corresponding to m(θ′) is empty or θ′ obtains a
better objective value f(θ′) than the current occupant. The MAP-Elites algorithm results in an archive of solutions
that are diverse with respect to the measure function m, but also high quality with respect to the objective f. Prior
work [275] proposed the MAP-Elites (line) algorithm by augmenting the isotropic Gaussian perturbation with a linear
interpolation between two solutions θi and θj: θ′ = θi + σ1N (0, I) + σ2N (0, 1)(θi − θj).
CMA-ME. Covariance Matrix Adaptation MAP-Elites [91] combines the archiving mechanisms of MAP-Elites with
the adaptation mechanisms of CMA-ES [123]. Instead of perturbing archive solutions with Gaussian noise, CMA-ME
maintains a multivariate Gaussian of search directions N (0, Σ) and a search point θ ∈ R^n. The algorithm updates the
archive by sampling λ solutions around the current search point θi ∼ N (θ, Σ). After updating the archive, CMA-ME
ranks solutions via a two stage ranking. Solutions that discover a new cell are ranked by the objective ∆i = f(θi),
and solutions that map to an occupied cell e are ranked by the improvement over the incumbent solution θe in that
cell: ∆i = f(θi) − f(θe). CMA-ME prioritizes exploration by ranking all solutions that discover a new cell before
all solutions that improve upon an existing cell. Finally, CMA-ME moves θ towards the largest improvement in the
archive, according to the CMA-ES update rules. Prior work [87] showed that the improvement ranking of CMA-ME
approximates a natural gradient of a modified QD objective (see Eq. 3.8).
3.3.4 Proposed Algorithms
We present the CMA-MAE algorithm and its DQD counterpart CMA-MAEGA.
Figure 3.14: Our proposed CMA-MAE algorithm smoothly blends between the behavior of CMA-ES and CMA-ME
via an archive learning rate α. Each heatmap visualizes an archive of solutions across a 2D measure space, where the
color of each cell represents the objective value of the solution.
CMA-MAE. CMA-MAE is an algorithm that adjusts the rate at which the non-stationary QD objective f − fA changes. First,
consider at a high level how CMA-ME explores the measure space and discovers high quality solutions. The CMA-ME
algorithm maintains a solution point θ and an archive A with previously discovered solutions. When CMA-ME samples
a new solution θ′, the algorithm computes the solution’s objective value f(θ′) and maps the solution to a cell e in
the archive based on the measure m(θ′). CMA-ME then computes the improvement of the objective value f(θ′) of
the new solution over a discount function fA : R^n → R. In CMA-ME, we define fA(θ′) by computing the cell e in
the archive corresponding to m(θ′) and letting fA(θ′) = f(θe), where θe is the incumbent solution of cell e. The
algorithm ranks candidate solutions by improvement f(θ′) − fA(θ′) = f(θ′) − f(θe) and moves the search in the
direction of higher ranked solutions.
Assume that CMA-ME samples a new solution θ′ with a high objective value of f(θ′) = 99. If the current
occupant θe of the corresponding cell has a low objective value of f(θe) = 0.3, then the improvement in the archive
∆ = f(θ′) − f(θe) = 98.7 is high and the algorithm will move the search point θ towards θ′. Now, assume that in
the next iteration the algorithm discovers a new solution θ′′ with objective value f(θ′′) = 100 that maps to the same
cell as θ′. The improvement then is ∆ = f(θ′′) − f(θ′) = 1, as θ′ replaced θe in the archive in the previous iteration.
CMA-ME would likely move θ away from θ′′ as the solution resulted in low improvement. In contrast, CMA-ES would
move towards θ′′ as it ranks only by the objective f, ignoring previously discovered solutions with similar measure
values.
In the above example, CMA-ME moves away from a high performing region in order to maximize how the archive
changes. However, in domains with hard-to-optimize objective functions, it is beneficial to perform more optimization
steps towards the objective f before leaving each high-performing region [267].
Like CMA-ME, CMA-MAE maintains a discount function fA(θ′) and ranks solutions by improvement f(θ′) −
fA(θ′). However, instead of maintaining an elitist archive by setting fA(θ′) equal to f(θe), we maintain a soft archive
by setting fA(θ′) equal to te, where te is an acceptance threshold maintained for each cell in the archive A. When
adding a candidate solution to the archive, we control the rate that te changes by the archive learning rate α as follows:
te ← (1 − α)te + αf(θ′).
The archive learning rate α in CMA-MAE allows us to control how quickly we leave a high-performing region of
measure space. For example, consider discovering solutions in the same cell with objective value 100 in 5 consecutive
iterations. The improvement values computed by CMA-ME against the elitist archive would be 100, 0, 0, 0, 0, thus
CMA-ME would move rapidly away from this cell. The improvement values computed against the soft archive of
CMA-MAE with α = 0.5 would diminish smoothly as follows: 100, 50, 25, 12.5, 6.25, enabling further exploitation of
the high-performing region.
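The threshold annealing above can be reproduced in a few lines; the sketch below tracks only the per-cell acceptance thresholds (solution replacement is omitted) and uses minf = 0 purely to reproduce the diminishing improvement sequence from the example.

def soft_archive_update(thresholds, cell, f, alpha, min_f=0.0):
    # thresholds: dict mapping a cell to its acceptance threshold t_e.
    t_e = thresholds.get(cell, min_f)
    delta = f - t_e                      # improvement used for ranking
    if f > t_e:
        thresholds[cell] = (1 - alpha) * t_e + alpha * f
    return delta

thresholds = {}
# Five consecutive solutions with objective 100 in the same cell, alpha = 0.5:
print([soft_archive_update(thresholds, cell=0, f=100.0, alpha=0.5) for _ in range(5)])
# -> [100.0, 50.0, 25.0, 12.5, 6.25]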
Next, we walk through the CMA-MAE algorithm step-by-step. Algorithm 7 shows the pseudo-code for CMA-MAE
with the differences from CMA-ME highlighted in yellow. First, on line 57 we initialize the acceptance threshold
to minf . In each iteration we sample λ solutions around the current search point θ (line 60). For each candidate
solution θi, we evaluate the solution and compute the objective value f(θi) and measure values m(θi) (line 61).
Next, we compute the cell e in the archive that corresponds to the measure values and the improvement ∆i over the
current threshold te (lines 62-63). If the objective crosses the acceptance threshold te, we replace the incumbent θe in
the archive and increase the acceptance threshold te (lines 64-65). Next, we rank all candidate solutions θi by their
improvement ∆i
. Finally, we step our search point θ and adapt our covariance matrix Σ towards the direction of largest
improvement (lines 68-69) according to CMA-ES’s update rules [123].
CMA-MAEGA. We note that our augmentations to the CMA-ME algorithm only affect how we replace solutions in
the archive and how we calculate ∆i. Both CMA-ME and CMA-MEGA replace solutions and calculate ∆i identically,
so we apply the same augmentations from CMA-ME to CMA-MEGA to form a new DQD algorithm, CMA-MAEGA.
Algorithm 8 shows the pseudo-code for CMA-MAEGA with the differences from CMA-MEGA highlighted in yellow.
Algorithm 7 Covariance Matrix Adaptation MAP-Annealing
CMA-MAE (evaluate, θ0, N, λ, σ, minf, α)
input: An evaluation function evaluate that computes the objective and measures, an initial solution θ0, a desired
number of iterations N, a branching population size λ, an initial step size σ, a minimal acceptable solution
quality minf, and an archive learning rate α.
result: Generate Nλ solutions storing elites in an archive A.
56  Initialize solution parameters θ to θ0, CMA-ES parameters Σ = σI and p, where we let p be the CMA-ES internal parameters.
57  Initialize the archive A and the acceptance threshold te with minf for each cell e.
58  for iter ← 1 to N do
59      for i ← 1 to λ do
60          θi ∼ N (θ, Σ)
61          f, m ← evaluate(θi)
62          e ← calculate cell(A, m)
63          ∆i ← f − te
64          if f > te then
65              Replace the current occupant in cell e of the archive A with θi; te ← (1 − α)te + αf
66          end
67      end
68      rank θi by ∆i
69      Adapt CMA-ES parameters θ, Σ, p based on improvement ranking ∆i
70      if CMA-ES converges then
71          Restart CMA-ES with Σ = σI.
72          Set θ to a randomly selected existing cell θi from the archive
73      end
74  end
3.3.5 Theoretical Properties of CMA-MAE
We provide insights about the behavior of CMA-MAE for different α values.
Theorem 3.3.5.1. The CMA-ES algorithm is equivalent to CMA-MAE when α = 0, if CMA-ES restarts from an archive
solution.
Proof. CMA-ES and CMA-MAE differ only on how they rank solutions. CMA-ES ranks solutions purely based on the
objective f, while CMA-MAE ranks solutions by f − te, where te is the acceptance threshold initialized by minf .
Thus, to show that CMA-ES is equivalent to CMA-MAE for α = 0, we only need to show that they result in identical
rankings.
In CMA-MAE, te is updated as follows: te ← (1 − α)te + αf. For α = 0, te = minf is invariant for the whole
algorithm: te ← 1 · te + 0 · f = te. Therefore, CMA-MAE ranks solutions based on f − minf. However, comparison-based sorting is invariant to order-preserving transformations of the values being sorted [123]. Thus, CMA-ES and
CMA-MAE rank solutions identically.
Algorithm 8 Covariance Matrix Adaptation MAP-Annealing via a Gradient Arborescence (CMA-MAEGA)
CMA-MAEGA (evaluate, θ0, N, λ, η, σg, minf, α)
input: An evaluation function evaluate that computes the objective, the measures, and the gradients of the objective
and measures, an initial solution θ0, a desired number of iterations N, a branching population size λ, a
learning rate η, an initial step size for CMA-ES σg, a minimal acceptable solution quality minf, and an
archive learning rate α.
result: Generate N(λ + 1) solutions storing elites in an archive A.
1   Initialize solution parameters θ to θ0, CMA-ES parameters µ = 0, Σ = σgI, and p, where we let p be the CMA-ES internal parameters.
2   Initialize the archive A and the acceptance threshold te with minf for each cell e.
3   for iter ← 1 to N do
4       f, ∇f, m, ∇m ← evaluate(θ)
5       ∇f ← normalize(∇f), ∇m ← normalize(∇m)
6       if f > te then
7           Replace the current elite in cell e of the archive A with θ
8           te ← (1 − α)te + αf
9       end
10      for i ← 1 to λ do
11          c ∼ N (µ, Σ)
12          ∇i ← c0∇f + Σ_{j=1}^{k} cj∇mj
13          θ′i ← θ + ∇i
14          f′, ∗, m′, ∗ ← evaluate(θ′i)
15          ∆i ← f′ − te
16          if f′ > te then
17              Replace the current occupant in cell e of the archive A with θ′i
18              te ← (1 − α)te + αf′
19          end
20      end
21      rank ∇i by ∆i
22      ∇step ← Σ_{i=1}^{λ} wi∇rank[i]
23      θ ← θ + η∇step
24      Adapt CMA-ES parameters µ, Σ, p based on improvement ranking ∆i
25      if there is no change in the archive then
26          Restart CMA-ES with µ = 0, Σ = σgI.
27          Set θ to a randomly selected existing cell θi from the archive
28      end
29  end
The next theorem states that CMA-ME is equivalent to CMA-MAE when α = 1 with the following caveats: First,
we assume that CMA-ME restarts only by the CMA-ES restart rules, rather than the additional “no improvement”
restart rule in prior work [91]. Second, we assume that both CMA-ME and CMA-MAE leverage µ selection [123]
rather than filtering selection [91].
Lemma 3.3.5.2. During execution of the CMA-MAE algorithm with α = 1, the threshold te is equal to f(θe) for cells
that are occupied by a solution θe and to minf for all empty cells.
Proof. We will prove the lemma by induction. All empty cells are initialized with te = minf , satisfying the basis step.
Then, we will show that if the statement holds after k archive updates, it will hold after a subsequent update k + 1.
Assume that at step k we generate a new solution θi mapped to a cell e. We consider two cases:
Case 1: The archive cell e is empty. Then, f(θi) > minf and both CMA-ME and CMA-MAE will place θi in the
archive as the new cell occupant θe. The threshold te is updated as te = (1 − α)te + αf(θe) = 0minf + 1f(θe) =
f(θe).
Case 2: The archive cell e contains an incumbent solution θe. Then, either f(θi) ≤ f(θe) or f(θi) > f(θe).
If f(θi) ≤ f(θe), then the archive does not change and the inductive step holds via the inductive hypothesis. If
f(θi) > f(θe), then θi becomes the new cell occupant θe and te is updated as te = (1 − α)te + αf(θe) =
0 · te + 1 · f(θe) = f(θe).
Theorem 3.3.5.3. The CMA-ME algorithm is equivalent to CMA-MAE when α = 1 and minf is an arbitrarily large
negative number.
Proof. Both CMA-ME and CMA-MAE rank candidate solutions θi based on improvement values ∆i. While CMA-ME
and CMA-MAE compute ∆i differently, we will show that for α = 1, the rankings are identical for the two algorithms.
We assume a new candidate solution mapped to a cell e. We describe first the computation of ∆i for CMA-ME.
CMA-ME ranks solutions that discover an empty cell based on their objective value. Thus, if θi discovers an empty
cell, ∆i = f(θi). On the other hand, if θi is mapped to a cell occupied by another solution θe, it will rank θi based
on the improvement ∆i = f(θi) − f(θe). CMA-ME performs a two-stage ranking, where it ranks all solutions that
discover empty cells before solutions that improve occupied cells.
We now show the computation of ∆i for CMA-MAE with α = 1. If θi discovers an empty cell ∆i = f(θi) − te
and by Lemma 3.3.5.2 ∆i = f(θi)−minf . If θi is mapped to a cell occupied by another solution θe, ∆i = f(θi)−te
and by Lemma 3.3.5.2 ∆i = f(θi) − f(θe).
Comparing the values ∆i between the two algorithms we observe the following: (1) If θi discovers an empty
cell, ∆i = f(θi) − minf for CMA-MAE. However, minf is a constant and comparison-based sorting is invariant to
order preserving transformations [123], thus ranking by ∆i = f(θi) − minf is identical to ranking by ∆i = f(θi)
performed by CMA-ME. (2) If θi is mapped to a cell occupied by another solution θe, ∆i = f(θi) − f(θe) for
both algorithms. (3) Because minf is an arbitrarily large negative number f(θi) − minf > f(θi) − f(θe). Thus,
CMA-MAE will always rank solutions that discover empty cells before solutions that are mapped to occupied cells,
identically to CMA-ME.
We next provide theoretical insights on how the discount function fA smoothly increases from a constant function
minf to the discount function used by CMA-ME, as α increases from 0 to 1. We focus on the special case of a fixed
sequence of candidate solutions.
Theorem 3.3.5.4. Let αi and αj be two archive learning rates for archives Ai and Aj such that 0 ≤ αi < αj ≤ 1. For
two runs of CMA-MAE that generate the same sequence of m candidate solutions {S} = θ1, θ2, ..., θm, it follows that
fAi(θ) ≤ fAj(θ) for all θ ∈ R^n.
Proof. We prove the theorem via induction over the sequence of solution additions. fA is the histogram formed by the
thresholds te over all archive cells e in the archive. Thus, we prove fAi ≤ fAj by showing that te(Ai) ≤ te(Aj) for all
archive cells e after m archive additions.
As a basis step, we note that Ai equals Aj as both archives are initialized with minf.
Our inductive hypothesis states that after k archive additions we have te(Ai) ≤ te(Aj), and we need to show that
te(Ai) ≤ te(Aj) after solution θk+1 is added to each archive.
Our solution θk+1 has three cases with respect to the acceptance thresholds:
Case 1: f(θk+1) ≤ te(Ai) ≤ te(Aj ). The solution is not added to either archive and our property holds from the
inductive hypothesis.
Case 2: te(Ai) ≤ f(θk+1) ≤ te(Aj). The solution is added to Ai, but not Aj, thus t′e(Aj) = te(Aj). We follow
the threshold update: t′e(Ai) = (1 − αi)te(Ai) + αif(θk+1). Next, we need to show that t′e(Ai) ≤ t′e(Aj) to complete
the inductive step:
(1 − αi)te(Ai) + αif(θk+1) ≤ f(θk+1) ⇐⇒
(1 − αi)te(Ai) ≤ (1 − αi)f(θk+1) ⇐⇒
te(Ai) ≤ f(θk+1), as 1 − αi ≥ 0
The last inequality holds true per our initial assumption for Case 2. Also by the Case 2 assumption, we have
f(θk+1) ≤ te(Aj) = t′e(Aj), which completes the inductive step.
Case 3: te(Ai) ≤ te(Aj) ≤ f(θk+1). The solution θk+1 is added to both archives. We need to show that
t′e(Ai) ≤ t′e(Aj):

t′e(Ai) ≤ t′e(Aj) ⇐⇒ (1 − αi)te(Ai) + αif(θk+1) ≤ (1 − αj)te(Aj) + αjf(θk+1)   (3.9)

We can rewrite Eq. 3.9 as:

(1 − αj)te(Aj) − (1 − αi)te(Ai) + αjf(θk+1) − αif(θk+1) ≥ 0   (3.10)

First, note that:

(1 − αj)te(Aj) − (1 − αi)te(Ai) ≥ (1 − αj)te(Ai) − (1 − αi)te(Ai)
                                = (1 − αj − 1 + αi)te(Ai)
                                = (αi − αj)te(Ai).

Thus:

(1 − αj)te(Aj) − (1 − αi)te(Ai) ≥ (αi − αj)te(Ai)   (3.11)

From Eq. 3.10 and 3.11 we have:

(1 − αj)te(Aj) + αjf(θk+1) − (1 − αi)te(Ai) − αif(θk+1)
   ≥ (αi − αj)te(Ai) + (αj − αi)f(θk+1)
   = (αj − αi)(f(θk+1) − te(Ai))
As αj > αi and f(θk+1) ≥ te(Ai), we have (αj − αi)(f(θk+1) − te(Ai)) ≥ 0. This completes the proof that
Eq. 3.10 holds.
As all cases in our inductive step hold, our proof by induction is complete.
Finally, we wish to address the limitation that CMA-ME performs poorly on flat objectives, where all solutions
have the same objective value. Consider how CMA-ME behaves on a flat objective f(θ) = C for all θ ∈ R^n, where C
is an arbitrary constant. CMA-ME will only discover each new cell once and will not receive any further feedback
from that cell, since any future solution cannot replace the incumbent elite. This hinders CMA-ME's movement in
measure space, which is based on feedback from changes in the archive. Future candidate solutions will only fall into
occupied cells, triggering repeated restarts caused by CMA-ES's restart rule.
When the objective function plateaus, we still want CMA-ME to perform well and benefit from the CMA-ES
adaptation mechanisms. One reasonable approach would be to keep track of the frequency oe that each cell e has been
visited in the archive, where oe represents the number of times a solution was generated in that cell. Then, when a
flat objective occurs, we rank solutions in ascending order of their cell's frequency count. Conceptually, CMA-ME would descend the
density histogram defined by the archive, pushing the search towards regions of the measure space that have been
sampled less frequently. Theorem 3.3.5.6 shows that we obtain the density descent behavior on flat objectives without
additional changes to the CMA-MAE algorithm.
To formalize our theory, we define the approximate density descent algorithm that is identical to CMA-ME, but
differs by how solutions are ranked. Specifically, the algorithm maintains a histogram of occupancy counts oe for each
cell e, with oe representing the number of times a solution was generated in that cell. This algorithm descends the
density histogram in measure space by ranking solutions based on the occupancy count of the cell that the solution
maps to, where solutions that discover less frequently visited cells are ranked higher than more frequently visited cells.
We will prove CMA-MAE and the approximate density descent algorithm are equivalent for flat objectives.
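As a minimal illustration (not part of the formal algorithm definitions), the ranking used by the approximate density descent algorithm can be sketched in Python as follows; the occupancy dictionary and function name are illustrative.

def rank_by_density(batch_cells, occupancy):
    """Order candidate indices so that candidates landing in less frequently
    visited cells come first, as in the approximate density descent algorithm.

    `batch_cells[i]` is the archive cell that candidate i maps to and
    `occupancy[cell]` counts prior solutions generated in that cell.
    """
    return sorted(range(len(batch_cells)),
                  key=lambda i: occupancy.get(batch_cells[i], 0))

Theorem 3.3.5.6 below shows that, on a flat objective, ranking by the CMA-MAE improvement C − te produces exactly this ordering, because te grows with every addition to a cell (Lemma 3.3.5.5).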
Lemma 3.3.5.5. The threshold te after k additions to cell e forms a strictly increasing sequence for a constant objective
function f(θ) = C for all θ ∈ R^n, when 0 < α < 1 and minf < C.
Proof. To show that te after k additions to cell e forms a strictly increasing sequence, we write a recurrence relation
for te after k solutions have been added to cell e. Let te(k) = (1 − α)te(k − 1) + αf(θi) = (1 − α)te(k − 1) + αC and
te(0) = minf be that recurrence relation. To show the recurrence is an increasing function, we need to show that
te(k) > te(k − 1) for all k ≥ 1.
We prove the inequality via induction over cell additions k. As a basis step, we show te(1) > te(0): (1 − α)minf +
αC > minf ⇐⇒ αC − α · minf > 0 ⇐⇒ αC > α · minf. As C > minf and α > 0, the basis step holds.
For the inductive step, we assume that te(k) > te(k − 1) and need to show that te(k + 1) > te(k): te(k + 1) >
te(k) ⇐⇒ (1 − α)te(k) + αC > (1 − α)te(k − 1) + αC ⇐⇒ (1 − α)te(k) > (1 − α)te(k − 1) ⇐⇒ te(k) >
te(k − 1), which holds by the inductive hypothesis.
Theorem 3.3.5.6. The CMA-MAE algorithm optimizing a constant objective function f(θ) = C for all θ ∈ R^n is
equivalent to density descent, when 0 < α < 1 and minf < C.
Proof. We will prove that for an arbitrary archive A with both the occupancy count for each cell oe and the threshold
value te computed with arbitrary learning rate 0 < α < 1, CMA-MAE results in the same ranking for an arbitrary batch
of solutions {θi} as the approximate density descent algorithm.
We let θi and θj be two arbitrary solutions in the batch mapped to cells ei and ej. Without loss of generality, we
let oei ≤ oej. The approximate density descent algorithm will thus rank θi before θj. We will show that CMA-MAE
results in the same ranking.
If oei ≤ oej, and since te is a strictly increasing function of the number of cell additions by Lemma 3.3.5.5, we have
tei(oei) ≤ tej(oej). We have tei(oei) ≤ tej(oej) ⇐⇒ C − tei(oei) ≥ C − tej(oej). Thus, the archive improvement from
adding θi to the archive is larger than the improvement from adding θj, and CMA-MAE will rank θi higher than θj,
identically to density descent.
We highlight that the proof of Theorem 3.3.5.6 is based on two critical properties. First, the threshold update rule
forms a strictly increasing sequence of thresholds for each cell. Second, CMA-ES is invariant to order preserving
transformations of its objective f. While we have proposed the update rule of line 65 of Algorithm 7, we note that any
update rule that satisfies the increasing sequence property retains the density descent property and is thus applicable in
CMA-MAE.
LP (sphere) LP (Rastrigin) LP (plateau) Arm Repertoire LSI (StyleGAN) LSI (StyleGAN2)
Algorithm QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage
MAP-Elites 41.64 50.80% 31.43 47.88% 47.07 47.07% 71.40 74.09% 12.85 19.42% -936.96 4.48%
MAP-Elites (line) 49.07 60.42% 38.29 56.51% 52.20 52.20% 74.55 75.61% 14.40 21.11% -236.65 8.81%
CMA-ME 36.50 42.82% 38.02 53.09% 34.54 34.54% 75.82 75.89% 14.00 19.57% — —
CMA-MAE 64.86 83.31% 52.65 80.46% 79.27 79.29% 79.03 79.24% 17.67 25.08% — —
Table 3.5: Mean QD-score and coverage values after 10,000 iterations for each QD algorithm per domain.
While Theorem 3.3.5.6 assumes a constant objective f, we conjecture that the theorem holds true generally when
threshold te in each cell e approaches the local optimum within the cell boundaries.
Conjecture 3.3.5.7. The CMA-MAE algorithm becomes equivalent to the density descent algorithm for a subset of
archive cells for an arbitrary convex objective f, where the cardinality of the subset of cells increases as the number of
iterations increases.
We provide intuition for our conjecture through the lens of the elite hypervolume hypothesis [275]. The elite
hypervolume hypothesis states that optimal solutions for the MAP-Elites archive form a connected region in search
space. Later work [232] connected the elite hypervolume hypothesis to the manifold hypothesis [79] in machine
learning, stating that the elite hypervolume can be represented by a low dimensional manifold in search space.
For our conjecture, we assume that the elite hypervolume hypothesis holds and there exists a smooth manifold that
represents the hypervolume. Next, we assume in the conjecture that f is an arbitrary convex function. As f is convex,
early in the CMA-MAE search the discount function fA will be flat and the search point θ will approach the global
optimum following CMA-ES’s convergence properties [128, 126], where the precision of convergence is controlled by
archive learning rate α. By definition, the global optimum θ∗ is within the elite hypervolume, as no other solution of
higher quality exists within its archive cell. Assuming the elite hypervolume hypothesis holds, a subset of adjacent
solutions in search space will also be in the hypervolume due to the connectedness of the hypervolume. As fA increases
around the global optimum, we conjecture that the function f(θ∗) − fA(θ∗) will form a plateau around the optimum,
since it will approach the value f(θi) − fA(θi) of adjacent solutions θi. By Theorem 3.3.5.6 we have a density descent
algorithm within the plateau, pushing CMA-MAE to discover solutions on the frontier of the known hypervolume.
Finally, we remark that our conjecture implies that f − fA tends towards a constant function in the limit, resulting
in a density descent algorithm across the elite hypervolume manifold as the number of generated solutions approaches
infinity. We leave a formal proof of this conjecture for future work.
LP (sphere) LP (Rastrigin) LP (plateau) Arm Repertoire LSI (StyleGAN) LSI (StyleGAN2)
Algorithm QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage
CMA-MEGA 75.32 100.00% 63.07 100.00% 100.00 100.00% 75.21 75.25% 16.08 22.58% 9.17 14.91%
CMA-MAEGA 75.39 100.00% 63.06 100.00% 100.00 100.00% 79.27 79.35% 16.20 23.83% 11.51 18.62%
Table 3.6: Mean QD-score and coverage values after 10,000 iterations for each DQD algorithm per domain.
3.3.6 Experiments
We compare the performance of CMA-MAE with the state-of-the-art QD algorithms MAP-Elites, MAP-Elites (line),
and CMA-ME, using existing implementations from the pyribs QD library [266]. We set α = 0.01 for CMA-MAE and include
additional experiments for varying α in Section 3.3.7. Because annealing methods replace solutions based on the
threshold, we retain the best solution in each cell for comparison purposes. We also compare CMA-MEGA with
CMA-MAEGA, the DQD counterpart of CMA-MAE.
We select the benchmark domains from prior work [87]: linear projection [91], arm repertoire [54], and latent space
illumination [89]. To evaluate the exploration behavior of CMA-MAE on flat objectives, we introduce a variant
of the linear projection domain with a “plateau” objective function that is constant everywhere for solutions within
a fixed range and has a quadratic penalty for solutions outside the range.
We additionally introduce a second LSI experiment on StyleGAN2 [159], configured by insights from the generative
art community [49, 93] that improve the quality of single-objective latent space optimization. To improve control
over image synthesis, the LSI (StyleGAN2) domain optimizes the full 9216-dimensional latent w-space, rather than a
compressed 512-dimensional latent space in the LSI (StyleGAN) experiments. We exclude CMA-ME and CMA-MAE
from this domain due to the prohibitive size of the covariance matrix. The LSI (StyleGAN2) domain allows us to
evaluate the performance of DQD algorithms on a much more challenging DQD domain than prior work.
3.3.6.1 Experiment Design
Independent Variables. We follow a between-groups design with two independent variables: the algorithm and the
domain.
Dependent Variables. We use the sum of f values of all cells in the archive, defined as the QD-score [230], as a metric
for the quality and diversity of solutions. Following prior work [87], we normalize the QD-score metric by the archive
size (the total number of cells from the tessellation of measure space) to make the metric invariant to archive resolution.
Figure 3.15: QD-score plot with 95% confidence intervals and heatmaps of generated archives by CMA-MAE and
CMA-ME for the linear projection sphere (top), plateau (middle), and arm repertoire (bottom) domains. Each heatmap
visualizes an archive of solutions across a 2D measure space.
We additionally compute the coverage, defined as the number of occupied cells in the archive divided by the total
number of cells.
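As a small illustrative sketch (assuming an archive represented as a mapping from occupied cells to elite objective values, which is not the pyribs data structure), the two metrics can be computed as follows.

def qd_metrics(archive, total_cells):
    """Compute the archive-size-normalized QD-score and the coverage.

    `archive` maps occupied cells to the objective value of their elite and
    `total_cells` is the number of cells in the tessellation of measure space.
    """
    qd_score = sum(archive.values()) / total_cells
    coverage = len(archive) / total_cells
    return qd_score, coverage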
Figure 3.16: A latent space illumination collage for the objective “A photo of the face of Tom Cruise.” with hair length
and age measures sampled from a final CMA-MAEGA archive for the LSI (StyleGAN2) domain.
3.3.6.2 Analysis
Derivative-free QD Algorithms. Table 3.5 shows the QD-score and coverage values for each algorithm and domain,
averaged over 20 trials for the linear projection (LP) and arm repertoire domains and over 5 trials for the LSI domain.
Fig. 3.15 shows the QD-score values for increasing number of iterations and example archives for CMA-MAE and
CMA-ME with 95% confidence intervals.
We conducted a two-way ANOVA to examine the effect of the algorithm and domain on the QD-score. There was a
significant interaction between the algorithm and the domain (F(12, 320) = 1958.34, p < 0.001). Simple main effects
analysis with Bonferroni corrections showed that CMA-MAE outperformed all derivative-free QD baselines in all
benchmark domains.
For the arm repertoire domain, we can compute the optimal archive coverage by testing whether each cell overlaps
with a circle of radius equal to the maximum arm length. We observe that CMA-MAE approaches the computed optimal
coverage of 80.24% for a resolution of 100 × 100.
DQD Algorithms. We additionally compare CMA-MEGA and CMA-MAEGA in the five benchmark domains.
Table 3.6 shows the QD-score and coverage values for each algorithm and domain, averaged over 20 trials for the linear
projection (LP) and arm repertoire domains and over 5 trials for the LSI domains. We conducted a two-way ANOVA to
examine the effect of the algorithm and domain on the QD-score. There was a significant interaction between the search
algorithm and the domain (F(5, 168) = 165.7, p < 0.001). Simple main effects analysis with Bonferroni corrections
showed that CMA-MAEGA outperformed CMA-MEGA in the LP (sphere), arm repertoire, and LSI (StyleGAN2)
domains. There was no statistically significant difference between the two algorithms in the LP (Rastrigin), LP
(plateau), and LSI (StyleGAN) domains.
We attribute the absence of a statistical difference in QD-score between the two algorithms on the LP (Rastrigin)
and LP (plateau) domains to the perfect coverage obtained by both algorithms. Thus, any differences in QD-score are
based on the objective values of the solutions returned by each algorithm. In LP (plateau), the optimal objective for each
cell is easily obtainable for both methods. The LP (Rastrigin) domain contains many local optima because of the form
of the objective function. CMA-MEGA will converge to these optima before restarting, behaving as a single-objective
optimizer within each local optimum. Because of the large number of local optima in the domain, CMA-MEGA still
obtains a high QD-score.
In the LSI (StyleGAN) domain, we attribute similar performance between CMA-MEGA and CMA-MAEGA to
the restart rules used to keep each search within the training distribution of StyleGAN. The ill-conditioned latent
space of StyleGAN also explains why CMA-MAE outperforms both DQD algorithms on this domain. Being a natural
gradient optimizer, CMA-MAE is an approximate second-order method, and second-order methods are better suited for
optimizing spaces with ill-conditioned curvature.
On the other hand, in the LSI (StyleGAN2) domain, we regularize the search space by an L2 penalty in latent space,
allowing for a larger learning rate and a basic restart rule for both algorithms, while still preventing drift out of the
training distribution of StyleGAN2. Because of the fewer restarts, CMA-MAEGA can take advantage of the density
descent property, which was shown to improve exploration in CMA-MAE, and outperform CMA-MEGA. Fig. 3.16
shows an example collage generated from the final archive of CMA-MAEGA on the LSI (StyleGAN2) domain. We note
that because StyleGAN2 has a better-conditioned latent space [159], the model is better suited for first-order
optimization of its latent space, which helps distinguish between the two DQD algorithms.
Figure 3.17: Final QD-score of each algorithm for 25 different archive resolutions.
LP (sphere) LP (Rastrigin) LP (plateau) Arm Repertoire
α (CMA-MAE) QD-score Coverage QD-score Coverage QD-score Coverage QD-score Coverage
0.000 5.82 6.06% 5.33 6.24% 19.49 19.49% 65.91 66.25%
0.001 62.65 79.36% 47.87 68.10% 77.60 77.68% 78.63 79.07%
0.010 64.86 83.31% 52.65 80.56% 79.27 79.29% 79.03 79.24%
0.100 60.42 76.19% 48.74 72.50% 83.21 83.21% 78.74 78.85%
1.000 37.01 43.50% 37.86 52.82% 34.00 34.00% 75.94 76.01%
Table 3.7: Mean QD metrics after 10,000 iterations for CMA-MAE at different learning rates.
We include runs for MAP-Elites and MAP-Elites (line) on the LSI (StyleGAN2) domain in Table 3.5 for comparison
purposes. In the LSI (StyleGAN2) domain, the two algorithms drift out of distribution and suffer a regularization
penalty that results in negative objective values. We observe that the gap in performance between derivative-free QD
algorithms and DQD algorithms is larger in the LSI (StyleGAN2) domain than in the LSI (StyleGAN) domain. This
highlights the benefits of leveraging gradients of the objective and measure functions in high-dimensional search spaces.
3.3.7 On the Robustness of CMA-MAE
Next, we present two studies that evaluate CMA-MAE's robustness to two hyperparameters that may affect algorithm
performance: the archive learning rate α and the archive resolution.
Archive Learning Rate. We examine the effect of different archive learning rates on the performance of CMA-MAE in
the linear projection and arm repertoire domains. We vary the learning rate from 0 to 1 on an exponential scale, while
keeping the resolution constant.
Table 3.7 shows that running CMA-MAE with the different 0 < α < 1 values results in similar performance,
showing that CMA-MAE is fairly robust to the exact choice of α value. On the other hand, if α = 0 or α = 1 the
performance drops drastically. Setting α = 1 results in performance very similar to CMA-ME, which supports our
insight from Theorem 3.3.5.3.
Archive Resolution. As noted in prior work [52, 87], quality diversity algorithms in the MAP-Elites family sometimes
perform differently when run with different archive resolutions. For example, in the linear projection domain proposed
in prior work [91], CMA-ME outperformed MAP-Elites and MAP-Elites (line) for archives of resolution 500 × 500,
while in this paper we observe that it performs worse for resolution 100 × 100. In this study, we investigate how
CMA-MAE performs at different archive resolutions.
First, we note that the optimal archive learning rate α is dependent on the resolution of the archive. Consider
as an example a sequence of solution additions to two archives A1 and A2 of resolution 100 × 100 and 200 × 200,
respectively. A2 subdivides each cell in A1 into four cells, thus archive A2’s thresholds te should increase at a four times
faster rate than A1. To account for this difference, we compute α2 for A2 via a conversion formula α2 = 1 − (1 − α1)^r
(see derivation in Section 3.3.8), where r is the ratio of cell counts between archives A1 and A2. We initialize α1 = 0.01
for A1. In the above example, α2 = 1 − (1 − 0.01)^4 ≈ 0.0394.
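The conversion can be packaged as a small helper; this is a sketch, with the function name chosen for illustration.

def convert_alpha(alpha_1, ratio):
    """Convert an archive learning rate tuned for one archive resolution into
    an equivalent rate for another, where `ratio` is the ratio of cell counts
    between the new archive and the reference archive."""
    return 1.0 - (1.0 - alpha_1) ** ratio

# Example from the text: 100 x 100 reference archive with alpha_1 = 0.01
# converted to a 200 x 200 archive (4x the cells).
alpha_2 = convert_alpha(0.01, 4)  # approximately 0.0394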
Fig. 3.17 shows the QD-score of CMA-MAE with the resolution-dependent archive learning rate and the baselines
for each benchmark domain. CMA-ME performs worse as the resolution decreases because the archive changes quickly
at small resolutions, affecting CMA-ME’s adaptation mechanism. On the contrary, MAP-Elites and MAP-Elites (line)
perform worse as the resolution increases due to having more elites to perturb. CMA-MAE’s performance is invariant
to the resolution of the archive.
3.3.8 Derivation of the Conversion Formula for the Archive Learning Rate
In this section, we derive the archive learning rate conversion formula α2 = 1 − (1 − α1)^r mentioned in Section 3.3.7,
where r is the ratio between archive cell counts, and α1 and α2 are archive learning rates for two archives A1 and A2.
Given an archive learning rate α1 for A1, we want to derive an equivalent archive learning rate α2 for A2 that results
in robust performance when CMA-MAE is run with either A1 or A2. A principled way to derive a conversion formula
for α2 is to look for an invariance property that affects the performance of CMA-MAE and that holds when CMA-MAE
generates solutions in archives A1 and A2.
Since CMA-MAE ranks solutions by f − fA, we wish for fA to increase at the same rate in the two archives. Since
fA(θ) = te, where te is the cell that a solution θ maps to, we select the average value of the acceptance thresholds te
over all cells in each archive as our invariant property.
We assume an arbitrary sequence of N solution additions θ1, θ2, ..., θN, evenly dispersed across the archive cells.
We then specify te as a function that maps k cell additions to a value te in archive cell e.§ Equation 3.12 then defines
the average value of te across the archive after N additions to an archive A with M cells.

(1/M) Σ_{i=1}^{M} te(N/M)   (3.12)
Then, Equation 3.13 defines the invariance we want to guarantee between archives A1 and A2.

(1/M1) Σ_{i=1}^{M1} te(N/M1) = (1/M2) Σ_{i=1}^{M2} te(N/M2)   (3.13)
In Eq. 3.13, we let M1 and M2 be the number of cells in archives A1 and A2, and we assume that M1 and M2 divide
N. To solve for a closed form of α2 subject to our invariance, we need a formula for the function te. Similar to
Lemma 3.3.5.5, we represent the function te as a recurrence relation after adding k solutions to cell e of an archive A.
te(0) = minf
te(k) = (1 − α)te(k − 1) + αf(θk) (3.14)
Next, we look to derive a closed form for te(k) for an archive A as a way to manipulate Equation 3.13. However,
solving for te(k) when f is an arbitrary function is difficult, because different regions of the archive will change at
different rates. Instead, we solve for the special case when f(θ) = C and minf < C, where C ∈ R is a constant scalar.
To solve for a closed form of the recurrence te(k), we leverage the recurrence unrolling method [112], allowing us to
guess the closed form in Equation 3.15.
§Here we abuse notation and view te as a function instead of a threshold value, for simplicity and to highlight the connection to the threshold te.
te(1) = (1 − α)te(0) + αC = (1 − α)minf + αC
te(2) = (1 − α)te(1) + αC = (1 − α)[(1 − α)minf + αC] + αC
      = (1 − α)^2 minf + (1 − α)αC + αC
te(3) = (1 − α)te(2) + αC
      = (1 − α)[(1 − α)^2 minf + (1 − α)αC + αC] + αC
      = (1 − α)^3 minf + (1 − α)^2 αC + (1 − α)αC + αC
...
te(k) = (1 − α)^k minf + Σ_{i=0}^{k−1} (1 − α)^i αC   (3.15)
We recognize the summation in Equation 3.15 as a geometric series. As 0 < α < 1, we rewrite the summation as
follows.

te(k) = (1 − α)^k minf + Σ_{i=0}^{k−1} (1 − α)^i αC
      = (1 − α)^k minf + αC [1 − (1 − α)^k] / [1 − (1 − α)]
      = (1 − α)^k minf + αC [1 − (1 − α)^k] / α
      = (1 − α)^k minf + C − C(1 − α)^k
      = (minf − C)(1 − α)^k + C
      = C − (C − minf)(1 − α)^k   (3.16)
Next, we prove that the closed form we guessed is the closed form of the recurrence relation.
Theorem 3.3.8.1. The recurrence relation te(0) = minf and te(k) = (1 − α)te(k − 1) + αC has the closed form
te(k) = C − (C − minf)(1 − α)^k, where 0 < α < 1 and minf < C.
Proof. We show the closed form holds via induction over cell additions k.
As a basis step, we show that te(0) = C − (C − minf)(1 − α)^0 = C − (C − minf) = minf.
For the inductive step, suppose after j insertions into the archive A in cell e our closed form holds. We show that
the closed form holds for j + 1 insertions.

te(j + 1) = (1 − α)te(j) + αC
          = (1 − α)[C − (C − minf)(1 − α)^j] + αC
          = C(1 − α) − (C − minf)(1 − α)^(j+1) + αC
          = C − αC + αC − (C − minf)(1 − α)^(j+1)
          = C − (C − minf)(1 − α)^(j+1)   (3.17)
As our basis and inductive steps hold, our proof is complete.
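The closed form is also easy to sanity-check numerically against the recurrence; the snippet below is only an illustrative check with arbitrary constants, not part of the derivation.

def threshold_recurrence(k, alpha, C, min_f):
    """Unroll t_e(k) = (1 - alpha) t_e(k - 1) + alpha C with t_e(0) = min_f."""
    t = min_f
    for _ in range(k):
        t = (1 - alpha) * t + alpha * C
    return t

def threshold_closed_form(k, alpha, C, min_f):
    """Closed form from Theorem 3.3.8.1."""
    return C - (C - min_f) * (1 - alpha) ** k

# The two expressions agree up to floating point error, e.g.:
assert abs(threshold_recurrence(25, 0.01, 1.0, -10.0)
           - threshold_closed_form(25, 0.01, 1.0, -10.0)) < 1e-9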
The closed form from Theorem 3.3.8.1 allows us to derive a conversion formula for α2 via our invariance formula
in Equation 3.13.
(1/M1) Σ_{i=1}^{M1} te(N/M1) = (1/M2) Σ_{i=1}^{M2} te(N/M2)
(M1/M1)[C − (C − minf)(1 − α1)^(N/M1)] = (M2/M2)[C − (C − minf)(1 − α2)^(N/M2)]
(C − minf)(1 − α1)^(N/M1) = (C − minf)(1 − α2)^(N/M2)
(1 − α1)^(N/M1) = (1 − α2)^(N/M2)
(1 − α1)^(M2/M1) = 1 − α2
α2 = 1 − (1 − α1)^(M2/M1)   (3.18)
We remark that our conversion formula is not dependent on the number of archive additions N. Although our
conversion formula assumes f to be a constant objective, we conjecture that the formula holds generally for a convex
objective f.
Conjecture 3.3.8.2. The archive learning rate conversion formula results in invariant behavior of CMA-MAE for two
arbitrary archives A1 and A2 with archive resolutions M1 and M2, for a convex objective f.
Our intuition is similar to the intuition behind Conjecture 3.3.5.7, where we assume the elite hypervolume hypothesis
holds [275]. At the beginning of the CMA-MAE search, fA is a constant function and CMA-MAE optimizes for
the global optimum, following the convergence properties of CMA-ES [128, 126]. Eventually, the cells around the
global optimum become saturated and the function f − fA forms a plateau around the global optimum. The invariance
described in Eq. 3.13 implies that fA1 and fA2 will increase at the same rate within the flat region of the plateau. Let
θp be an arbitrary solution in the plateau and θ′ be a solution on the frontier of the known hypervolume. The plateau of
each archive Ai expands when the solutions on the frontier of the elite hypervolume achieve a larger f(θ′) − fAi(θ′)
than the plateau's f(θp) − fAi(θp). We conjecture that the plateau will expand at the same rate in the two archives, as
fA1 and fA2 increase at the same rate in the plateau region, due to our invariance in Eq. 3.13.
We speculate that our conjecture explains why we observe invariant behavior across archive resolutions in the
experiments of Section 3.3.7, even though f is not a constant function in the linear projection and arm repertoire
domains.
3.3.9 Related Work
Quality Diversity Optimization. The predecessor to quality diversity optimization, simply called diversity optimization,
originated with the Novelty Search algorithm [178], which searches for a collection of solutions that are diverse in
measure space. Later work introduced the Novelty Search with Local Competition (NSLC) [179] and MAP-Elites [53,
202] algorithms, which combined single-objective optimization with diversity optimization and were the first QD
algorithms. Since then, several QD algorithms have been proposed, based on a variety of single-objective optimization
methods, such as Bayesian optimization [163], evolution strategies [47, 46, 91], differential evolution [43], and gradient
ascent [87]. Several works have improved selection mechanisms [245, 54], archives [92, 274, 252], perturbation
operators [275, 212], and resolution scaling [92, 51, 115].
QD with Gradient Information. Several works combine gradient information with QD optimization without leveraging
the objective and measure gradients directly. For example, in model-based QD optimization [97, 119, 36, 162, 185,
292, 98], prior work [232] trains an autoencoder on the archive of solutions and leverages the Jacobian of the decoder
network to compute the covariance of the Gaussian perturbation. Works in quality diversity reinforcement learning
(QD-RL) [221, 226, 211, 267] approximate a reward gradient or diversity gradient via a critic network, action space
noise, or evolution strategies and incorporate those gradients into a QD-RL algorithm.
Acceptance Thresholds. Our proposed archive learning rate α was loosely inspired by simulated annealing methods [19]
that maintain an acceptance threshold that gradually becomes more selective as the algorithm progresses. The notion of
an acceptance threshold is also closely related to minimal criterion methods in evolutionary computation [177, 29, 28,
258]. Our work differs by both 1) maintaining an acceptance threshold per archive cell rather than a global threshold
and 2) annealing the threshold.
3.3.10 Limitations and Future Work
Our approach introduced two hyperparameters, α and minf , to control the rate that f − fA changes. We observed
that an α set strictly between 0 and 1 yields theoretical exploration improvements and that CMA-MAE is robust with
respect to the exact choice of α. We additionally derived a conversion formula that converts an α1 for a specific archive
resolution to an equivalent α2 for a different resolution. However, the conversion formula still requires practitioners to
specify a good initial value of α1. Future work will explore ways to automatically initialize α, similar to how CMA-ES
automatically assigns internal parameters [123].
CMA-MAE’s DQD counterpart CMA-MAEGA sets a new state-of-the-art in DQD optimization. However,
observing its performance benefits required the more challenging LSI (StyleGAN2) domain. This highlights the need
for more challenging DQD problems to advance research in DQD algorithms, since many of the current benchmark
domains can be solved optimally by existing algorithms.
Quality diversity optimization is a rapidly growing branch of stochastic optimization with applications in generative
design [118, 98, 97], automatic scenario generation in robotics [90, 88, 85], reinforcement learning [221, 226, 211, 267],
damage recovery in robotics [53], and procedural content generation [113, 89, 292, 70, 165, 259, 244, 243, 21]. Our
paper introduces a new quality diversity algorithm, CMA-MAE. Our theoretical findings inform our experiments, which
show that CMA-MAE addresses three major limitations affecting CMA-ME, leading to state-of-the-art performance
and robustness.
Chapter 4
Searching Generative Models of Scenarios
4.1 Introduction
Algorithms that procedurally generate content often need to adhere to a desired style or aesthetics. For example,
generative adversarial networks (GANs) [110, 157] generate realistic looking images after training on a large dataset
of human specified examples. At the same time, for these algorithms to be useful in practice, they need to enable
generation of a diverse range of content, across a range of attributes specified by a human designer. For a GAN, this
requires either sifting through thousands of randomly generated examples, which is cost-prohibitive, or controlling the
GAN output by “steering” it in latent space towards a desired distribution, which is a challenging problem [140].
When desired attributes can be formulated as an objective, one approach is to explore the latent space using
derivative-free optimization algorithms such as CMA-ES [123]. Prior work [25] named this approach latent variable
evolution (LVE). Later work [280] proposed using GANs to automatically author Mario levels and demonstrated how
LVE can extract level scenes with specific attributes from latent space.
The LVE approach is limited to attributes that are easily specifiable as an objective. A human designer may not
know exactly what kind of content they want, but instead have some intuition on how they would vary content when
exploring GAN generated levels. For example, the designer may want to have levels that are of varying difficulty; while
it is hard to specify difficulty as an objective, a designer can choose from automatically generated levels of different
number of enemies or obstacles.
We call the above problem latent space illumination (LSI). Formally, given an objective function and additional
functions which measure different aspects of gameplay, we want to extract a collection of game scenes that collectively
satisfy all output combinations of the gameplay measures. For each output combination, the representative scene should
maximize the objective function.
Figure 4.1: Mario scenes returned by the CMA-ME quality diversity algorithm, as they cover the designer-specified
space of two level mechanics: number of enemies and number of tiles above a given height. The color shows the
percentage of the level completed by an A* agent, with red indicating full completion.
Quality diversity (QD) algorithms [229] are a class of algorithms designed to discover a diverse range of high-quality
solutions with several specialized variants designed to explore continuous search spaces.
Our goal in this chapter is twofold: First, we wish to find out whether QD algorithms are effective in illuminating the
latent space of a GAN, in order to generate high-quality level scenes with a diverse range of desired level characteristics,
while still maintaining stylistic similarity to human-authored examples. Second, we want to compare the state-of-the-art
QD algorithms in this domain and provide quantitative and qualitative results that illustrate their performance.
A large-scale experiment shows that the QD algorithms MAP-Elites, MAP-Elites (line) and CMA-ME significantly
outperform CMA-ES and random search in finding a diverse range of high-quality scenes. Additionally, CMA-ME
outperformed the other tested algorithms in terms of diversity and quality of the returned scenes. We show generated
scenes, which exhibit an exciting range of mechanics and aesthetics (Fig. 4.1). A user study shows that the diverse
range of level mechanics translates to different subjective ratings of each scene's difficulty and appearance, highlighting
the promise of quality diversity algorithms in generating diverse, high-quality content by searching the latent space of
generative adversarial networks.
4.2 Background
Procedural Content Generation. Procedural content generation (PCG) refers to creating game content algorithmically,
with limited human input [246]. Game content can be any asset (e.g., game mechanics, rules, dialog, models, etc)
required to realize the game for its players. Pioneering work in PCG dates back to the 1980s to address memory
limitations for storing large video game levels on computers. The growing interest in realistic graphics in the 1990s
necessitated the development of procedural modelling algorithms [250] to generate complex models such as trees
and terrain to ease the burden on graphic artists. Much PCG research in both industry and academia has focused
on generating playable levels. In general, the problem of generating content that fulfils certain constraints can be
approached by evolutionary solutions [270] or constraint satisfaction methods [251]. An emerging area of research
is PCG via machine learning (PCGML) which aims to leverage recent advancements in machine learning (ML) to
generate new content by treating existing human authored content as training data [263]. Previous work in PCGML has
enabled automatic generation of video game levels for the Super Mario Bros. using LSTMs [262], Markov Chains [255]
and probabilistic graphical models [117].
Two advancements in PCGML [280, 104] independently demonstrated the successful application of generative
adversarial networks (GANs) to generate playable video game levels in an unsupervised way from existing video game
level corpora. One approach [280] adapted the concept of latent variable evolution (LVE) [25] to extract Mario scenes
from the latent space of a GAN that targeted specific gameplay features. That work searched the latent space of the
GAN utilizing the popular Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [127] for latent variable inputs
that would make the GAN produce level scenes with desired properties. Scenes with targeted gameplay features were
obtained through carefully crafted single-objective functions, named fitness functions, that balanced weighted
distances from the desired gameplay properties of the generated scenes.
Quality Diversity Optimization. While the approach employed by prior work [280] demonstrated a promising synergy
between generative models and evolutionary computation for PCG, other works in PCG displayed the potential of
quality diversity (QD) to generate meaningfully diverse video game content [113]. Unlike traditional optimization
methods, QD algorithms aim to generate high quality solutions that differ across specified attributes. Consider the
example of generating Mario levels with specific properties. Instead of incorporating the number of enemies or floor
tiles into the fitness function, a QD algorithm can treat these measures as attributes. The QD algorithm still has the
objective of finding solvable Mario levels, but must find levels that contain all combinations of attributes (number of
enemies, percentage of floor coverage). Prior work [202] coined the term illumination algorithms for quality diversity
(QD) algorithms that create an organized mapping between solutions and their associated attributes, which are called
behavioral characteristics (BCs). After the QD algorithm generates an organized palette of scenes, stitching algorithms
can combine several scenes together to form a cohesive level [114].
Developed concurrently with our approach is CPPN2GAN [244], which generates full levels for both Super
Mario Bros and Zelda. The paper proposes optimizing the latent space of a GAN with a special type of encoding, a
compositional pattern producing network (CPPN) [257], which captures patterns with regularities. The paper introduces
a type of latent space illumination with a vanilla version of the quality diversity algorithm MAP-Elites [202], described
in the next section. It focuses on simultaneously searching several latent vectors at once to generate a full level created
by “stiching” together GAN-generated scenes. Instead, our focus is on assessing the performance of QD algorithms in
generating a variety of scenes with desired characteristics, and in measuring modern MAP-Elites variants that excel at
the exploration of continuous domains. Our work is also related with conditional generative models [120, 255, 227].
While it is possible to condition GANs on desired BCs, there is no guarantee that the generated scenes will have the
properties matching the conditioning input. Additionally, conditional generative models require retraining for each new
set of BCs a designer wishes to explore, where LSI can search the latent space of the same generative model without
retraining.
MAP-Elites. MAP-Elites [202] is a QD algorithm that searches along a set of explicitly defined attributes called
behavior characteristics (BCs). These attributes collectively form a Cartesian space named the behavior space, which
is tessellated into uniformly spaced grid cells. MAP-Elites maintains the highest performing solution for each cell in
behavior space (an elite) with the product of the algorithm being a diverse archive of high performing solutions. The
archive is initially populated with randomly sampled solutions. The algorithm then generates new solutions by selecting
elites from the archive at random and perturbing each elite with small variations. The objective of the algorithm is both
to expand the archive, maximizing the number of filled cells, and to maximize the quality of the elite within each cell.
How the behavior space is tessellated is the focus of a variety of recent algorithms [252, 92].
Figure 4.2: Ground truth scenes 1 (left) and 2 (right) for KL-divergence metric.
MAP-Elites (line). A common characteristic of many tasks is that high-performing solutions that exhibit diverse
behaviors share significant similarities in their “genotype”, that is, in their search space parameters. Therefore, prior
work [275] proposed a variational operator, called “Iso+LineDD”, which captures correlations between elites. When
generating a new solution, in addition to applying a random variation to an existing elite, the operator adds a second
random variation directed towards a second elite, essentially nudging the variation distribution towards other high
performing solutions. We denote MAP-Elites with this operator ME (line).
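A minimal sketch of the Iso+LineDD operator, using the parameter names from Section 4.4; the exact implementation details in [275] may differ.

import numpy as np

def iso_line_dd(elite_a, elite_b, sigma_1=0.02, sigma_2=0.2, rng=None):
    """Iso+LineDD variation: apply an isotropic Gaussian perturbation to
    elite_a and add a random step along the direction towards elite_b.
    Both elites are assumed to be NumPy arrays of search space parameters."""
    rng = np.random.default_rng() if rng is None else rng
    isotropic = sigma_1 * rng.standard_normal(elite_a.shape)
    directional = sigma_2 * rng.standard_normal() * (elite_b - elite_a)
    return elite_a + isotropic + directional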
CMA-ES. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a second-order derivative-free optimizer
for single-objective optimization of continuous spaces [123]. The algorithm belongs to a family of algorithms named
evolution strategies (ES), which specialize in optimizing continuous spaces by sampling a population of solutions,
called a generation of solutions, and gradually moving the population towards areas of highest fitness. CMA-ES models
the sampling distribution of the population as a multivariate normal distribution. The algorithm adjusts its sampling
distribution by ranking solutions based on their fitness and estimating a new covariance matrix that maximizes the
likelihood of future successful search steps.
CMA-ME. The Covariance Matrix Adaptation MAP-Elites (CMA-ME) [91] is a recent hybrid algorithm which
incorporates CMA-ES into MAP-Elites. The algorithm improves the efficiency with which new archive cells are
discovered and the overall quality of elites within the archive. CMA-ME maintains a number of individual CMA-ES-like instances, named emitters. We use a specific type of emitter named the improvement emitter, which was shown to
outperform MAP-Elites and ME (line) in the strategic card game Hearthstone [91]. Improvement emitters rank solutions
by prioritizing those that fill previously undiscovered cells in the archive. Solutions that belong to existing cells in
the map are subsequently ranked based on the improvement in fitness over existing cells. This enables improvement
emitters to dynamically adapt their goals based on feedback from how the archive changes.
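A simplified sketch of the two-stage ranking used by improvement emitters follows; the actual CMA-ME implementation applies this ranking inside the emitter's CMA-ES update, and the names here are illustrative.

def rank_improvement(candidates, archive):
    """Rank candidate indices for an improvement emitter.

    `candidates` is a list of (cell, fitness) pairs and `archive` maps
    occupied cells to the fitness of their current elite. Candidates that
    discover new cells come first (ordered by fitness); candidates landing
    in occupied cells follow, ordered by their fitness improvement."""
    def key(i):
        cell, fitness = candidates[i]
        if cell not in archive:
            return (0, -fitness)                   # stage 1: new cells
        return (1, -(fitness - archive[cell]))     # stage 2: improvement
    return sorted(range(len(candidates)), key=key)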
4.3 Mario Scene Evaluation
We used the Mario AI Framework* to evaluate each of the generated scenes. We evaluate each scene by treating it as a
playable level; actual levels are longer and can be generated by “stitching” together multiple scenes [114].
Following previous work [280, 12], we approximate playability of a scene by how far through the scene A* reaches.
Specifically, we define the “fitness” of a scene as the amount of progress made by an AI agent playing the scene (percentage of
completion in the horizontal direction). We use the A* agent that won the 2009 Mario competition.† We additionally
define three different types of behavioral characteristics (BCs), which allow for a diverse set of level mechanics.‡
Representation-Based. We define a set of BCs that capture stylistic aspects of the Mario scene’s representation, based
on the distribution of tiles. These BCs do not depend on the agent’s playthrough:
1. Sky tiles: These are game objects, e.g., blocks, question blocks, coins, that are above a certain height value. A
large number implies that there are many game elements above ground, and the player would need to jump to
higher tiles.
2. Number of enemies: A larger number of enemies generally results in higher difficulty and requires the player to
perform more jumps to navigate throughout the scene.
Agent-Based. We incorporate the agent-based BCs of previous work [165], which are computed after one playthrough
by the agent. The BCs are binary, representing whether the playthrough satisfied a given condition. This results in an
8-dimensional BC-space. The 8 conditions are: (1) performing a jump, (2) performing a high jump (height of jump is
above a certain threshold), (3) performing a long jump (horizontal distance is above a certain threshold), (4) stomping
on an enemy, (5) killing an enemy using a koopa shell, (6) having an enemy die because of falling out of the scene, (7)
collecting a mushroom, and (8) collecting a coin.
KL-Divergence. A common goal in procedural content generation is to generate scenes with different degrees of
stylistic similarity to human-designed examples. We use the tile pattern Kullback–Leibler divergence metric [189] to
measure the structural similarity between two Mario scenes. We picked two stylistically different human-designed
scenes from the Mario AI Framework, shown in Fig. 4.2, and we set the behavior characteristics to be the tile pattern
KL-divergence between the ground truth scene and generated scene, resulting in a 2-dimensional BC space.
*https://github.com/amidos2006/Mario-AI-Framework
†https://www.youtube.com/watch?v=DlkMs4ZHHr8
‡One could also combine the BCs from the three different types, e.g., have an archive with KL-divergence and number of enemies.
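A simplified sketch of the tile pattern KL-divergence computation is shown below; the window size and smoothing constant are illustrative choices, and the original metric in [189] specifies its own constants.

import math
from collections import Counter

def tile_pattern_kl(scene_a, scene_b, window=2, eps=1e-6):
    """Approximate tile-pattern KL divergence between two scenes.

    Each scene is a 2D list of tile ids; we compare the smoothed empirical
    distributions of all window x window tile patterns."""
    def pattern_counts(scene):
        h, w = len(scene), len(scene[0])
        counts = Counter()
        for r in range(h - window + 1):
            for c in range(w - window + 1):
                counts[tuple(tuple(scene[r + i][c:c + window])
                             for i in range(window))] += 1
        return counts

    counts_a, counts_b = pattern_counts(scene_a), pattern_counts(scene_b)
    patterns = set(counts_a) | set(counts_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    kl = 0.0
    for p in patterns:
        pa = (counts_a[p] + eps) / (total_a + eps * len(patterns))
        pb = (counts_b[p] + eps) / (total_b + eps * len(patterns))
        kl += pa * math.log(pa / pb)
    return kl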
4.4 Experiments
Our experiments compare the performance of random search, CMA-ES, MAP-Elites, MAP-Elites (line) and CMA-ME
on the problem of latent space illumination.
We ran each of the 5 algorithms for 20 trials, 10,000 evaluations each, for each of the three different BC combinations.
This resulted in a total of 300 trials. We ran all trials in parallel in a university cluster with multiple nodes running on
dual Intel Xeon L5520 processors. Each trial lasted approximately 7 hours.
GAN Model. We use a deep convolutional GAN (DCGAN) as in prior work [280], trained with the WGAN algorithm [193].
Following their implementation, we encode the training levels by representing each of the 17 different tile types by
a distinct integer, which is converted to a one-hot encoded vector, before being passed as input to the discriminator. We
pad each training level to a 64 × 64 matrix, and since there are 17 channels, one for each possible tile type, each input
scene to the discriminator is 17 × 64 × 64. For the generator, we set the size of the latent vector to be 32, resulting in a
32-dimensional continuous search space. We refer the reader to the study [280] for the details of the architecture.
We train the DCGAN with RMSprop for 5000 iterations, a learning rate of 5e−5, and a batch size of 32. The
discriminator iterates 5 times before the generator iterates once. We used 15 original levels from the Mario AI
competition framework for training.§ Fig. 4.2 shows scenes from two levels of the training data.
To evaluate the different search algorithms, we input the latent vector of size 32 to the generator, and we crop the
17 × 64 × 64 output to a 17 × 16 × 56 playable level for evaluation.
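As an illustrative sketch of this evaluation pipeline (the `generator` interface and the cropped corner are assumptions, not the exact code of the framework), a latent vector is decoded into a tile grid as follows.

import numpy as np

def decode_scene(generator, latent, height=16, width=56):
    """Decode a 32-dimensional latent vector into a tile-id grid.

    `generator` is assumed to return a 17 x 64 x 64 array with one channel
    per tile type; the output is cropped to the playable 16 x 56 region
    (which corner is kept is an assumption here) and each position takes
    the tile type with the highest activation."""
    channels = np.asarray(generator(latent))   # shape (17, 64, 64)
    cropped = channels[:, :height, :width]     # keep the playable region
    return cropped.argmax(axis=0)              # (16, 56) grid of tile ids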
Search Parameters and Tuning. We tuned each algorithm based on how well it covered the representation-based
behavior space and we then used the same parameters for all three behavioral characteristics. We set population size
λ = 17 and mutation power σ = 0.5 for CMA-ES. A single run of CMA-ME deploys 5 improvement emitters with
λ = 37. We set the mutation power for CMA-ME and MAP-Elites to σ = 0.2. For ME (line), we set the isotropic
mutation σ1 = 0.02 and the mutation for the directional distribution σ2 = 0.2. The initial population for MAP-Elites
and ME (line) was 100.
§https://github.com/amidos2006/Mario-AI-Framework/tree/master/levels/original
In random search, we generate solutions by directly sampling the GAN's latent space from the same distribution
that we used to train the generator network: a normal distribution with zero mean and variance equal to 1. We used the
same method to generate solutions for the initial population of MAP-Elites and ME (line).
Map Sizes. We performed an initial run of the experiment and observed the maximum and minimum values of
the behavioral characteristics covered by each algorithm. This provided a rough estimate of the range of each BC.
For the representation-based BCs, we set the range of sky tiles to [0,150] and the number of enemies to [0,25]. The
map size was 151 × 26, where each cell corresponded to an integer value of the BC. The eight agent-based binary
BCs form an eight-dimensional map of 2^8 = 256 cells. Finally, we set the KL-divergence ranges to [0, 4.5] for both
groundtruth levels, and the resolution of the map was 60 × 60.
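For reference, mapping a BC vector to an archive cell index with these ranges and resolutions can be sketched as follows; this is a simple uniform-grid discretization, and the actual implementation may differ.

def cell_index(bc_values, bc_ranges, resolution):
    """Map a vector of BC values to integer cell coordinates on a uniform grid.

    `bc_ranges` holds (low, high) per BC and `resolution` the number of cells
    per dimension; values outside the range are clipped to the boundary cells."""
    index = []
    for value, (low, high), cells in zip(bc_values, bc_ranges, resolution):
        frac = (value - low) / (high - low)
        index.append(min(cells - 1, max(0, int(frac * cells))))
    return tuple(index)

# Example: 40 sky tiles and 7 enemies in the 151 x 26 representation-based map.
print(cell_index((40, 7), ((0, 150), (0, 25)), (151, 26)))  # -> (40, 7)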
Metrics. We evaluate all five algorithms, random search, CMA-ES, MAP-Elites, ME (line) and CMA-ME, with respect
to the diversity and quality of solutions returned. For comparison purposes, we assign the solutions generated by
CMA-ES and random search to the grid cell corresponding to their BC values and populate a pseudo-archive.
Percentage of valid cells: This is the percentage of scenes in the archive returned by the algorithm that are completed
from start to end by the A* agent, which is equivalent to having a fitness of 1.0. This is an indication of the quality of
the solutions found.
Coverage: This is the percentage of cells in the archive produced by an algorithm, computed as the number of cells found
divided by the total map size. The measure indicates how much of the behavior space is covered.
QD-Score: The QD-Score metric was proposed in prior work [230] as the sum of fitness values of all elites in the
archive and has become a standard QD performance measure. The measure distills both the diversity and quality of
elites in the archive into a single value.
4.5 Results
Performance. Table 4.1 summarizes the performance of each algorithm. Fig. 4.3 shows improvement in QD-score over
evaluations for each algorithm, with 95% confidence intervals.
Representation-Based BCs Agent-Based BCs KL-Divergence
Algorithm Valid / All Coverage Valid / Found QD-Score Valid / All Coverage Valid / Found QD-Score Valid / All Coverage Valid / Found QD-Score
Random 8.35% 11.1% 75.3% 385.1 7.09% 8.9% 79.7% 20.2 5.10% 12.5% 40.8% 331.5
CMA-ES 7.44% 8.0% 93.0% 308.6 7.43% 8.3% 89.6% 19.8 4.11% 7.5% 54.8% 210.6
ME 15.15% 19.4% 78.1% 692.5 7.66% 8.8% 87.0% 20.4 9.98% 15.5% 64.4% 485.6
ME (line) 15.31% 18.9% 81.0% 682.7 7.06% 8.2% 86.1% 18.9 10.18% 15.4% 66.1% 488.0
CMA-ME 16.35% 21.5% 76.1% 776.8 7.90% 9.4% 84.0% 21.6 11.08% 17.4% 63.7% 551.3
Table 4.1: Results: Average percentage of cells with fitness 1.0 (Valid / All), percentage of cells found (Coverage),
percentage of cells found with fitness 1.0 (Valid / Found), and QD-score after 10,000 evaluations.
(a) Representation-based BCs (b) Agent-based BCs (c) KL-divergence
Figure 4.3: QD-Scores over time for each behavioral characteristic.
Figure 4.4: Archive for the KL-divergence behavioral characteristic metric.
First, we observe that all QD algorithms, i.e., MAP-Elites, ME (line) and CMA-ME outperform CMA-ES and
random search in the representation-based and KL-divergence BCs. This is expected, since CMA-ES optimizes only for
one objective, the playability of the scenes, rather than exploring a diverge range of level behaviors. Random search
works poorly; the reason is that we sample from the same distribution that we used for training the GAN, thus the
generated solutions follow the tile distribution of the training data, which covers only a small portion of the behavior
space.
Second, CMA-ME outperforms the other QD algorithms in the representation-based and KL-divergence BCs. This
matches previous work [91], where CMA-ME outperformed these algorithms in the Hearthstone strategic game domain.
We attribute this to the fact that CMA-ME benefits by sampling from a dynamically changing Gaussian (as in CMA-ES)
rather than a fixed distribution shape. Fig. 4.4 shows three example archives of elites for CMA-ME, MAP-Elites and
CMA-ES, illustrating the ability of CMA-ME to cover larger areas of the map.
Figure 4.5: Generated scenes using CMA-ME for small and large values of sky tiles and number of enemies.
Figure 4.6: Playable scenes with minimum (left) and maximum (right) sum value (6) of the 8 binary agent-based BCs.
We observe that ME (line) performs similarly to MAP-Elites. ME (line) relies on the assumption that different
elites in the archive have similar search space parameters. We estimated the similarity of the elite hypervolume as
defined in prior work [275], and found low mean values for the representation-based (0.60) and the KL-divergence
(0.58) maps, which explains the lack of improvement from the operator in this domain.
On the other hand, in the 8 binary agent-based BCs all algorithms perform similarly to random search. All of
the algorithms performed poorly on these BCs, where each algorithm discovers less than 10% of possible mechanic
combinations. The main reason lies in the way the A* agent plays the levels; the agent is designed to reach the right
edge of the screen as fast as possible, without caring much about its score. This forces the agent to avoid triggering
gameplay mechanics. For example, in Fig. 4.6(right) the agent rushes to the end without collecting the coins in the
beginning of the level. The same holds for the training data; the human-authored levels covered only 20 out of the
2^8 = 256 cells of the map, and there was no training level where the agent collected a mushroom or a coin. This makes
the task of finding levels that trigger these BCs even more challenging.
Figure 4.7: Generated scenes for small and large values of KL-divergence to each of the two groundtruth scenes.
Generated Levels. We demonstrate generated levels by the CMA-ME algorithm that illustrate its ability to generate a
diverse range of high-quality solutions.
Figure 4.5 shows four generated scenes from an archive generated by a single run of CMA-ME using the
representation-based BCs. We selected the scenes from the map that had extreme values of the two BCs, the number of
sky tiles and number of enemies. The scenes are significantly diverse, with the scene that maximizes each BC being
filled with enemies and having multiple tiles above ground. Despite the large number of sky tiles at the level in the
top-right, the agent finishes the scene without reaching most of them. This is a limitation of the representation-based
BCs, which evaluate a scene based on the distribution of tiles and not on the agent’s playthrough.
We address the above limitation with agent-based BCs. Fig. 4.6 shows two scenes generated by CMA-ME that
minimize and maximize the sum of the agent-based BC values. The first scene has 0 value for all BCs and the agent
simply runs a straight path towards the exit, while the second scene allows the agent to exhibit a variety of behaviors,
including different types of jumps, stomping on an enemy and killing an enemy with a shell.
Finally, Fig. 4.7 shows four scenes with small and large KL-divergence to each of the two groundtruth scenes in
Fig. 4.2. The scene that is stylistically similar to both groundtruths (bottom-left) combines ground tiles with gaps
that force the agent to jump. The top left level maximizes divergence with the first groundtruth scene and minimizes
divergence with the second; this results in not having any ground tiles. Interestingly, the scene in the top right maximizes
KL-divergence to both groundtruth scenes by having tile types and enemies unseen in any of the groundtruth scenes.
4.6 Conclusion
We explored the use of QD algorithms to search the latent space of trained generator networks, to create content that
has a diverse range of desired characteristics, while retaining the style of human-authored examples. In particular,
we described an implementation where the QD algorithms MAP-Elites, MAP-Elites (line) and CMA-ME were used
to search the latent space of a DCGAN trained on level scenes from Super Mario Bros. In this problem, CMA-ME
was superior to other tested algorithms in terms of coverage and QD-score, indicating that it finds a more diverse and
high-quality set of level scenes.
QD algorithms extract a collection of scenes in a single run, rather than just one scene returned by optimization-based
methods; their use is thus recommended when a collection of diverse, high-quality content is desired. We are excited
about extending this work to search the latent spaces of other generative models, such as variational autoencoders [66]
and generative pretraining models [39]. Finally, we are excited about combining our approach with intelligent trial and
error algorithms to create personalized levels [109].
Chapter 5
Constraining Scenarios via Mixed Integer Programming Repair
* We focus on the problem of automatically generating video game levels that are aesthetically similar to human-authored examples, while satisfying playability constraints. For example, we would like to generate a variety of different
Zelda levels; on one hand, we require these levels to be playable; they should include a character that the user can
control, a door to exit the level, and a key to open the door. At the same time, the levels should have an aesthetic appeal;
a level with a character, a key, and a door next to each other is technically playable, but does not have the appeal of a
level created by a human designer.
Addressing the research question of procedurally generating aesthetically appealing video game levels that satisfy
playability constraints is challenging. Using machine learning methods, level generators can be trained on existing
levels so as to learn to reproduce aspects of their style [263]. In particular, recent advancements in generative adversarial
networks (GANs) enable the creation of levels that are stylistically similar to human examples [280, 104]. At the same
time, many levels generated this way are not playable. Prior work [271] demonstrates that GANs frequently fail to
encode playability criteria.
One approach to adhering to playability constraints is to encode the search space of possible levels via constraint
programming (CP) [253] or answer set programming (ASP) [251] and then use a search algorithm to find a feasible
solution. It is not clear how to combine such methods with machine learning to reproduce a given style.
Rather than encoding constraints in the level generation process, we propose a generate-then-repair approach for
first generating levels from models trained on human-authored examples and then repairing them with minimum cost
edits to render them playable.
*Work co-led by Hejia Zhang at the ICAROS Lab.
Figure 5.1: An overview of our framework for generating aesthetically pleasing, playable game levels by using a
mixed-integer linear program to repair GAN-generated levels.
Figure 5.2: Example generated Zelda levels of different techniques (rows: Human, GAN, GAN+MIP, MIP random). The GAN+MIP framework repairs the GAN-generated levels rendering them playable, while capturing the spatial relationships between tiles exhibited in the
human-authored levels.
Our key insight is:
We can create playable levels that are aesthetically similar to human-authored examples, by repairing
GAN-generated levels using a mixed-integer linear program with encoded playability constraints.
Specifically, the framework first generates levels using a GAN trained with human-authored examples. The levels
are aesthetically similar to the training levels but may not be playable. We then repair the levels using a mixed-integer
linear program with playability constraints, which minimizes the number of edits required to render the level playable.
A key component of the framework is the edit distance metric, which we cast as a minimum cost network flow problem,
where we define a separate network for each object type in the level and “flows” represent changes between the
GAN-generated level and the level generated by the MIP solver.
Fig. 5.2 shows human-authored examples, as well as levels generated by the GAN and the proposed framework
GAN+MIP, with 50 human-authored levels as training data. We additionally include a baseline, MIP (random), where
instead of a GAN-generated level, we use as input to the MIP solver a level generated by independently sampling an
object type for each tile. This results in levels that are playable, but whose tiles do not exhibit the spatial relationships
seen in the human examples.
Our results show that we can generate a diverse range of playable levels that have an aesthetic appeal from a small
number of human-authored examples.
A limitation of our approach is that aesthetic appeal is subjective and we can only use proxy metrics, such as
the spatial relationship between tiles. However, we view this as an exciting step towards procedurally generating
aesthetically pleasing levels that satisfy playability constraints.
5.1 Problem Description
We formulate the problem of procedural video game level repair as a discrete optimization problem. We represent a
level as a space graph [246], where each node in the graph is associated with a region in the game and edges model
connections between regions. For instance, in The Legend of Zelda video game, the nodes are grid cells representing an
object, such as a wall, door or empty space, and the edges connect adjacent cells.
We then formulate the procedural generation problem as a matching problem, where each node is matched to an
available object type. We note that the vast majority of random matchings lead to unplayable levels. In Zelda, there is
only one player, who should reach a key and a door to exit the level. Walls in the perimeter prevent the player from
leaving a confined area. All such constraints need to be satisfied for a level to be playable.
Table 5.1: The visualization for the tiles in Zelda (tile types: Wall, Empty, Key, Exit door, Enemy, Player).
In addition to meeting criteria for playability, video game levels need to be aesthetically pleasing. They must look
interesting or engaging to human players. Similarly to aesthetically pleasing images, such considerations are not easily
formalized. Here, we formulate this problem as minimizing the number of edits on levels that are sampled from a
learned distribution of human-authored training examples.
5.2 Domain
We use a simplified version of The Legend of Zelda video game, implemented in the General Video Game Artificial
Intelligence (GVGAI) framework [225, 100, 224]. Nintendo first introduced The Legend of Zelda in 1986 for the
Nintendo Entertainment System (NES). The main character of the game, Link, must explore dungeons and solve puzzles
while simultaneously avoiding enemies within the level. In the GVGAI version, Link must navigate the environment to
obtain a key, then proceed to the exit door while avoiding enemies.
Fig. 5.2 (top) shows example human-authored Zelda levels, with Table 5.1 showing the different object types. Note
that all example human-authored levels are playable. In addition, we observe that the key is typically placed far away
from the door, requiring the player to navigate the level, encountering enemies on the way. This is an aesthetic quality
that is not a playability constraint, but one that we wish to replicate in the procedurally generated levels.
5.3 Approach
Our framework implements a generate-then-repair approach. For constructing levels adhering to an existing style,
we train a generative adversarial network (GAN) using a small number of human-authored examples. The generator
network of the GAN samples new levels using random noise as input. The framework passes the GAN-generated level
into the objective function of a mixed-integer linear program (MIP) that encodes domain-specific playability constraints.
The MIP acts as an editor, minimizing the number of corrections required to transform the GAN-generated level into a
playable level. This way, we combine the GAN's capability to retain the aesthetics of human-authored examples with
the MIP's capability to guarantee formal constraints, such as those ensuring playability.

Figure 5.3: DCGAN network for learning the distribution of Zelda game levels.
5.3.1 Deep Convolutional GAN
We use a Deep Convolutional GAN (DCGAN) with an architecture identical to the DCGAN from prior work [280]
(Fig. 5.3), trained with the WGAN algorithm [193]. This model was previously shown to successfully generate a
variety of levels for Super Mario Bros.
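For concreteness, the sketch below shows a minimal DCGAN-style generator in PyTorch. The layer widths, the 16x16 output resolution, and the softmax over tile types are illustrative assumptions rather than the exact architecture of Fig. 5.3 or the WGAN training procedure we use.

# A minimal PyTorch sketch of a DCGAN-style generator for tile-based levels.
# Layer sizes are illustrative assumptions, not the exact architecture of Fig. 5.3.
import torch
import torch.nn as nn

class LevelGenerator(nn.Module):
    def __init__(self, latent_dim=32, num_tile_types=8, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            # latent_dim x 1 x 1 -> (base_channels*4) x 4 x 4
            nn.ConvTranspose2d(latent_dim, base_channels * 4, 4, 1, 0),
            nn.BatchNorm2d(base_channels * 4),
            nn.ReLU(inplace=True),
            # -> (base_channels*2) x 8 x 8
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1),
            nn.BatchNorm2d(base_channels * 2),
            nn.ReLU(inplace=True),
            # -> num_tile_types x 16 x 16 (cropped to the level size afterwards)
            nn.ConvTranspose2d(base_channels * 2, num_tile_types, 4, 2, 1),
        )

    def forward(self, z):
        logits = self.net(z.view(z.size(0), -1, 1, 1))
        return torch.softmax(logits, dim=1)  # per-cell distribution over tile types

# Sampling a level: take the most likely tile type in each cell.
g = LevelGenerator()
z = torch.randn(1, 32)
level = g(z).argmax(dim=1)  # shape (1, 16, 16)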
5.3.2 Mixed Integer Linear Program Formulation
To model reachability as a MIP, let G(V, E) be the space graph, where V is a list of nodes and E is a list of edges. We
assume a set O of distinct object types. For each object type o, we define a vector of binary variables o ∈ {0, 1}^{|V|},
with each element o_v indicating whether object type o occupies vertex v. For example, in The Legend of Zelda there are
eight object types: wall w, empty space m, key k, door d, player p, and three types of enemies e^1, e^2, e^3.
Node Uniqueness Constraint. We require that each node contains exactly one type of object:

\[
w_v + m_v + k_v + d_v + e^1_v + e^2_v + e^3_v + p_v = 1, \quad \forall v \in V \tag{5.1}
\]
Reachability Constraints. Playable levels need to satisfy reachability constraints. For instance, the player needs to be
able to reach the key object. We cast the reachability problem as a flow problem [107]. For each edge (u, v) ∈ E we
define a non-negative integer variable f(u, v) ∈ Z≥0 representing flow from u to v. We define a target set T ⊆ O of
object types that need to be reachable by a source set S ⊆ O. In Zelda, the door and key need to be reachable by the
player, i.e., T = {k, d} and S = {p}. Then, we introduce f^s_v ∈ Z≥0 variables as supplies and f^t_v ∈ {0, 1} as demands
for each node. We show the network flow equations below, where the equations apply for all nodes v ∈ V:

\[
f^s_v \le \sum_{x \in S} |V| \cdot x_v \tag{5.2}
\]
\[
\sum_{x \in T} x_v = f^t_v \tag{5.3}
\]
\[
f^s_v + \sum_{u:(u,v) \in E} f(u, v) = f^t_v + \sum_{u:(v,u) \in E} f(v, u) \tag{5.4}
\]
\[
f(u, v) + \sum_{x \in B} |V| \cdot x_u \le |V|, \quad \forall u : (u, v) \in E \tag{5.5}
\]
\[
f(u, v),\; f^s_v \in \mathbb{Z}_{\ge 0} \tag{5.6}
\]
\[
f^t_v \in \{0, 1\} \tag{5.7}
\]
Eq. 5.2 limits supply flow to vertices that have an object in the source set S. For instance, in The Legend of Zelda
we let a node v be connected to the source when v contains the player p. Based on Eq. 5.1, p_v will be 1 and all the other
object type variables (e.g., w_v, m_v) will be 0. Since p ∈ S, the sum on the right hand side is equal to |V| and f^s_v for that
node can take values between 0 and |V|. On the other hand, if a node is associated with a tile that contains a non-source
object, e.g., a wall, p_v for that node will be 0, forcing the sum on the right side of Eq. 5.2 to be 0, so f^s_v will be exactly 0.
Following a similar reasoning, Eq. 5.3 guarantees that node v will generate a unit of demand if it belongs to the
target set T. Eq. 5.4 specifies the flow conservation constraints, which propagate the demands from the target nodes
back to the source nodes. The constraints guarantee that all target nodes will be reached by at least one source node.
Finally, to ensure that no paths cross impassable objects (e.g., walls, the door), Eq. 5.5 defines a set of impassable
object types B and blocks flow leaving nodes assigned to these object types.
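As an illustration of how these constraints can be encoded, the sketch below builds the node uniqueness and reachability constraints for a small grid with the open-source PuLP library; the grid size and object-type names are assumptions, and our experiments instead use the CPLEX solver [139].

# A minimal sketch (PuLP, not the CPLEX model used in our experiments) of the
# node-uniqueness and reachability constraints of Eqs. 5.1-5.7 for a small grid.
import pulp

W, H = 4, 4
V = [(x, y) for x in range(W) for y in range(H)]
E = [((x, y), (x + dx, y + dy)) for (x, y) in V
     for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
     if 0 <= x + dx < W and 0 <= y + dy < H]
objects = ["wall", "empty", "key", "door", "player"]
S, T, B = {"player"}, {"key", "door"}, {"wall", "door"}

prob = pulp.LpProblem("zelda_repair", pulp.LpMinimize)
assign = pulp.LpVariable.dicts("assign", (objects, V), cat="Binary")
flow = pulp.LpVariable.dicts("flow", E, lowBound=0, cat="Integer")
supply = pulp.LpVariable.dicts("supply", V, lowBound=0, cat="Integer")
demand = pulp.LpVariable.dicts("demand", V, cat="Binary")

for v in V:
    # Eq. 5.1: exactly one object type per node.
    prob += pulp.lpSum(assign[o][v] for o in objects) == 1
    # Eq. 5.2: supply only at source nodes (the player).
    prob += supply[v] <= len(V) * pulp.lpSum(assign[o][v] for o in S)
    # Eq. 5.3: one unit of demand at every target node (key, door).
    prob += demand[v] == pulp.lpSum(assign[o][v] for o in T)
    # Eq. 5.4: flow conservation.
    prob += (supply[v] + pulp.lpSum(flow[(u, w)] for (u, w) in E if w == v)
             == demand[v] + pulp.lpSum(flow[(u, w)] for (u, w) in E if u == v))
for (u, w) in E:
    # Eq. 5.5: no flow may leave a node occupied by an impassable object.
    prob += flow[(u, w)] + len(V) * pulp.lpSum(assign[o][u] for o in B) <= len(V)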
Edit Distance Objective. Our goal is to minimize the number of edits (moving, adding, or removing an object
type) that the MIP solver applies to the input level to make it satisfy the playability constraints. We cast this problem as
a minimum cost network flow problem, where we generate a network for every object type. The key intuition is that if
a node contains an object in the input level, it is a source node that supplies flow; the supplied flow can be absorbed
either by a node with an object of the same type in the MIP solution, which generates a unit of demand, or by a “waste”
variable that indicates deletion of an object in the original input level. The objective is to minimize the cost of the flow
that satisfies supplies and demands.
Similarly to the reachability constraints network, we specify demand variables f^t_v ∈ {0, 1} for each node and flow
variables f(u, v) for the edges. We additionally define waste variables r^t_v ∈ Z≥0.
We let c_v be a constant that is equal to 1 if node v contained the object type of the network in the input (GAN-generated) level and 0 otherwise. Eq. 5.8 specifies the flow conservation constraints, while Eq. 5.9 limits demands to
locations containing the object type o that the flow network is associated with. The equations apply for all nodes v ∈ V.
Eq. 5.10 ensures that supplies from initial object locations match demands by the new object locations or by object
deletions.
Eq. 5.11 describes the MIP objective to minimize the sum of the costs for the flow network of each object type.
The final objective value is computed by summing over all object types (Eq. 5.11). r^t_v and f(u, v) are different for
each object type since they represent flows in different networks. C_d represents the cost for deleting an object and C_m
represents the cost for moving an object one tile. Note that we do not model the cost for adding an object, as an object
must be deleted for an addition to occur. In our experiments, we selected C_d = 10 and C_m = 1, so that it is much
cheaper to move an object one cell than to delete it.

\[
c_v + \sum_{u:(u,v) \in E} f(u, v) = r^t_v + f^t_v + \sum_{u:(v,u) \in E} f(v, u) \tag{5.8}
\]
\[
f^t_v \le o_v \tag{5.9}
\]
\[
\sum_{v \in V} c_v = \sum_{v \in V} f^t_v + \sum_{v \in V} r^t_v \tag{5.10}
\]
\[
\text{minimize} \quad \sum_{o \in O} \left( \sum_{v \in V} C_d r^t_v + \sum_{u,v:(u,v) \in E} C_m f(u, v) \right) \tag{5.11}
\]
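The edit distance networks can be added to the same program. The sketch below, continuing the hypothetical PuLP variables from the previous sketch, builds one per-object-type flow network and returns its contribution to the Eq. 5.11 objective; the input_level mapping is an assumed representation of the GAN-generated level.

# A minimal sketch (again with PuLP, reusing `assign`, `V`, and `E` from the
# previous sketch) of the edit-distance objective of Eqs. 5.8-5.11 for a single
# object type. `input_level` maps each vertex to the object type of the input level.
import pulp

C_DELETE, C_MOVE = 10, 1

def add_edit_distance(prob, obj_type, input_level, assign, V, E):
    """Add one per-object-type flow network and return its cost expression."""
    # c_v = 1 iff the input level places obj_type at v (a constant).
    c = {v: int(input_level[v] == obj_type) for v in V}
    flow = pulp.LpVariable.dicts(f"editflow_{obj_type}", E, lowBound=0, cat="Integer")
    demand = pulp.LpVariable.dicts(f"demand_{obj_type}", V, cat="Binary")
    waste = pulp.LpVariable.dicts(f"waste_{obj_type}", V, lowBound=0, cat="Integer")
    for v in V:
        # Eq. 5.9: demand only where the repaired level places obj_type.
        prob += demand[v] <= assign[obj_type][v]
        # Eq. 5.8: flow conservation (supply c_v enters, demand/waste absorb).
        prob += (c[v] + pulp.lpSum(flow[(u, w)] for (u, w) in E if w == v)
                 == waste[v] + demand[v]
                 + pulp.lpSum(flow[(u, w)] for (u, w) in E if u == v))
    # Eq. 5.10: total supply equals total demand plus deletions.
    prob += pulp.lpSum(c.values()) == pulp.lpSum(demand.values()) + pulp.lpSum(waste.values())
    # Contribution of this object type to the Eq. 5.11 objective.
    return (C_DELETE * pulp.lpSum(waste.values())
            + C_MOVE * pulp.lpSum(flow.values()))

# The full objective sums the cost expressions over all object types:
#   prob += pulp.lpSum(add_edit_distance(prob, o, input_level, assign, V, E)
#                      for o in objects)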
Domain-specific Constraints. The above constraints and edit distance objective can generalize across a variety
of platform games, such as The Legend of Zelda and Pac-Man. However, each game has additional game-specific
constraints, which can be easily encoded in the MIP formulation. The Legend of Zelda requires that exactly one key,
one door, and one player are present in the level, enemies cover less than 60% of the available space to ensure the level
is not too difficult for the player, and the outer perimeter of the level needs to be filled with wall objects. We include
these additional constraints in the MIP formulation.

Model           Playable levels   Duplicated levels   Playable and unique levels
GAN             24.3%             46.9%               12.9%
MIP (random)    100%              0%                  100%
CESAGAN [271]   58%               37.6%               not reported
GAN + MIP       100%              14.9%               85.1%
Table 5.2: Percentage of generated playable and unique levels with each technique.

Figure 5.4: The distribution of the average Hamming (left) and edit (right) distance between levels from the same set.

Figure 5.5: (a-c) Distribution of different game tiles for the human examples, the levels generated by the GAN and the
levels generated by the GAN and edited with the MIP solver. (d) Distribution of minimum paths from key to door.
5.4 Empirical Evaluation
To evaluate our method we train the GAN on a corpus of 50 human-authored levels from Zelda for 24,000 epochs.
We then generate 1000 levels using the GAN+MIP framework by sampling Gaussian noise as input to the generator
network and compare against three baselines: (1) levels generated by the GAN without the MIP editing, (2) levels generated by
the MIP solver by minimizing the edit distance to a randomly generated level, and (3) performance results from a recent
study [271], which uses an adapted self-attention GAN named CESAGAN. For the randomly generated levels, we
sample an object for each tile from a multinomial distribution, where the probabilities match the object frequency
counts in the human-authored levels. CESAGAN captures non-local dependencies between game objects. It was trained
on 45 out of the same 50 human-authored levels as the proposed GAN+MIP framework. Note that CESAGAN tries to
learn level constraints through bootstrapping and does not require explicitly encoded constraints.
Diversity of Playable Levels. We evaluate the generated levels by testing the playability criteria specified in prior
work [271]. Levels are also measured for duplicates by counting the number of unique levels produced by the generator
and reporting the percentage of additional levels that are duplicates of the unique levels. Table 5.2 shows our results.
Results show that most of the GAN-generated levels are not playable, while the proposed framework generates a
large number of unique levels that are all playable, since they satisfy the constraints encoded in the MIP. Moreover, we
notice that of the duplicate levels generated by the proposed framework, 77% were generated directly by the GAN.
For the remaining duplicates, the input levels were stylistically similar; for instance, two distinct levels had a
missing boundary tile in different locations, but they were repaired to be identical by filling each missing boundary. On
average it took the MIP solver [139] 0.13 seconds to fix generated levels, where each MIP program consisted of 6858
variables and 2777 constraints.
We further analyze the diversity of the generated levels by using the average Hamming distance metric [271]
(number of differing tiles) between each generated level and all other playable levels, as well as the proposed edit
distance metric. We randomly selected 243 playable levels out of the 1000 generated ones for the proposed framework,
in order to match the number of playable levels generated by the GAN. Fig. 5.4 shows that, while the human-authored
levels have higher diversity, the proposed framework generates levels that are more diverse than the GAN levels. The GAN
learns to generate levels from the distribution of human examples, but the generated playable levels are only a small
part of the learned distribution. On the other hand, GAN+MIP repairs all GAN-generated levels, capturing a larger part
of the training distribution.
We note that the levels generated by MIP-random were all unique and had higher diversity metrics (not shown in
the figures), because of the stochastic nature of the random level generation process.
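For reference, the average Hamming distance metric used above can be computed as in the sketch below, assuming each level is stored as an integer array of tile ids.

# A minimal sketch of the average Hamming distance diversity metric, assuming
# each level is an integer array of tile-type ids with identical shape.
import numpy as np

def average_hamming_distance(levels):
    """Mean number of differing tiles between each level and every other level."""
    levels = np.stack(levels)                     # shape: (n_levels, height, width)
    n = len(levels)
    total = 0.0
    for i in range(n):
        # Count differing tiles between level i and all other levels.
        diffs = (levels != levels[i]).reshape(n, -1).sum(axis=1)
        total += diffs.sum() / (n - 1)
    return total / n

# Example with three tiny 2x2 "levels".
levels = [np.array([[0, 1], [2, 3]]),
          np.array([[0, 1], [2, 0]]),
          np.array([[3, 1], [2, 3]])]
print(average_hamming_distance(levels))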
Figure 5.6: Edit distance example. (Left) An unplayable level generated by the GAN network. (Center) The output
of the MIP solver that minimizes the Hamming distance to the input level. (Right) The output of the MIP solver that
minimizes the edit distance.
Aesthetic Appeal of Generated Levels. We assess whether the generated levels are aesthetically similar to the
human-authored ones. Following previous work [271], we compute the distribution of different game tiles for the
human, the GAN and the GAN+MIP levels (Fig. 5.5a-5.5c). We observe that GAN+MIP matches the human
distribution more closely, since GAN+MIP repairs all GAN-generated levels, capturing a larger part of the training distribution.†
We also compute the tile-pattern Kullback–Leibler (KL) divergence [189] between the distribution of the tile
patterns of the generated levels and the human-authored levels. We first extract a set of tile patterns by sliding a 2x2
fixed-size window over each level. We then empirically estimate the probability of each pattern for each set of levels
and then compute the KL divergence between the two distributions.
The KL divergence for the 243 playable GAN levels is 0.272. We randomly selected 243 playable levels from the GAN+MIP
set, to match the number of GAN levels; their KL divergence was 0.108, indicating higher similarity to the human examples.
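The tile-pattern KL divergence can be computed as in the following sketch; the 2x2 window matches the description above, while the smoothing constant for unseen patterns is an assumption (prior work [189] specifies the exact estimator).

# A minimal sketch of the tile-pattern KL divergence between two sets of levels,
# assuming levels are 2D integer arrays of tile ids.
import numpy as np
from collections import Counter

def pattern_counts(levels, window=2):
    counts = Counter()
    for level in levels:
        h, w = level.shape
        for i in range(h - window + 1):
            for j in range(w - window + 1):
                counts[level[i:i + window, j:j + window].tobytes()] += 1
    return counts

def tile_pattern_kl(generated, reference, window=2, epsilon=1e-5):
    p, q = pattern_counts(generated, window), pattern_counts(reference, window)
    patterns = set(p) | set(q)
    p_tot, q_tot = sum(p.values()), sum(q.values())
    kl = 0.0
    for pat in patterns:
        pp = (p.get(pat, 0) + epsilon) / (p_tot + epsilon * len(patterns))
        qq = (q.get(pat, 0) + epsilon) / (q_tot + epsilon * len(patterns))
        kl += pp * np.log(pp / qq)
    return kl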
Comparison with MIP (random). Since the MIP (random) baseline generates unique, playable levels that follow the
tile distribution of the human-authored levels, what is the benefit of using a generative adversarial network?
GANs learn spatial relationships between objects in the level. Specifically, inspection of the human-authored levels
in Fig. 5.2 shows that human designers tend to place the key far away from the door, so that the player needs to explore
the level before exiting it. This aesthetic quality is lost in many of the random generated levels.
We support this argument in Fig. 5.5d, which shows the distribution of the length of the minimum paths, computed
with a standard A* algorithm [239], from the key to the door. We observe that the distribution is shifted towards larger
paths for the GAN+MIP levels, compared to the MIP (random) levels. The average length of the minimum path was 8.6
for GAN+MIP, compared to 7.8 for MIP (random) and 12.5 for the human.
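The minimum-path metric can be computed as in the sketch below. Our experiments use A* [239]; on an unweighted grid, the breadth-first search shown here returns the same path length. The tile characters are assumptions.

# A minimal sketch of the key-to-door minimum path metric, using breadth-first
# search on the grid; the tile characters are illustrative assumptions.
from collections import deque

def min_key_to_door_path(level, passable=('.', 'k', 'd', 'p', 'e')):
    """level: list of strings; returns shortest path length from key 'k' to door 'd'."""
    h, w = len(level), len(level[0])
    start = next((r, c) for r in range(h) for c in range(w) if level[r][c] == 'k')
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if level[r][c] == 'd':
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in seen \
                    and level[nr][nc] in passable:
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None  # door unreachable from the key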
†We exclude the baseline MIP (random) from the analysis, since the objects in the input levels are sampled by the tile distribution of the
human-authored levels, therefore the levels follow closely, albeit not exactly because of the editing, the human distribution.
5.5 Edit Distance Objective
A simpler optimization objective for the MIP program would be the Hamming distance (number of different tiles) rather
than the edit distance between the input and the generated level. However, this metric does not capture the spatial
relationships between tiles. For instance, in Fig. 5.6 minimizing the Hamming distance results in replacing the enemy
on the left hand side with the wall. In the edit distance metric of Eq. 5.11, the cost of deleting an enemy (Cd = 10) is
larger than the cost of moving it (Cm = 1), therefore the solver chooses to move the enemy and a neighboring wall tile
so that they exchange positions. This preserves the level topology by retaining the third enemy.
5.6 End-to-End Training
We explore integrating playability constraints as an additional layer in the GAN network. Recent advances in differentiable
optimization [287, 80] have allowed integrating discrete optimization problems into deep learning models trained with
gradient descent. We include a differentiable MIP program [80] as an additional layer in the GAN network and train the
GAN-MIP in an end-to-end manner by passing to the discriminator the levels generated by the generator after they are
repaired by the MIP solver.
Training the network with all playability constraints is computationally expensive. On the other hand, we observed
that the main reason that GAN-generated levels are rendered unplayable is the violation of numerical constraints, such
as number of players, doors and keys, which matches results from previous work [271]. For computational efficiency,
we encoded only these constraints in the MIP program, used the simpler Hamming distance objective and applied
LP-relaxation [287], which significantly sped up the training process.
We trained the resulting network for 5000 epochs, which lasted 55 hours on an Intel Core i7-8700K 3.7GHz
processor. The generator network generated 747 unique and playable game levels out of 1000, which is a significant
improvement to the initial GAN model. The average length of the minimum path from the key to the door was 10.1,
and the KL divergence between the tile distribution of the generated levels and the human-authored levels was 0.055.
These preliminary results indicate the promise of integrating MIP constraints in the GAN training process.
Figure 5.7: GAN-generated Pac-Man level (left) and the same level repaired by the MIP solver (right).
5.7 Beyond Zelda Levels
The proposed framework also applies to other domains, such as Pac-Man game levels. Modeling playability constraints
in Pac-Man required small modifications to the Zelda MIP. First, the dynamics of Pac-Man allow for the character
to leave the left side of the screen to enter the right. We modify the space graph to have additional edges between
the left column’s nodes and the right column’s nodes. Similarly, we add edges between the top row’s nodes and the
bottom row’s nodes. Pac-Man levels are required to have no dead ends. To model this in the MIP, we require that all
free space objects are adjacent to at least two neighbors (in graph theory terms: the node is not a leaf). Fig. 5.7 (left)
shows an example level generated by the DCGAN, trained over 45 human-edited examples. The level has dead-ends
and the wraparound on the edges of the screen is incorrect. Fig. 5.7 (right) shows the repaired level by a MIP program
that minimizes the edit distance to the previous level while satisfying the no-dead-end and wraparound constraints, in
addition to requiring the player to be able to reach all pellets and enemies in the level. We note that the MIP solver [139]
took 3.22 seconds on average to fix the generated levels. The repairs took longer than the Zelda domain as the Pac-Man
domain has more constraints and a larger space graph. Each MIP program for Pac-Man consisted of 65352 variables
and 26252 constraints, compared to 6858 variables and 2777 constraints for Zelda.
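The two Pac-Man-specific modifications can be sketched as follows, again with PuLP and assumed variable names: wraparound edges in the space graph and a no-dead-end constraint requiring every free tile to have at least two free neighbors.

# A minimal sketch (PuLP, assumed variable names) of the Pac-Man modifications:
# wraparound edges in the space graph and the no-dead-end constraint.
import pulp

W, H = 28, 14  # illustrative grid size, not the exact Pac-Man dimensions
V = [(x, y) for x in range(W) for y in range(H)]

def neighbors(x, y):
    # Wraparound: leaving the left column enters the right column, and the top
    # row connects to the bottom row.
    return [((x + dx) % W, (y + dy) % H) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]

# Modified space graph; the reachability constraints would be built on E as in Zelda.
E = [(v, n) for v in V for n in neighbors(*v)]

prob = pulp.LpProblem("pacman_repair", pulp.LpMinimize)
free = pulp.LpVariable.dicts("free", V, cat="Binary")   # 1 if tile is walkable
# ... other object-type variables and reachability constraints as in Zelda ...

for (x, y) in V:
    # No dead ends: a free tile must have at least two free neighbors.
    # If free[(x, y)] = 0 the constraint is vacuously satisfied.
    prob += pulp.lpSum(free[n] for n in neighbors(x, y)) >= 2 * free[(x, y)]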
5.8 Generality and Limitations
In terms of expressive power, any satisfiability (SAT) program can be modelled as a MIP. However, not all constraint
programs (CP) can be modelled as a mixed integer linear program due to the linearity requirements. What MIPs lack
in generality they make up for in solver performance. Modern MIP solvers can solve programs with millions of
variables and constraints, generally orders of magnitude larger than what general CP solvers handle. Moreover, problems like flow
have good LP relaxations, allowing for the subprogram to be solved in polynomial time.
While further research is needed in this direction, we argue that many of the PCG methods currently being modelled
as CP can be modelled as MIPs, allowing for larger levels to be solved by more efficient solvers. For example, in Zelda
we may want to generate levels with two doors, where the player must not be able to reach the second door without
passing through the first door. The problem can be modelled as an (s, t)-cut problem, which (like flow) can be modelled
as a linear program. Our goal is to require a 0 cost cut between the player and second door. We link the player to the
source vertex and the second door to the sink vertex. Variables d(u, v) ∈ Z≥0 are decision variables marking
whether edge (u, v) should be in the cut. Then we add constraints that enforce d(u, v) to be zero for edges leaving free
space nodes, forcing the cut to only include edges leaving blocking nodes (i.e., the walls or the first door).
More complex reachability constraints can also be modelled. Prior work [137] demonstrated a constraint programming
method for modeling path constraints between fixed points in a level. The CP required that a player could reach
enough resources (e.g., ammo, armor) to complete the level. To model this constraint as a MIP, we could model the
reachability as flow with costs rather than plain flow, where costs represent obtaining resources. Then we require that the cost
of the path from source to sink is constrained within a specific range. Finally, aesthetic constraints like symmetry can
be modelled by adding constraints for each pair of nodes on opposite sides of a level, requiring each object assignment
variable o_v to equal its counterpart across the line of symmetry.
While MIPs can be used as powerful modeling tools, applying these techniques to level repair requires expertise in
MIP modeling. Common problems, such as reachability, can be modelled from their graph-based representation and
converted to MIP constraints within our framework to ease the burden of designers. Further research is needed to model
games with complex physics for character movement. Examples include platformer games like Mario and Sonic the
Hedgehog. For these games, heuristic modeling, such as requiring that two platforms are close enough for the character to
jump between them, is needed.
5.9 Related Work
Two works [280, 104] presented the first methods applying GANs to the problem of automatic video game level
generation. To assure that generated levels satisfy specific criteria, one approach [280] adapted latent variable
evolution (LVE) from prior work [25] to search the generative space induced by the generator network with CMA-ES [127, 123]. The other work [104] assured that levels met necessary criteria through a generate-and-test methodology.
Later work [271] showed that GANs can fail to capture logical constraints of the video game levels and invented
a bootstrapping method that incorporates generated playable levels back into a continually expanding training data
set. Another approach [254] introduced a Markov chain method for guaranteeing aspects of the layout in Mario.
A generate-and-test method guaranteed reachability constraints by running A* agents on generated levels. As an
alternative to GANs, WaveFunctionCollapse [161] generates levels that match example data by compiling the data into
constraints that can be used to generate levels of similar visual style. However, note that such approaches cannot model
complex playability constraints specified by a user.
Several authors present machine learning approaches to satisfying level constraints. However, these approaches
make no guarantees about successfully generating or repairing levels to satisfy all constraints. Prior work [160] proposed
using discriminator networks, trained on positive and negative examples, to guide WaveFunctionCollapse towards
playable levels. Another work [142] proposed using autoencoders as a repair method for broken levels by passing
broken components through an autoencoder trained on valid levels. However, each of these approaches assumes that the
deep learning model can capture complex logical constraints, which often require specialized models [284].
As an alternative to procedural content generation via machine learning (PCGML), there exist several methods
for declarative modeling of procedurally generated content. Prior work [251] presented a declarative method of PCG
allowing the generative space of possible levels to be encoded as an answer set program. Another work [253] presented
an interactive constraint programming method for allowing humans to coauthor Mario levels with the generative method.
Prior work [137] presented a constraint programming method for modeling path constraints between two fixed points
in a space graph. Their method modelled paths as a system of linear equations, which we note is equivalent to flow
conservation constraints used in our method (linear programming is a generalization of linear systems of equations).
Our method can therefore be thought of as a generalization of their approach that allows for endpoints of the path
to be placed dynamically and can model reachability across a set of object types. Constraints can also be encoded
in a multi-objective evolutionary algorithm, either as part of the objective function, or separately, e.g., by dividing a
population of solutions into feasible and infeasible individuals [166, 256]. The two-population constraint handling
approach can also be used together with quality diversity algorithms [184, 165] to generate a diverse range of levels.
Our work benefits from recent advancements combining machine learning with traditional optimization methods.
Recent works have introduced quadratic programming [6], linear programming [287], and satisfiability [284] solvers
as layers of deep neural networks. Related methods incorporate submodular optimization [287] as a layer in a neural
network.
5.10 Conclusions
Limitations. Our work is limited in many ways. Ensuring reachability constraints requires the agent dynamics be
modeled as a finite graph, with vertices representing discrete regions in the level and edges indicating neighboring
vertices. Levels that satisfy playability constraints may still be unplayable in practice, for instance if difficulty is too
high. Additionally, the objective metrics for aesthetic similarity to human examples are only approximate metrics.
Evaluating playability and aesthetic appeal with user studies is a natural extension of this work.
Implications. We have presented a generate-then-repair framework for constructing levels using models trained
over human-authored examples and repairing the levels with a mixed-integer linear program. Our editing method is
agnostic to how levels are generated; we are excited to explore the limits of our approach by repairing different types of
procedurally generated content (levels, objects, enemies) that need to satisfy explicitly defined constraints.
Chapter 6
A General Framework for Searching Over Complex Environments
6.1 Introduction
* When humans and robots coordinate well, they time their actions precisely and efficiently and alter their plans
dynamically, often in the absence of verbal communication. Evaluation of the quality of coordination has focused not
only on task efficiency but on the fluency of the interaction [135]. Fluency refers to how well the actions of the agents
are synchronized, resulting in coordinated meshing of joint activities.
A closely related, important aspect of human-robot teaming is workload assignment. Human factors research
has shown that too light or too heavy a workload can affect human performance and situational awareness [219]. The
robot's perceived contribution to the team is a crucial metric of fluency [135], and human-robot teaming experiments
found that the degree to which participants were occupied affected their subjective assessment of the robot as a
teammate [108].
To achieve fluent human-robot coordination, previous work [135] enabled robots to reason over the mental states
and actions of their human partners, by building or learning human models and integrating these models into decision
making. While the focus has been on the effect of these models on human-robot coordination, little emphasis has been
placed on the environment that the team occupies.
Our thesis is that changing the environment can result in significant differences between coordination behaviors,
even when the robot runs the same coordination algorithm. We thus advocate for considering diverse environments
when studying the emergent coordination behaviors of human-robot teams.
*Work co-led by Sophie Hsu and Yulun Zhang at the ICAROS Lab.
Figure 6.1: An overview of the framework for procedurally generating environments that are stylistically similar to
human-authored environments. Our environment generation pipeline enables the efficient exploration of the space of
possible environments to procedurally discover environments that differ based on provided metric functions.
Manually creating environments that show a diverse range of coordination behaviors requires substantial human
effort. Furthermore, as robotic systems become more complex, it becomes hard to predict how these systems will act in
different situations and even harder to design environments that elicit a diverse range of behaviors.
This highlights the need for a systematic approach for generating environments. Thus, we propose a framework for
automatic environment generation, drawing upon insights from the field of procedural content generation in games [246,
89, 291]. The framework automatically generates environments which (1) share design similarities to human-authored
environments provided as training data, (2) are guaranteed to be solvable by the human-robot team, and (3) result in
coordination behaviors that differ according to provided metrics (i.e. team fluency or workload).
In this chapter, we study our framework in the collaborative game Overcooked [35], an increasingly popular domain
for researching the coordination of agent behaviors. In this domain, a human-robot team must work together to cook
and serve food orders in a shared kitchen. In the context of Overcooked, our framework generates kitchen environments
that cause the human-robot team to behave differently with respect to workload, fluency, or any other specified metric.
Fig. 6.1 provides an overview of our framework. First, we train a generative adversarial network (GAN) [110]
with human-authored kitchens as examples. The GAN learns to generate kitchens that share stylistic similarities to
the human-designed kitchens. However, GANs frequently generate kitchens which are, for the human-robot team,
impossible to solve. For example, ingredients or the serve station may be unreachable by the agents, or the kitchen
may not have a pot. To guarantee the kitchen is solvable, mixed-integer linear programming (MIP) [291] edits the
kitchen with a minimum-cost repair. By guaranteeing domain-specific constraints, the GAN+MIP pipeline forms a
generative space of viable kitchens that can be sampled through the latent codes of the GAN. We then search the
latent space directly with a state-of-the-art quality diversity algorithm, Covariance Matrix Adaptation MAP-Elites
(CMA-ME) [91], to discover kitchens that cause diverse agent behaviors with respect to specified coordination metrics.
Generated kitchens are added to an archive organized by the coordination metrics, and feedback from how the archive
is populated helps guide CMA-ME towards kitchens with underexplored metric combinations.
Evaluation of our framework in simulation shows that the generated environments can affect the coordination of
human-robot teams that follow precomputed jointly optimal motion plans, as well as of teams where the robot reasons
over partially observable human subtasks. In an online user study, we show that our generated environments result
in very different workload distributions and team fluency metrics, even when the robot runs the same algorithm in all
environments.
Overall, we are excited to highlight the role that environments play in the emergent coordination behaviors of
human-robot teams, and to provide a systematic approach for procedurally generating high-quality environments that
induce diverse coordination behaviors.
6.2 Background
6.2.1 Human-Aware Planning.
In the absence of pre-coordination strategies, robots coordinate with human teammates by reasoning over human
actions when making decisions. While many works have studied human-aware planning [4, 37, 170, 190, 30], most
relevant to this work are POMDP frameworks, where the robot observes human actions to infer the internal state of
a human. POMDP-based models have enabled communication with unknown teammates [14], inference of human
preference [210] and human trust [40] in human-robot collaboration, and inference of human internal state [242] in
autonomous driving applications.
Since the exact computation of a POMDP policy is computationally intractable [218], researchers have proposed
several approximation methods. One such approximation is the QMDP, where the robot estimates its current actions
based on the current belief and the assumption of full observability at the next time step [186]. Though the robot does
not take information-gathering actions in this approximation, QMDP has been shown to achieve good performance in
domains where the user continuously provides inputs to the system, such as in shared autonomy [148].
6.2.2 Procedural Content Generation.
Procedural content generation (PCG) refers to algorithmic, as opposed to manual, generation of content [246]. A
growing research area is PCG via machine learning (PCGML) [263], where content is generated with models trained on
existing content (e.g., [262, 255, 117]).
Our work builds on the recent success of GANs as procedural generators of video game levels [280, 104]. However,
generating video game levels is not as simple as training a GAN and then sampling the generator since many generated
levels are unplayable. Previous work [280] addressed this by optimizing directly in latent space via Latent Variable
Evolution (LVE) [25]. Specifically, the authors optimize with the Covariance Matrix Adaptation Evolution Strategy
(CMA-ES) [123] to find latent codes of levels that maximize playability and contain specific characteristics (e.g., an
exact number of player jumps) in the game Super Mario Bros.
However, game designers rarely know exactly which properties they want the generated levels to have. Later work
proposed Latent Space Illumination [89, 244] by framing the search of latent space as a quality diversity (QD) [38]
problem instead of a single-objective optimization problem. In addition to an objective function, the QD formulation
permits several measure functions which form the behavior characteristics (BCs) of the problem. The authors generated
levels that maximized playability but varied in measurable properties, i.e., the number of enemies or player jumps.
Their work showed CMA-ME [92] outperformed other QD algorithms [202, 275] when directly illuminating latent
space. In the case where the objective and measure functions are differentiable, recent work shows that differentiable
quality diversity (DQD) [87] can significantly improve search efficiency.
As stated above, GAN-generated environments, including those generated with LSI, are frequently invalid. For
instance, in Overcooked, they may have more than one robot or it may be impossible for players to reach the stove.
Previous work [291] proposed a mixed-integer linear programming (MIP) repair method for the game The Legend
of Zelda, which edits the input levels to satisfy formal playability constraints. The authors formulate the repair as a
minimum edit distance problem, and the MIP repairs the level with the minimum total cost of edits possible.
Our work integrates the GAN+MIP repair method with Latent Space Illumination. This allows us to move the
objective of LSI away from simply generating valid environments to generating environments that maximize or minimize
team performance. Additionally, while previous work on LSI [244, 89] generated levels that were diverse with respect
to level mechanics and tile distributions, e.g., number of objects, we focus on the problem of diversity in the agent
behaviors that emerge from the generated environments.
6.2.3 Overcooked.
Our coordination domain is based on the collaborative video game Overcooked [215, 216]. Several works [35, 281]
created custom Overcooked simulators as a domain for evaluating collaborative agents against human models [35] and
evaluating decentralized multi-agent coordination algorithms [281].
We use the Overcooked AI simulator [35], which restricts the game to two agents and a simplified item set. In this
version, the goal is to deliver two soup orders in a minimum amount of time. To prepare a soup, an agent needs to pick
up three onions (one at a time) from an onion dispenser and place them in the cooking pot. Once the pot is full and 10
timesteps have passed, the soup is ready. One of the agents then needs to retrieve a dish from the dish dispenser, put the
soup on the dish, and deliver it to the serving counter (Fig. 6.2).
6.3 Approach
6.3.1 Overview.
Our proposed framework consists of three main components: 1) A GAN which generates environments and is trained
with human-authored examples. 2) A MIP which edits the generated environments to apply domain-specific constraints
that make the environment solvable for the desired task. 3) A quality diversity algorithm, CMA-ME, which searches the
latent space of the GAN to generate environments that maximize or minimize team performance, but are also diverse
with respect to specified measures.

Figure 6.2: Overcooked environment, with instructions for how to cook and deliver a soup.
6.3.2 Deep Convolutional GAN.
Directly searching over the space of possible environments can lead to the discovery of unrealistic environments.
To promote realism, we incorporate GANs into our environment generation pipeline. Prior work [291, 89] in PCG
demonstrates that in the Super Mario Bros. and Zelda domains, GANs generate video game levels which adhere to the
design characteristics of their training dataset.

Figure 6.3: Example Overcooked environments authored by different methods (rows: Human, GAN, GAN+MIP, MIP random). The environments generated with the GAN+MIP approach are solvable by the human-robot team, while having design similarity to the human-authored environments.

Figure 6.4: Architecture of the GAN network (the generator maps a 32-dimensional latent vector through BatchNorm+ReLU layers to a Tanh output; the discriminator uses LeakyReLU layers and a Sigmoid output).
6.3.3 Mixed-Integer Program Repair.
While GAN-generated environments capture design similarities to their training dataset, most GAN-generated environments are unsolvable: GANs represent levels by assigning tile-types to locations, and this allows for walkable regions to
become disconnected or item types to be missing or too frequent. In general, GANs struggle to constrain the generative
environment space to satisfy logical properties [271].
To compensate for the limitations of GANs, we build upon the generate-and-repair approach proposed in previous
work [291], which repairs environments with mixed-integer linear programs (MIPs) that directly encode solvability
constraints. To ensure that repairs do not deviate too far from the GAN-generated environment, the MIP minimizes the
edit distance between the input and output environments. The result is an environment similar to the GAN-generated
one that satisfies all solvability constraints.
In Overcooked, we specify MIP constraints that (1) bound the number of each tile type (i.e. starting agent locations,
counters, stoves, floor), (2) ensure that key objects are reachable by both players and (3) prevent the agents from
stepping outside the environment.
The third row of Fig. 6.3 shows example environments generated by the GAN+MIP approach. We observe that the
environments allow execution of the Overcooked game, while appearing similar to the GAN levels in the second row.
In contrast, the fourth row (MIP-random) shows environments generated from random tile assignments passed as inputs
to the MIP. While all environments are solvable, they are stylistically different than the human-authored examples: each
environment appears cluttered and unorganized.
6.3.3.1 Mixed-Integer Linear Program Formulation.
We adapt the problem formulation from [291] for repairing The Legend of Zelda levels to Overcooked. Since the
Overcooked environments require different constraints to guarantee environment solvability for the human-robot team,
our exact formulation differs in the type of objects and domain-specific constraints. For replicability and completeness
we provide the entire MIP formulation.
To generate solvable environments, we first formulate the space graph [246] that governs agent motion in possible
environments. Let G = (V, E) be a directed space graph where V is the vertex set and E is the edge set. Each vertex in
V represents a location an object will occupy. Now consider the movement dynamics of each agent (human or robot)
where each agent can move up, down, left, or right. Each edge (i, j) ∈ E represents possible motion between location i
and location j for an agent if no object impedes their motion.
To generate an environment, we solve a matching problem between object types O and locations V. In the simplified
Overcooked AI environment there are 8 object types and 15 × 10 = 150 different tile locations. If unconstrained
further, there are 8^150 ≈ 2.9 · 10^135 different environments possible. Object types O include the human h, the robot r,
countertops c, empty space (floor) e, serve points s, dish dispensers d, onion dispensers n, and pots (with stoves) p.
To formulate the matching in the MIP, we create a vector of binary decision variables for each pair of object type
o ∈ O and location v ∈ V in the space graph. For example, if variable s_v were assigned to 1, then vertex v in the space
graph would contain a serve point. Assigning s_v to 0 means that vertex v does not contain a serve point. Finally, we
constrain each vertex to contain exactly one object type:

\[
h_v + r_v + c_v + e_v + s_v + d_v + n_v + p_v = 1, \quad \forall v \in V \tag{6.1}
\]
6.3.3.2 Solvability Constraints.
While the above formulation ensures that a feasible solution to the MIP results in an Overcooked environment, the
generative space of environments must be further constrained to ensure each generated environment is solvable by the
human-robot team.
Importantly, agents must be able to reach key objects in the environment. For example, the human must have an
unobstructed path to the stove, dish dispenser, ingredients, and serve point.
We model the reachability problem as a flow problem [107]. Network flow can be modelled as a linear program,
and as a result we can incorporate flow constraints into our MIP formulation. To model flow, we create non-negative
integer variables f(u, v) ∈ Z≥0 for each edge e = (u, v) ∈ E in the space graph G.
Now consider special object types S ⊆ O, for source object types, and T ⊆ O, for sink object types. We require
that a path from a source object exists to each sink object. Specifically, we require that a path exists from the human
h to all empty space, serve points, dish dispensers, onion dispensers, pots, and the robot by setting S = {h} and
T = {e, s, d, n, p, r}. Note that if we allow the human to reach the robot, then the robot can reach all other objects in T.†
However, we must also require that the path is unobstructed. Therefore, each path must not pass through countertops
or other objects that impede movement. Let B ⊆ O be the set of all object types that can impede movement. In
Overcooked, we set B = {c, s, d, n, p}. To guarantee that we do not pass through a location with an object of type B,
we will restrict any flow from leaving vertices assigned an object type in B. By restricting flow from leaving blocking
objects instead of entering them, we allow for flow to reach key objects T that are also blocking objects B.
†This holds only if the free space does not form a line graph, but empirically we found this constraint to be sufficient for large environments.
To complete our flow modeling, for each vertex v ∈ V we create non-negative supply variables, f^s_v ∈ Z≥0, and
demand variables, f^t_v ∈ Z≥0. These variables are enough to define a flow network between S and T:

\[
f^s_v \le \sum_{x \in S} |V| \cdot x_v \tag{6.2}
\]
\[
f^t_v = \sum_{x \in T} x_v \tag{6.3}
\]
\[
f^s_v + \sum_{u:(u,v) \in E} f(u, v) = f^t_v + \sum_{u:(v,u) \in E} f(v, u) \tag{6.4}
\]
\[
f(u, v) + \sum_{x \in B} |V| \cdot x_u \le |V|, \quad \forall u : (u, v) \in E \tag{6.5}
\]
Equation (6.2) ensures that there is supply flow only in vertices in the space graph where a location v ∈ V is
assigned an object type from S. Note that multiple units of flow can leave source locations as we wish to reach many
object types, but no more than |V | locations exist with objects. Equation (6.3) creates exactly one unit of demand if
location v ∈ V is assigned an object from T. Equation (6.4) is a flow conservation constraint and ensures that flow
entering a location v ∈ V equals the flow leaving v. Equation (6.5) ensures that no flow leaves a location v ∈ V that is
assigned a blocking object.
In addition to reachability constraints, we introduce domain-specific constraints on the frequencies of objects of
each type and ensure that neither agent can step outside the environment. First, we require that all locations on the
border of the environment must be a blockable object type from B, and the environment contains exactly one robot r
and human h. Next, each environment requires at least one instance of a serve point s, an onion dispenser n, a dish
dispenser d, and a pot p to make it possible to fulfill food orders. Finally, we upperbound the number of serve points
s, onion dispensers n, dish dispensers d, and pots p to reduce the complexity of the environment regarding runtime
planning:
\[
1 \le \sum_{v \in V} s_v \le 2 \qquad 1 \le \sum_{v \in V} n_v \le 2 \qquad
1 \le \sum_{v \in V} d_v \le 2 \qquad 1 \le \sum_{v \in V} p_v \le 2 \tag{6.6}
\]
\[
\sum_{v \in V} s_v + \sum_{v \in V} n_v + \sum_{v \in V} d_v + \sum_{v \in V} p_v \le 6 \tag{6.7}
\]
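As a sketch, the frequency constraints of Eqs. 6.6 and 6.7 can be encoded as follows, reusing the assumed PuLP assignment variables from the Chapter 5 sketches; the object-type names are placeholders.

# A minimal sketch (PuLP, assumed variable names) of the Overcooked
# object-frequency constraints of Eqs. 6.6-6.7.
import pulp

def add_frequency_constraints(prob, assign, V):
    limited = ["serve", "onion_dispenser", "dish_dispenser", "pot"]
    for obj in limited:
        # Eq. 6.6: between one and two of each key object type.
        prob += pulp.lpSum(assign[obj][v] for v in V) >= 1
        prob += pulp.lpSum(assign[obj][v] for v in V) <= 2
    # Eq. 6.7: at most six such objects in total, to limit planning complexity.
    prob += pulp.lpSum(assign[obj][v] for obj in limited for v in V) <= 6
    # Exactly one human and one robot starting location.
    prob += pulp.lpSum(assign["human"][v] for v in V) == 1
    prob += pulp.lpSum(assign["robot"][v] for v in V) == 1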
6.3.3.3 Objective.
The constraints of our MIP formulation make any valid matching a solvable environment. To make the MIP a repair
method, we introduce a minimum edit distance objective from a provided input environment Ki (in Overcooked a
kitchen) to the repaired environment Kr our MIP generates.
We define an edit as moving an object through a path in the space graph or changing which object type occupies a
location. We wish to consider moving objects first, before changing the object type. Therefore, we permit different
costs for different edit operations, with moving having a smaller cost than changing object type. A minimum cost edit
repair discovers a new environment Kr that (1) minimizes the sum of costs for all edits made to convert Ki
to Kr, and
that (2) satisfies all solvability constraints.
Following previous work [291], we formalize minimum cost repair as a minimum cost matching problem. Intuitively,
we construct a matching between objects at locations in environment Ki and objects at locations in environment Kr.
Instead of considering all pairs of object locations between Ki and Kr, we construct a matching as paths through the
space graph G from all objects in Kr to all objects in Ki. Constructing matchings as paths allows us to assign costs
based on the length of the path and corresponds to moving each object.
Formally, we model our minimum cost matching as a minimum cost network flow problem‡ for each object type.
Consider creating a flow network for object type o ∈ O. First, we create a supply indicator c_v (a constant). We assign
c_v = 1 if and only if Ki assigns an object of type o to location v and c_v = 0 otherwise. We then create a demand variable
f^t_v ∈ {0, 1} and a waste variable r^t_v ∈ Z≥0 for each vertex v ∈ V, where the waste variable indicates deletion of an object, as in Chapter 5.
‡Note that these are separate networks from the one defined for the reachability problem.
\[
f^t_v \le o_v \tag{6.8}
\]
\[
c_v + \sum_{u:(u,v) \in E} f(u, v) = r^t_v + f^t_v + \sum_{u:(v,u) \in E} f(v, u) \tag{6.9}
\]
\[
\sum_{v \in V} c_v = \sum_{v \in V} f^t_v + \sum_{v \in V} r^t_v \tag{6.10}
\]

Equation (6.8) guarantees only vertices assigned objects of type o have demands. Equation (6.9) ensures flow
conservation for each vertex and equation (6.10) ensures that supplies and demands are equal.
The edit distance objective becomes minimizing the cost of deleting (C_d = 20) and moving (C_m = 1) objects:

\[
\sum_{o \in O} \left( \sum_{v \in V} C_d r^t_v + \sum_{u,v:(u,v) \in E} C_m f(u, v) \right) \tag{6.11}
\]
6.3.3.4 MIP Implementation
We implement our MIP by interfacing with IBM's CPLEX library [139]. Each MIP consists of 8850 variables and 3570
constraints to repair a 15 × 10 Overcooked environment.
6.3.4 Latent Space Illumination.
While the GAN generator and MIP repair collectively form a generative space of solvable environments, simply
sampling the generative space is not efficient at generating diverse environments.
We address this issue by formulating the problem as a Latent Space Illumination (LSI) problem [89] defined below.
Solving the LSI problem allows us to extract environments that are diverse with respect to fluency metrics while still
maximizing or minimizing team performance.
6.3.4.1 Problem Formulation
LSI formulates the problem of directly searching the latent space as a quality diversity (QD) problem. For quality, we
provide a performance metric f that measures team performance on the joint task, e.g., time to completion or number of
tasks completed. For diversity, we provide metric functions which measure how environments should vary, e.g., the
distribution of human and robot workloads. These metrics, defined in QD as behavior characteristics (BCs), form a
Cartesian space known as a behavior space. In the MAP-Elites [53, 202] family of QD algorithms, the behavior space is
partitioned into N cells to form an archive of environments. Visualizing the archive as a heatmap allows researchers to
interpret how performance varies across environments inducing different agent behaviors.
LSI searches directly for GAN latent codes z. After simulating the agents on the generated environment, each code
z maps directly to a performance value f(z) and a vector of behaviors b(z). We discuss different performance metrics
and BCs in section 6.5.
The objective of LSI is to maximize the sum of expected performance values f:

\[
M(z_1, ..., z_N) = \max \sum_{i=1}^{N} \mathbb{E}[f(z_i)] \tag{6.12}
\]

In Eq. 6.12, z_i refers to the latent vector occupying cell i. We note that the human and robot policies and the
environment may be stochastic, therefore we estimate the expected performance over multiple trial runs.
6.3.4.2 CMA-ME for Latent Space Illumination
We choose Covariance Matrix Adaptation MAP-Elites (CMA-ME) [91], a state-of-the-art QD algorithm, to solve the
LSI problem. CMA-ME outperformed other quality diversity algorithms when illuminating the latent space of a GAN
trained to generate Super Mario Bros. levels [89].
CMA-ME combines MAP-Elites with the adaptation mechanisms of CMA-ES [123]. To generate new environments,
latent codes z are sampled from a Gaussian N (µ, C) where each latent code z corresponds to an environment. After
generating and repairing the environment with the GAN and MIP, we simulate the agents in the environment and
compute agent performance f and behavior characteristics b. The behavior b then maps to a unique cell in the archive.
We compare our new generated environment with the existing environment of that cell and replace the environment if
the new environment has a better f value. The distribution N (µ, C) is finally updated based on how the archive has
changed, so that it moves towards underexplored areas of behavior space.
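As a rough sketch of how this LSI loop could be set up with the pyribs library (which implements CMA-ME), the snippet below samples latent codes, generates and repairs an environment for each, simulates the agents, and inserts the results into a grid archive. The gan_generator, mip_repair, and simulate_agents functions, the latent dimension, and the BC ranges are placeholders for the components described in this chapter, and pyribs API details vary slightly across library versions.

```python
# Hedged sketch of latent space illumination with CMA-ME via pyribs.
import numpy as np
from ribs.archives import GridArchive
from ribs.emitters import EvolutionStrategyEmitter
from ribs.schedulers import Scheduler

LATENT_DIM = 32  # illustrative GAN latent size

archive = GridArchive(
    solution_dim=LATENT_DIM,
    dims=[13, 13, 5],                    # cells per BC, e.g., diff. in ingredients/plates/orders
    ranges=[(-6, 6), (-6, 6), (-2, 2)],  # illustrative BC ranges
)
emitters = [EvolutionStrategyEmitter(archive, x0=np.zeros(LATENT_DIM),
                                     sigma0=0.5, ranker="2imp", batch_size=30)]
scheduler = Scheduler(archive, emitters)

for itr in range(1000):
    latents = scheduler.ask()                       # candidate latent codes z ~ N(mu, C)
    objs, measures = [], []
    for z in latents:
        env = mip_repair(gan_generator(z))          # generate, then repair for solvability
        f, bcs = simulate_agents(env, n_trials=5)   # average over stochastic trial runs
        objs.append(f)
        measures.append(bcs)
    scheduler.tell(objs, measures)                  # archive insertion + CMA-ES-style update
```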
6.4 Planning Algorithms
We consider two planning paradigms: (1) where human and robot follow a centralized joint plan and (2) where the
robot reasons over the partially observable human subtask.
Centralized Planning. Here human and robot follow a centralized plan specified in the beginning of the task. We
incorporate the near-optimal joint planner of previous work [35], which pre-computes optimal joint motion plans
for every possible start and goal location of the agents, and optimizes the motion plan costs with an A∗ planning
algorithm [129].
6.4.1 Human-Aware Planning.
6.4.1.1 Robot Planner.
We examine the case where the robot is not aware of the subtask the human is currently aiming to complete, e.g., picking
up an onion. We model the robot as a QMDP planner, wherein the human subtask is a partially observable variable.
To make the computation feasible in real-time, we only use the QMDP to decide subtasks, rather than low-level
actions. To move to the location of each subtask, the robot follows a motion plan precomputed in the same way as in
the centralized planning. The motion plan assumes that both agents move optimally towards their respective subtasks.
The QMDP planner adapts to the human: it observes the human low-level motions to update its belief over the
human subtasks, and selects a subtask for the robot that minimizes the expected cost to go.
Figure 6.5: Human subtask state machine. The first element in the tuple is the object held by the simulated human; the
second element is the subtask the human aims to complete.
We specify the QMDP as a tuple {S, A, T, Ω, O, C}:
• S is the set of states consisting of the observable and partially observable variables. The observable variables
include the robot and human’s held object, the number of items in a pot, and the remaining orders. The
non-observable variable is the human subtask.
• A is a set of robot subtasks that include picking up or dropping onion, dishes and soups.
• T : S × A → Π(S) is the transition function. We assume that the human does not change their desired subtask
until it is completed and the environment is deterministic. Once a subtask is completed, the human chooses the
next feasible subtask with uniform probability.
• Ω is a set of observations. An observation includes the state of the world, e.g., number of onions in the pot, the
human position and the current low-level human action (move up, down, left, right, stay, interact).
• O : S → Π(Ω) is the observation function. Given a human subtask, the probability of an observation is
proportional to the change caused by the human action in the cost of the motion plan to that subtask. This makes
subtasks more likely when the human moves towards their location.
• C : S × A → R is the immediate cost. It is determined as the cost of the jointly optimal motion plan for the
human and robot to reach their respective subtasks from their current positions.
Every time the human completes a subtask, the robot initializes its belief over all feasible subtasks with a uniform
distribution, and updates its belief by observing the world after each human action. The feasible subtasks are determined
by modeling the human subtask evolution with a state machine, shown in Fig. 6.5.
The best action is selected by choosing the robot subtask with the lowest expected cost-to-go Q(s, a) = E[C(s, a) + V(s′)], given the current belief on the human subtask.
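The following sketch illustrates the belief update and subtask selection described above. The data structures and the cost and value functions are placeholders, not the dissertation’s implementation; the key ideas are that the observation likelihood grows when a human action reduces the motion-plan cost to a candidate subtask, and that the robot picks the subtask minimizing the expected cost under its current belief.

```python
# Illustrative QMDP-style belief update and robot subtask selection.
import numpy as np

def update_belief(belief, human_subtasks, cost_before, cost_after):
    """Observation likelihood of each candidate human subtask is proportional to
    how much the last human action reduced the motion-plan cost to that subtask."""
    likelihood = np.array([max(cost_before[g] - cost_after[g], 1e-3)
                           for g in human_subtasks])
    belief = belief * likelihood
    return belief / belief.sum()

def select_robot_subtask(belief, human_subtasks, robot_subtasks, joint_cost, value_to_go):
    """Pick the robot subtask with the lowest expected cost-to-go,
    Q(s, a) = E[C(s, a) + V(s')], under the current belief over human subtasks."""
    best_a, best_q = None, float("inf")
    for a in robot_subtasks:
        q = sum(b * (joint_cost(g, a) + value_to_go(g, a))
                for g, b in zip(human_subtasks, belief))
        if q < best_q:
            best_a, best_q = a, q
    return best_a
```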
6.4.1.2 Human Planner
We selected a rule-based human model from previous work [35], which myopically selects the highest priority subtask
based on the world state. The model does not reason over a horizon of subtasks and does not account for the robot’s
actions. Empirically, we found the model to perform adequately most of the time when users choose actions quickly.
6.5 Environments
We performed four different experiments to demonstrate that our proposed framework generates a variety of environments that result in a diverse set of coordination behaviors.
In all experiments we use the same performance metric f, which is a function of both the number of completed
orders and the time at which each order was completed.
6.5.1 Workload Distributions with Centralized Planning.
We generate environments that result in a broad spectrum of workload distributions, such as environments where only
one agent does all the work and environments where both agents contribute evenly to the task. Both agents execute a
precomputed centralized joint plan.
To assess differences in workloads, we specify as BCs the differences (robot minus human) in the number of actions
performed for each object type: number of ingredients (onions) held, number of plates held, and number of orders
delivered.
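A minimal sketch of how these workload BCs can be computed from a simulated playthrough; the event-count dictionary format is an assumption for illustration.

```python
# Workload BCs: signed differences (robot minus human) in per-object-type counts.
def workload_bcs(robot_counts, human_counts):
    """Each *_counts dict records the number of ingredients (onions) held,
    plates held, and orders delivered by that agent during the playthrough."""
    return (
        robot_counts["ingredients"] - human_counts["ingredients"],
        robot_counts["plates"] - human_counts["plates"],
        robot_counts["orders"] - human_counts["orders"],
    )
```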
Fig. 6.6 shows the generated archive: we illustrate the 3D behavior space as a series of five 2D spaces, one for each
value of the difference in orders. Each colored cell represents an environment with BCs computed by simulating the
two agents in that environment. Lighter colors indicate higher performance f.
We observe that when the difference in orders is −1 or +1, performance is low; this is expected since there are only
2 orders to deliver, thus an absolute difference of 1 means that only one order was delivered.
We simulate the two agents in environments of the extreme regions of the archive to inspect how these environments
affect the resulting workload distributions. For instance, either the robot (green agent) or the simulated human (blue
agent) did all of the work in environments (1) and (2) of Fig. 6.6 respectively. We observe that in these environments
the dish dispenser, onion dispenser, cooking pot and serving counter are aligned in a narrow corridor. The optimal joint
plan is, indeed, that the agent inside the corridor picks up the onions, places them in the pot, picks up the plate, and
delivers the order.
On the other hand, in environments (3) and (4), the workload was distributed exactly or almost exactly evenly.
We see that in these environments, both agents have easy access to the objects. Additionally, all objects are placed
next to each other. This is intentional, since CMA-ME attempts to fill the archive with diverse environments that each
maximize the specified performance metric. Since performance is higher when all orders are delivered in the minimum
amount of time, positioning the objects next to each other results in shorter time to completion.
This object configuration works well in centralized planning since the agents precompute their actions in advance,
and there are no issues from lack of coordination. We observe that this is not the case in the human-aware planning
experiments below.
Figure 6.6: Archive of environments with different workload distributions for the centralized planning agents and four
example environments corresponding to different cells in the archive. Environments (1,2) resulted in uneven workload
distributions, while environments (3,4) resulted in even workload distributions. We annotate four environments from
the archive. The bar shows the normalized value of the objective f.
6.5.2 Workload Distributions with Human-Aware Planning
In this experiment, the human and robot do not execute a precomputed centralized joint plan. Instead, the robot executes
a QMDP policy and the human executes a myopic policy.
We run two experiments: In the first experiment, we generate environments that maximize the performance metric
f, identical to the one in section 6.5.1. In the second, we attempt to find environments that minimize the performance metric. The
latter is useful when searching for failure cases of developed algorithms [85]. We are specifically interested in drops in
performance that arise from the assumptions of the QMDP formulation, rather than, for example, poor performance
because objects are too far from each other. Therefore, for the second experiment we use as a baseline the performance
of the team when the robot executes an MDP policy that fully observes the human subtask, and we maximize the
difference in performance between simulations with the MDP policy and the QMDP policy.
We note that in decentralized planning, the two agents may get “stuck” trying to reach the same object. We adopt a
rule-based mechanism from previous work [35] that selects random actions for both agents until they get unstuck. While
the MDP, QMDP, and myopic human policies are deterministic, the outcomes are stochastic because of these random
actions. Therefore, we run multiple trials in the same environment, and we empirically estimate the performance and
BCs with their median values.
Figure 6.7: Archive of environments with different workload distributions of a QMDP planned robot and a simulated
myopic human.
6.5.2.1 Maximizing Performance
Inspecting the environments in the extremes of the generated archive (Fig. 6.7) reveals interesting object layouts. For
example, all objects in environment (1) are aligned next to each other, and the agent that first gets in front of the leftmost
onion dispenser “blocks” the other agent’s path and completes all the tasks on its own. In environment (2), the
robot starts the task next to the onion dispenser and above the pot. Hence, it does all the onion deliveries, while the
human picks up the onions and delivers the order by moving below the pot.
In environments (3) and (4), each agent can do the task independently, because each team member is close to their
own pot. This results in even workload distribution. The two agents achieve high performance, since no delays arise
from lack of coordination.
6.5.2.2 Minimizing Performance
In the generated archive of Fig. 6.8, lighter colors indicate lower performance of the team of the QMDP robot and the
myopic human compared to an MDP robot and a myopic human. We are particularly interested in the environments
where the team fails to complete the task.
In environment (1) of Fig. 6.8, the simulated human picks up an onion at exactly the same time step that the robot
delivers the third onion to the pot. There is now no empty pot to deliver the onion, so the human defaults to going to the
pot and waiting there, blocking the path of the robot. The environment leads to an interesting edge case that was not
Figure 6.8: Archive of environments when attempting to minimize the performance of a QMDP robot (green agent) and
a simulated myopic human (blue agent). Lighter color indicates lower performance. (1a) and (1b) show successive
frame sequences for environment (1), and similarly (2a), (2b) for environment (2).
accounted for in the hand-designed human model but is revealed by attempting to minimize the performance of the
agents.
In environment (2) of Fig. 6.8, the two agents get stuck in the narrow corridor in front of the rightmost onion
dispenser. Due to the “auto-unstuck” mechanism, the simulated human goes backward towards the onion dispenser.
The QMDP planner, which uses the change of distance to the subtask goal location as observation (see section 6.4.1.1),
erroneously infers that the human subtask is to reach the onion dispenser, and does not move backwards to allow the
human to go to the dish dispenser. This environment highlights a limitation of the distance-based observation function
since it is not robust to random motions that occur when the two agents get stuck.
Overall, we observe that when minimizing performance, the generated environments reveal edge cases that can help
a designer better understand, debug, and improve the agent models.
6.5.3 Team Fluency with Human-Aware Planning
An important aspect of the quality of the interaction between two agents is their team fluency. One team fluency metric
is the concurrent motion (also defined as concurrent activity), which is defined in [135] as “the percentage of time out
of the total task time, during which both agents have been active concurrently.”
We include as a second metric the number of time steps the agents are “stuck,” which occurs when both agents are in
the same position and orientation for two successive time steps. We use the human-aware planning models as in the
Figure 6.9: Archive of environments with different team fluency metrics. Environments (1) and (2) resulted in low team
fluency, while (3) and (4) resulted in high team fluency.
       Workload Distribution Max                      Workload Distribution Min                      Team Fluency
ϵ      Diff. in      Diff. in   Diff. in      Diff. in      Diff. in   Diff. in      Concurrent   Stuck
       Ingredients   Plates     Orders        Ingredients   Plates     Orders        Motion
0.00   0.85          0.67       0.66          0.85          0.78       0.76          0.88         0.92
0.05   0.76          0.55       0.50          0.79          0.63       0.61          0.77         0.85
0.10   0.68          0.46       0.39          0.72          0.52       0.50          0.67         0.76
0.20   0.56          0.33       0.23          0.63          0.40       0.35          0.52         0.62
0.50   0.46          0.22       0.08          0.39          0.26       0.11          0.34         0.30
Table 6.1: Spearman’s rank-order correlation coefficients between the computed BCs and the initial placement of
environments in the archive for increasing levels of noise ϵ in human inputs.
previous experiment, and we search for environments that maximize team performance but are diverse with respect to
the concurrent motion and time stuck of the agents.
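A small sketch of how the two fluency BCs can be computed from a simulated trajectory; the per-timestep record format is assumed for illustration.

```python
# Team fluency BCs: percentage of concurrent motion and number of "stuck" timesteps.
def team_fluency_bcs(trajectory):
    """trajectory: per-timestep dicts with each agent's pose (position, orientation)
    and a flag for whether the agent was active during that timestep."""
    total = len(trajectory)
    concurrent = sum(1 for t in trajectory
                     if t["robot_active"] and t["human_active"])
    stuck = sum(1 for prev, cur in zip(trajectory, trajectory[1:])
                if prev["robot_pose"] == cur["robot_pose"]
                and prev["human_pose"] == cur["human_pose"])
    return concurrent / total, stuck
```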
We observe in the generated archive (Fig. 6.9) that environments with higher concurrent motion have better
performance, since the agents did not spend much time waiting. For example, environments (3) and (4) result in very
high team fluency. These environments have two pots and two onion dispensers, which are easily accessible.
On the other hand, example environments (1) and (2) have poor team fluency. These environments have long
corridors, and one agent needs to wait a long time for the second agent to get out of the corridor. In environment (1), the
two agents get stuck when the myopic human attempts to head towards the onion dispenser ignoring the robot, while
the QMDP agent incorrectly assumes that the human gives way to the robot.
6.6 Robustness to Human Model
When evaluating the generated environments in human-aware planning, we assumed a myopic human model. We wish
to test the robustness of the associated coordination behaviors with respect to the model: if we add noise, will the
position of the environments in the archive change?
Therefore, for each cell in the archives from Fig. 6.7 and 6.8, and for 100 randomly selected cells in the archive
of Fig. 6.9, we compute the BCs for increasing levels of noise in the human actions. We simulate noisy humans by
using an ϵ-myopic human model, where the human follows the myopic model with a probability of 1 − ϵ and takes a
uniformly random action otherwise. We then compute the Spearman’s rank-order correlation coefficient between the
initial position of each environment in the archive and the new position, specified by the computed BCs, in the presence
of noise.
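The sketch below outlines this robustness check: an ϵ-myopic action model and the rank correlation between the noise-free and noisy BC values, computed with SciPy’s spearmanr. The simulate_bcs helper is a placeholder for simulating a playthrough and returning its BCs.

```python
# Sketch of the robustness analysis with an epsilon-myopic human model.
import random
from scipy.stats import spearmanr

def epsilon_myopic_action(myopic_action, action_space, eps):
    """Follow the myopic model with probability 1 - eps, else act uniformly at random."""
    return myopic_action if random.random() > eps else random.choice(action_space)

def bc_rank_correlations(envs, eps, num_bcs):
    baseline = [simulate_bcs(env, eps=0.0) for env in envs]  # original archive placement
    noisy = [simulate_bcs(env, eps=eps) for env in envs]     # placement under noisy inputs
    # One Spearman coefficient (and p-value) per BC dimension.
    return [spearmanr([b[d] for b in baseline], [n[d] for n in noisy])
            for d in range(num_bcs)]
```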
Table 6.1 shows the computed correlation coefficients for each BC and for increasing values of ϵ. The “Workload
Distribution Max,” “Workload Distribution Min,” and “Team Fluency” refer to the archives of Fig. 6.7, 6.8 and Fig. 6.9.
All values are statistically significant with Bonferroni correction (p < 0.001).
We observe that even when ϵ = 0, the correlation is strong but not perfect, since there is randomness in the computed
BCs because of the “auto-unstuck” mechanism. Values of ϵ = 0.05 and 0.1 result in moderate correlation between the
initial and new position of the environments in the archive. The correlation appears to be stronger for the difference
in ingredients. This is because the environments with extreme values of this BC (+6, −6) had nearly zero variance
since one agent would consistently “block” the other agent from accessing the onion dispenser. As expected, when the
simulated human becomes random 50% of the time (ϵ = 0.5), there is only a weak, albeit still significant, correlation.
6.7 User Study
Equipped with the findings from section 6.6, we want to assess whether the differences in coordination observed in
simulation translate to actual differences when the simulated robot interacts with real users.
We selected 12 environments from the generated archives in the human-aware planning experiments of section 6.5,
including 3 “even workload” environments, 3 “uneven workload” environments, 3 “high team fluency” environments
and 3 “low team fluency” environments.
6.7.1 Procedure.
Participants conducted the study remotely by logging into a server while video conferencing with the experimenter.
Users controlled the human agent with the keyboard and interacted with the QMDP robot. The experimenter first
instructed them in the task and asked them to complete three training sessions, where in the first two they practiced
controlling the human with their keyboard and in the third they practiced collaborating with the robot. They then
performed the task in all 12 environments in randomized order (within subjects design). We asked all participants to
complete the tasks as quickly as possible.
6.7.2 Participants.
We recruited 27 participants from the local community (ages 20-30, M=23.81, SD=2.21). Each participant was
compensated $8 for completing the study, which lasted approximately 20 minutes.
6.7.3 Hypotheses.
H1. The difference in the workloads between the human and the robot will be larger in the “uneven workload”
environments, compared to the “even workload” environments.
H2. The team fluency of the human-robot team will be better in the “high team fluency” environments, compared to the
“low team fluency” environments.
6.7.4 Dependent Measures.
Identically to the experiments in section 6.5, we computed the following BCs: the difference in the ingredients, plates,
and orders from the playthroughs in the “even/uneven workload” environments, and the percentage of concurrent
motion and time stuck in the “low/high team fluency” environments.
We used the average over the BCs computed from the three environments of the same type, e.g., the three “even
workload” environments, as an aggregate measure. For the analysis, we used the absolute values of the workload
differences, since we are interested in whether the workload is even or uneven and not which agent performed most of
the actions.
6.7.5 Analysis.
A Wilcoxon signed-rank test determined that two out of the three workload BCs (difference in ingredients: z =
3.968, p < 0.001, difference in orders: z = 2.568, p = 0.01) were significantly larger in the “uneven workload”,
compared to the “even workload” environments.
Additionally, a Wilcoxon signed-rank test showed that the percentage of concurrent motion was significantly higher
for the high team fluency environments, compared to the low team fluency ones (z = 4.541, p < 0.001). There were no
significant differences in the time stuck, since users resolved stuck situations quickly by giving way to the robot.
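For reference, this is roughly how such a paired, within-subjects comparison can be run with SciPy’s Wilcoxon signed-rank test; the arrays here are toy stand-ins for the per-participant aggregate BC values, not the study data.

```python
# Toy example of the paired Wilcoxon signed-rank comparison (scipy).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
uneven = rng.normal(3.0, 1.0, size=27)  # stand-in: |workload diff| per participant, "uneven" envs
even = rng.normal(1.0, 1.0, size=27)    # stand-in: |workload diff| per participant, "even" envs

stat, p = wilcoxon(uneven, even)        # paired test across the 27 participants
print(f"W = {stat:.1f}, p = {p:.4f}")
```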
These results support our hypotheses and show that the differences in the coordination observed in simulation
translate to differences observed in interactions with actual users.
6.8 Discussion.
6.8.1 Limitations.
Our work is limited in many ways. Our user study was conducted online with a simulated robot. Our future goal is
to evaluate our framework on human-robot experiments in a real-world collaborative cooking setting, where users
are exposed to different scenes. Further experiments with a variety of robot and human models would expand the
diversity of the generated environments and the observed behaviors. Another limitation is that, while the framework is
agnostic of the agents’ models, it requires human input for specifying the behavior characteristics and
MIP constraints. Automating part of the solvability specification is an exciting area for future work.
6.8.2 Implications.
We envision our framework as a method to help evaluate human-robot coordination in the future, as well as a reliable
tool to help practitioners debug or tune their coordination algorithms. More generally, our framework can facilitate
understanding of complex human-aware algorithms executing in complex environments. We are excited about future
work that highlights diverse behaviors in different settings where coordination is essential, such as manufacturing and
assistive care. Finally, we hope that our work will guide future human-robot coordination research to consider the
environment as a significant factor in coordination problems.
Chapter 7
Sample Efficient Scenario Generation via Deep Surrogate Models
7.1 Introduction
* We present an efficient method of automatically generating a collection of environments that elicit diverse agent
behaviors. As a motivating example, consider deploying a robot agent at scale in a variety of home environments. The
robot should generalize by performing robustly not only in test homes, but in any end user’s home. To validate agent
generalization, the test environments should have good coverage for the robot agent. However, obtaining such coverage
may be difficult, as the generated environments would depend on the application domain, e.g. kitchen or living room,
and on the specific agent we want to test, since different agents exhibit different behaviors.
To enable generalization of autonomous agents to new environments with differing levels of complexity, previous
work on open-ended learning [282, 283] has integrated the environment generation and the agent training processes.
The interplay between the two processes acts as a natural curriculum for the agents to learn robust skills that generalize
to new, unseen environments [60, 220, 64]. The performance of these agents has been evaluated either in environments
from the training distribution [282, 283, 64] or in suites of manually authored environments [60, 150, 220].
As a step towards testing generalizable agents, there has been increasing interest in competitions [225, 121] that
require agents to generalize to new game layouts. Despite the recent progress of deep learning agents in fixed game
domains, e.g. in Chess [249], Go [248], Starcraft [278], and Poker [200, 32], it has been rule-based agents that have
succeeded in these competitions [121]. Such competitions also rely on manually authored game levels as a test set,
handcrafted by a human designer.
*Work led also by Varun Bhatt and Bryon Tjanaka at the ICAROS Lab.
Figure 7.1: An overview of the Deep Surrogate Assisted Generation of Environments (DSAGE) algorithm. The
algorithm begins by generating and evaluating random environments to initialize the dataset and the surrogate model
(not shown in the figure). An archive of solutions is generated by exploiting a deep surrogate model (blue arrows)
with a QD optimizer, e.g., CMA-ME [91]. A subset of solutions from this archive are chosen by downsampling and
evaluated by generating the corresponding environment and simulating an agent (red arrows). The surrogate model is
then trained on the data from the simulations (yellow arrows). While the images show Mario levels, the algorithm
structure is similar for mazes.
While manually authored environments are important for standardized testing, creating these environments can be
tedious and time-consuming. Additionally, manually authored test suites are often insufficient for eliciting the diverse
range of possible agent behaviors. Instead, we would like an interactive test set that proposes an environment, observes
the agent’s performance and behavior, and then proposes new environments that diversify the agent behaviors, based on
what the system has learned from previous execution traces of the agent.
To address collecting environments with diverse agent behaviors, prior work frames the problem as a quality
diversity (QD) problem [88, 89, 85]. A QD problem consists of an objective function, e.g. whether the agent can solve
the environment, and measure functions, e.g. how long the agent takes to complete their task. The measure functions
quantify the behavior we would like to vary in the agent, allowing practitioners to specify the case coverage they would
like to see in the domain they are testing. While QD algorithms can generate diverse collections of environments, they
require a large number of environment evaluations to produce the collection, and each of these evaluations requires
multiple time-consuming simulated executions of potentially stochastic agent policies.
We study how deep surrogate models that predict agent performance can accelerate the generation of environments
that are diverse in agent behaviors. We draw upon insights from model-based quality diversity algorithms that have
been previously shown to improve sample efficiency in design optimization [97] and Hearthstone deckbuilding [292].
Environments present a much more complex prediction task because the evaluation of environments involves simulating
stochastic agent policies, and small changes in the environment may result in large changes in the emergent agent
behaviors [261].
We make the following contributions: (1) We propose the use of deep surrogate models to predict agent performance
in new environments. Our algorithm, Deep Surrogate Assisted Generation of Environments (DSAGE) (Fig. 7.1),
integrates deep surrogate models into quality diversity optimization to efficiently generate diverse environments. (2) We
show in two benchmark domains from previous work, a Maze domain [60, 220] with a trained ACCEL agent [220] and a
Mario domain [269, 89] with an A* agent [17], that DSAGE outperforms state-of-the-art QD algorithms in discovering
diverse agent behaviors. (3) We show with ablation studies that training the surrogate model with ancillary agent
behavior data and downsampling a subset of solutions from the surrogate archive results in substantial improvements in
performance, compared to the surrogate models of previous work [292].
7.2 Problem Definition
7.2.1 Quality diversity (QD) optimization.
We adopt the QD problem definition from previous work [87]. A QD optimization problem specifies an objective
function f : R^n → R and a joint measure function m : R^n → R^m. For each element s ∈ S, where S ⊆ R^m is the
range of the measure function, the QD goal is to find a solution θ ∈ R^n such that m(θ) = s and f(θ) is maximized.
Since the range of the measure function can be continuous, we restrict ourselves to algorithms from the MAP-Elites
family [53, 202] that discretize this space into a finite number of M cells. A solution θ is mapped to a cell based
on its measure m(θ). The solutions that occupy cells form an archive of solutions. Our goal is to find solutions
θ_i, i ∈ {1, ..., M} that maximize the objective f for all cells in the measure space:
max_{θ_i} Σ_{i=1}^{M} f(θ_i)    (7.1)
The computed sum in Eq. 7.1 is defined as the QD-Score [229], where empty cells have an objective value of 0. A
second metric of the performance of a QD algorithm is coverage of the measure space, defined as the proportion of
cells that are filled in by solutions: (1/M) Σ_{i=1}^{M} 1_{θ_i}.
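As a small illustration of these two metrics, the snippet below computes the QD-Score and coverage from a mapping of occupied archive cells to the objective value of their elite; the archive representation is an assumption for illustration.

```python
# QD-Score and coverage of a discretized archive (empty cells contribute 0).
def qd_score_and_coverage(elite_objectives, num_cells_M):
    """elite_objectives: {cell_index: objective f of the elite occupying that cell}."""
    qd_score = sum(elite_objectives.values())
    coverage = len(elite_objectives) / num_cells_M
    return qd_score, coverage
```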
7.2.2 QD for environment generation.
We assume a single agent acting in an environment parameterized by θ ∈ R^n. The environment parameters can be
locations of different objects or latent variables that are passed as inputs to a generative model [110].† A QD algorithm
generates new solutions θ and evaluates them by simulating the agent on the environment parameterized by θ. The
evaluation returns an objective value f and measure values m. The QD algorithm attempts to generate environments
that maximize f but are diverse with respect to the measures m.
7.3 Background and Related Work
7.3.1 Quality diversity (QD) optimization.
QD optimization originated in the genetic algorithm community with diversity optimization [178], the predecessor to QD.
Later work introduced objectives to diversity optimization and resulted in the first QD algorithms: Novelty Search with
Local Competition [179] and MAP-Elites [202, 53]. The QD community has grown beyond its genetic algorithm roots,
with algorithms being proposed based on gradient ascent [87], Bayesian optimization [163], differential evolution [43],
and evolution strategies [91, 47, 46]. QD algorithms have applications in damage recovery in robotics [53], reinforcement
learning [267, 211], and generative design [81, 118].
Among the QD algorithms, those of particular interest to us are the model-based ones. Current model-based [15,
199] QD algorithms either (1) learn a surrogate model of the objective and measure functions [97, 118, 36], e.g. a
Gaussian process or neural network, (2) learn a generative model of the representation parameters [98, 232], or (3) draw
†For consistency with the generative model literature, we use z instead of θ when denoting latent vectors
inspiration from model-based RL [162, 185]. In particular, Deep Surrogate Assisted MAP-Elites (DSA-ME) [292]
trains a deep surrogate model on a diverse dataset of solutions generated by MAP-Elites and then leverages the model
to guide MAP-Elites. However, DSA-ME has only been applied to Hearthstone deck building, a simpler prediction
problem than predicting agent behavior in generated environments. Additionally, DSA-ME is specific to MAP-Elites
only and cannot run other QD algorithms to exploit the surrogate model. Furthermore, DSA-ME is restricted to direct
search and cannot integrate generative models to generate environments that match a provided dataset.
7.3.2 Automatic environment generation.
Automatic environment generation algorithms have been proposed in a variety of fields. Methods between multiple
communities often share generation techniques, but differ in how each community applies the generation algorithms.
For example, in the procedural content generation (PCG) field [246], an environment generator produces video game
levels that result in player enjoyment. Since diversity of player experience and game mechanics is valued in games,
many level generation systems incorporate QD optimization [113, 89, 70, 165, 259, 244, 243]. The procedural content
generation via machine learning (PCGML) [263, 187] subfield studies environment generators that incorporate machine
learning techniques such as Markov Chains [255], probabilistic graphical models [117], LSTMs [262], generative
models [280, 104, 271, 243], and reinforcement learning [164, 70]. Prior work [156] has leveraged surrogate models
trained on offline data to accelerate search-based PCG [270].
7.4 Deep Surrogate Assisted Generation of Environments (DSAGE)
7.4.1 Algorithm.
We propose the Deep Surrogate Assisted Generation of Environments (DSAGE) algorithm for discovering environments
that elicit diverse agent behaviors. Akin to the MAP-Elites family of QD algorithms, DSAGE maintains a ground-truth
archive where solutions are stored based on their ground-truth evaluations. Simultaneously, DSAGE also trains and
exploits a deep surrogate model for predicting the behavior of a fixed agent in new environments. The QD optimization
occurs in three phases that take place in an outer loop: model exploitation, agent simulation, and model improvement
(Fig. 7.1). Algorithm 9 provides the pseudocode for the DSAGE algorithm.
The model exploitation phase (lines 40–49) is an inner loop that leverages existing QD optimization algorithms and
the predictions of the deep surrogate model to build an archive – referred to as the surrogate archive – of solutions. The
first step of this phase is to query a list of B candidate solutions through the QD algorithm’s ask method. These solutions
are environment parameters, e.g., latent vectors of a GAN, which are passed through the environment generator, e.g., a
GAN, to create an environment (line 44). Next, we make predictions with the surrogate model. The surrogate model first
predicts data representing the agent’s behavior, e.g., the probability of occupying each discretized tile in the environment
(line 45), referred to as “ancillary agent behavior data” (y). The predicted ancillary agent behavior data (yˆ) then guides
the surrogate model’s downstream prediction of the objective ( ˆf) and the measure values (mˆ ) (line 46). Finally, the QD
algorithm’s tell method adds the solution to the surrogate archive based on the predicted objective and measure values.
Note that since DSAGE is independent of the QD algorithm, the ask and tell methods abstract out the QD algorithm’s
details. For example, when the QD algorithm is MAP-Elites or CMA-ME, tell adds solutions if the cell in the measure
space that they belong to is empty or if the existing solution in that cell has a lower objective. For CMA-ME, tell also
includes updating internal CMA-ES parameters.
The agent simulation phase (lines 51–59) inserts a subset of solutions from the surrogate archive into the ground-truth archive. This phase begins by selecting the subset of solutions from the surrogate archive (line 51). The selected
solutions are evaluated by generating the corresponding environment (line 54) and simulating a fixed agent to obtain the
true objective and measure values, as well as ancillary agent behavior data (line 55). Evaluation data is appended to
the dataset, and solutions that improve their corresponding cell in the ground-truth archive are added to that archive
(lines 56, 57).
In the model improvement phase (line 60), the surrogate model is trained in a self-supervised manner through the
supervision provided by the agent simulations and the ancillary agent behavior data.
The algorithm is initialized by generating random solutions and simulating the agent in the corresponding environments (lines 31-37). Subsequently, every outer iteration (lines 39-61) consists of model exploitation followed by agent
simulation and ending with model improvement.
Algorithm 9 Deep Surrogate Assisted Generation of Environments (DSAGE)
Input: N: Maximum number of evaluations, nrand: Number of initial random solutions, Nexploit: Number of
iterations in the model exploitation phase, B: Batch size for the model exploitation QD optimizer
Output: Final version of the ground-truth archive Agt
30 Initialize the ground-truth archive Agt, the dataset D, and the deep surrogate model sm
31 Θ ← generate random solutions(nrand)
32 for θ ∈ Θ do
33 env ← g(θ)
34 f,m, y ← evaluate(env)
35 D ← D ∪ (θ, f,m, y)
36 Agt ← add solution(Agt,(θ, f,m))
37 end
38 evals ← nrand
39 while evals < N do
40 Initialize a QD optimizer qd with the surrogate archive Asurrogate
41 for itr ∈ {1, 2, . . . , Nexploit} do
42 Θ ← qd.ask(B)
43 for θ ∈ Θ do
44 env ← g(θ)
45 yˆ ← sm.predict ancillary(env)
46 fˆ, mˆ ← sm.predict(env, yˆ)
47 qd.tell(θ, fˆ, mˆ)
48 end
49 end
50
51 Θ ← select solutions(Asurrogate)
52
53 for θ ∈ Θ do
54 env ← g(θ)
55 f,m, y ← evaluate(env)
56 D ← D ∪ (θ, f,m, y)
57 Agt ← add solution(Agt,(θ, f,m))
58 evals ← evals + 1
59 end
60 sm.train(D)
61 end
7.4.2 Self-supervised prediction of ancillary agent behavior data.
By default, a surrogate model directly predicts the objective and measure values based on the initial state of the
environment and the agent (provided in the form of a one-hot encoded image). However, we anticipate that direct
prediction will be challenging in some domains, as it requires understanding the agent’s trajectory in the environment.
Thus, we provide additional supervision to the surrogate model in DSAGE via a two-stage self-supervised process.
First, a deep neural network predicts ancillary agent behavior data. In our work, we obtain this data by recording
the expected number of times the agent visits each discretized tile in the environment, resulting in an “occupancy grid.”
We then concatenate the predicted ancillary information, i.e., the predicted occupancy grid, with the one-hot encoded
image of the environment and pass them through another deep neural network to obtain the predicted objective and
measure values. We use CNNs for both predictors. As a baseline, we compare our model with a CNN that directly
predicts the objective and measure values without the help of ancillary data.
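A hedged PyTorch sketch of this two-stage surrogate: one CNN predicts the occupancy grid from the one-hot environment, and a second CNN consumes the environment concatenated with the predicted occupancy and outputs the objective and measures. Layer sizes and the exact heads are illustrative assumptions, not the architecture used in the experiments.

```python
# Illustrative two-stage surrogate model (occupancy prediction + downstream head).
import torch
import torch.nn as nn

class OccupancyPredictor(nn.Module):
    """Predicts a per-tile occupancy logit from the one-hot encoded environment."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )
    def forward(self, env_onehot):
        return self.net(env_onehot)

class ObjectiveMeasurePredictor(nn.Module):
    """Predicts the objective and measures from the environment stacked with
    the predicted occupancy grid."""
    def __init__(self, in_channels, num_measures):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels + 1, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 1 + num_measures)
    def forward(self, env_onehot, occupancy):
        x = torch.cat([env_onehot, occupancy], dim=1)
        return self.head(self.conv(x))

# Prediction path used during model exploitation (cf. Algorithm 9, lines 45-46):
#   occ_hat = occupancy_model(env)
#   f_and_m_hat = downstream_model(env, occ_hat)
```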
7.5 Domains
We test our algorithms in two benchmark domains from prior work: a Maze domain [41, 60, 220] with a trained ACCEL
agent [220] and a Mario domain [269, 89] with an A* agent [17]. We select these domains because, despite their
relative simplicity (each environment is represented as a 2D grid of tiles), agents in these environments exhibit complex
and diverse behaviors.
In the Maze domain, we directly search for different mazes, with the QD algorithm returning the layout of the maze.
In the Mario domain, we search for latent codes that are passed through a pre-trained GAN, similar to the corresponding
previous work.
We select the objective and measure functions as described below. Since the agent or the environment dynamics are
stochastic in each domain, we average the objective and measure values over 50 episodes in the Maze domain and 5
episodes in the Mario domain.
7.5.1 Maze.
We set a binary objective function f that is 1 if the generated environment is solvable and 0 otherwise, indicating the
validity of the environment. Since we wish to generate visually diverse levels that offer a range of difficulty levels for the
agent, we select as measures (1) number of wall cells (range: [0, 256]), and (2) mean agent path length (range: [0, 648],
where 648 indicates a failure to reach the goal).
7.5.2 Mario.
Since we wish to generate playable levels, we set the objective as the completion rate, i.e., the proportion of the level that
the agent completes before dying. We additionally want to generate environments that result in qualitatively different
agent behaviors, thus we selected as measures: (1) sky tiles, the number of tiles of a certain type that are in the top half
of the 2D grid (range: [0, 150]), (2) number of jumps, the number of times that the A* agent jumps during its execution
(range: [0, 100]).
7.6 Experiments
7.6.1 Experiment Design
Independent variables. In each domain (Maze and Mario), we follow a between-groups design, where the independent
variable is the algorithm. We test the following algorithms:
DSAGE: The proposed algorithm that includes predicting ancillary agent behavior data and downsampling the
surrogate archive (Algorithm 9).
DSAGE-Only Anc: The proposed algorithm with ancillary data prediction and no downsampling, i.e., selecting all
solutions from the surrogate archive.
DSAGE-Only Down: The proposed algorithm with downsampling and no ancillary data prediction.
DSAGE Basic: The basic version of the proposed algorithm that selects all solutions from the surrogate archive and
does not predict ancillary data.
(a) Maze
Algorithm          QD-Score               Archive Coverage
DSAGE              16,446.60 ± 42.27      0.40 ± 0.00
DSAGE-Only Anc     14,568.00 ± 434.56     0.35 ± 0.01
DSAGE-Only Down    14,205.20 ± 40.86      0.34 ± 0.00
DSAGE Basic        11,740.00 ± 84.13      0.28 ± 0.00
MAP-Elites         10,480.80 ± 150.13     0.25 ± 0.00
DR                  5,199.60 ± 30.32      0.13 ± 0.00
(b) Mario
Algorithm          QD-Score               Archive Coverage
DSAGE               4,362.29 ± 72.54      0.30 ± 0.00
DSAGE-Only Anc      2,045.28 ± 201.64     0.16 ± 0.01
DSAGE-Only Down     4,067.42 ± 102.06     0.30 ± 0.01
DSAGE Basic         1,306.11 ± 50.90      0.11 ± 0.01
CMA-ME              1,840.17 ± 95.76      0.13 ± 0.01
DR                     92.75 ± 3.01       0.01 ± 0.00
[Plots omitted: QD-Score and archive coverage as a function of the number of evaluations for DSAGE, DSAGE-Only Anc, DSAGE-Only Down, DSAGE Basic, the baseline QD algorithm (MAP-Elites for Maze, CMA-ME for Mario), and DR.]
Figure 7.2: QD-Score and archive coverage attained by baseline QD algorithms and DSAGE in the Maze and Mario
domains over 5 trials. Tables and plots show mean and standard error of the mean.
Baseline QD: The QD algorithm without surrogate assistance. We follow previous work [89] and use CMA-ME
for the Mario domain. Since CMA-ME operates only in continuous spaces, we use MAP-Elites in the discrete Maze
domain.
Domain Randomization (DR) [146, 240, 268]: Algorithm that generates and evaluates random solutions, i.e., wall
locations in the maze domain and the latent code to pass through the GAN in the Mario domain.
Dependent variables. We measure the quality and diversity of the solutions with the QD-Score metric [229] (Eq. 7.1).
As an additional metric of diversity, we also report the archive coverage. We run each algorithm for 5 trials in each
domain.
Hypothesis. We hypothesize that DSAGE will result in a better QD-Score than DSAGE Basic in all domains, which in
turn will result in better performance than the baseline QD algorithm. DSAGE, DSAGE Basic, and the baseline QD
algorithm will all exceed DR. We base this hypothesis on previous work [202, 85] which shows that QD algorithms
outperform random sampling in a variety of domains, as well as previous work [97, 292] which shows that surrogate-assisted MAP-Elites outperforms standard MAP-Elites in design optimization and Hearthstone domains. Furthermore,
we expect that the additional supervision through ancillary agent behavior data and downsampling will result in DSAGE
performing significantly better than DSAGE Basic.
Table 7.1: Number of evaluations required to reach a QD-Score of 10480.8 in the Maze domain and 1306.11 in the
Mario domain.
(a) Maze
Algorithm Evaluations
DSAGE 33,930.40 ± 1,411.04
DSAGE-Only Anc 51,919.60 ± 8,254.24
DSAGE-Only Down 42,816.60 ± 691.38
DSAGE Basic 85,328.60 ± 2,947.24
MAP-Elites 100,000
(b) Mario
Algorithm Evaluations
DSAGE 2,464.40 ± 356.36
DSAGE-Only Anc 7,727.40 ± 1,433.33
DSAGE-Only Down 2,768.60 ± 586.34
DSAGE Basic 10,000
CMA-ME 5,760.00 ± 516.14
7.6.2 Analysis
Fig. 7.2 summarizes the results obtained by the six algorithms on the Maze and the Mario domains.
One-way ANOVA tests showed a significant effect of the algorithm on the QD-Score for the Maze (F(5, 24) =
430.98, p < 0.001) and Mario (F(5, 24) = 238.09, p < 0.001) domains.
Post-hoc pairwise comparisons with Bonferroni corrections showed that DSAGE outperformed DSAGE Basic,
Baseline QD, and DR in both the Maze and the Mario domains (p < 0.001). Additionally, DSAGE Basic outperformed
MAP-Elites and DR in the Maze domain (p < 0.001), while it performed significantly worse than the QD counterpart,
CMA-ME, in the Mario domain (p = 0.003). Finally, Baseline QD outperformed DR in both the Maze and Mario
domains (p < 0.001).
These results show that deep surrogate assisted generation of environments results in significant improvements
compared to quality diversity algorithms without surrogate assistance. They also show that adding ancillary agent
behavior data and downsampling is important in both domains. Without these components, DSAGE Basic has limited
or no improvement compared to the QD algorithm without surrogate assistance. Additionally, domain randomization
is significantly worse than DSAGE as well as the baselines. The archive coverage and consequently the QD-Score is
negligible in the Mario domain since randomly sampled latent codes led to little diversity in the levels.
Table 7.1 shows another metric of the speed-up provided by DSAGE: the number of evaluations (agent simulations)
required to reach a fixed QD-Score. We set this fixed QD-Score to be 10480.8 in the Maze domain and 1306.11 in
the Mario domain, which are the mean QD-Scores of MAP-Elites and DSAGE Basic, respectively, in those domains.
DSAGE reaches these QD-Scores faster than the baselines do.
To assess the quality of the trained surrogate model, we create a combined dataset consisting of data from one run
of each surrogate assisted algorithm. We use this dataset to evaluate the surrogate models trained from separate runs of
Table 7.2: Mean absolute error of the objective and measure predictions by the surrogate models.
                                Maze                                                        Mario
Algorithm          Objective   Number of Wall   Mean Agent Path          Objective   Number of Sky   Number of Jumps
                   MAE         Cells MAE        Length MAE               MAE         Tiles MAE       MAE
DSAGE              0.03        0.37             96.58                    0.10        1.10            7.16
DSAGE-Only Anc     0.04        0.96             95.14                    0.20        1.11            9.97
DSAGE-Only Down    0.10        0.95             151.50                   0.11        0.87            6.52
DSAGE Basic        0.18        5.48             157.69                   0.20        2.16            10.71
DSAGE and its variants. Table 7.2 shows the mean absolute error (MAE) of the predictions by the surrogate models.
The model learned by DSAGE Basic fails to predict the agent-based measures well. It has an MAE of 157.69 for the
mean agent path length in Maze and MAE = 10.71 for the number of jumps in Mario. In contrast, the model learned by
DSAGE makes more accurate predictions, with MAE = 96.58 for mean agent path length and MAE = 7.16 for number
of jumps.
7.6.3 Ablation Study
DSAGE (Algorithm 9) has two key components: (1) self-supervised prediction of ancillary agent behavior
data, and (2) downsampling to select solutions from the surrogate archive. We perform an ablation study by treating the
inclusion of ancillary data prediction (ancillary data / no ancillary data) and the method of selecting solutions from
the surrogate archive (downsampling / full selection) as independent variables. A two-way ANOVA for each domain
showed no significant interaction effects. We perform a main effects analysis for each independent variable.
7.6.3.1 Inclusion of ancillary data prediction.
A main effects analysis for the inclusion of ancillary data prediction showed that algorithms that predict ancillary agent
behavior data (DSAGE, DSAGE-Only Anc) performed significantly better than their counterparts with no ancillary
data prediction (DSAGE-Only Down, DSAGE Basic) in both domains (p < 0.001).
Figure 7.2 shows that predicting ancillary agent behavior data also resulted in a larger mean coverage for Maze,
while it has little or no improvement for Mario. Additionally, as shown in Table 7.2, predicting ancillary agent behavior
data helped improve the prediction of the mean agent path length in the Maze domain but provided little improvement
to the prediction of the number of jumps in the Mario domain. The reason is that in the Maze domain, the mean agent
path length is a scaled version of the sum of the agent’s tile occupancy frequency, hence the two-stage process which
predicts the occupancy grid first is essential for improving the accuracy of the model. On the other hand, the presence
of a jump in Mario depends not only on cell occupancy, but also on the structure of the level and the sequence of the
occupied cells.
7.6.3.2 Method of selecting solutions from the surrogate archive.
A main effects analysis for the method of selecting solutions from the surrogate archive showed that the algorithms
with downsampling (DSAGE, DSAGE-Only Down) performed significantly better than their counterparts with no
downsampling (DSAGE-Only Anc, DSAGE Basic) in both domains (p < 0.001).
A major advantage of downsampling is that it decreases the number of ground-truth evaluations in each outer
iteration. Thus, for a fixed evaluation budget, downsampling results in a greater number of outer iterations. For
instance, in the Maze domain, runs without downsampling had only 6-7 outer iterations, while runs with downsampling
had approximately 220 outer iterations. More outer iterations lead to more training and thus higher accuracy of the
surrogate model. In turn, a more accurate surrogate model will generate a better surrogate archive in the inner loop.
We ran an ablation to test between two possible explanations for why having more outer iterations helps with
performance: (1) larger number of training epochs, (2) more updates to the dataset allowing the surrogate model to
iteratively correct its own errors. We observed that iterative correction accounted for most of the performance increase
with downsampling.
The second advantage of downsampling is that it selects solutions evenly from all regions of the measure space,
thus creating a more balanced dataset. This helps train the surrogate model in parts of the measure space that are not
frequently visited. We compared against an algorithm in which we select a subset of solutions uniformly at random from
the surrogate archive instead of downsampling. We observe that downsampling has a slight advantage over uniform
random sampling in the Maze domain.
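One simple way to implement downsampling, sketched below, is to partition the archive’s cells into fixed-size blocks along each measure dimension and keep at most one elite per block, which spreads the ground-truth evaluations evenly over the measure space. The archive representation and block size are illustrative assumptions; the selection scheme used in the experiments may differ in details.

```python
# Illustrative downsampling of a surrogate archive keyed by cell index tuples.
def downsample_archive(surrogate_archive, block=5):
    """surrogate_archive: {cell_index_tuple: solution}. Keeps one elite per
    block of `block` cells along each measure dimension."""
    selected, seen_blocks = [], set()
    for cell, solution in surrogate_archive.items():
        block_id = tuple(c // block for c in cell)
        if block_id not in seen_blocks:
            seen_blocks.add(block_id)
            selected.append(solution)
    return selected
```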
Furthermore, if instead of downsampling we sampled multiple solutions from nearby regions of the surrogate
archive, the prediction errors could cause the solutions to collapse to a single cell in the ground-truth archive, resulting
in many solutions being discarded.
Overall, our ablation study shows that both predicting the occupancy grid as ancillary data and downsampling the
surrogate archive independently help improve the performance of DSAGE.
[Figure 7.3 heatmap omitted: archive axes are the number of wall cells and the mean agent path length. Annotated example levels: (a) 43 wall cells, path length MAX; (b) 72 wall cells, path length 610; (c) 50 wall cells, path length 81; (d) 100 wall cells, path length 297; (e) 150 wall cells, path length 369; (f) 188 wall cells, path length 636; (g) 121 wall cells, path length 200; (h) 214 wall cells, path length 8.]
Figure 7.3: Archive and levels generated by DSAGE in the Maze domain. The agent’s initial position is shown as an
orange triangle, while the goal is a green square.
7.6.4 Qualitative Results
Figure 7.3 and Figure 7.4 show example environments generated by DSAGE in the Maze and Mario domains.
Having the mean agent path length as a measure in the Maze domain results in environments of varying difficulty for
the ACCEL agent. For instance, we observe that the environment in Figure 7.3(a) has very few walls, yet the ACCEL
agent gets stuck in the top half of the maze and is unable to find the goal within the allotted time. On the other hand, the
environment in Figure 7.3(d) is cluttered with multiple dead-ends, yet the ACCEL agent is able to reach the goal.
Figure 7.4 shows that the generated environments result in qualitatively diverse behaviors for the Mario agent too.
Level (b) only has a few sky tiles and is mostly flat, resulting in a small number of jumps. Level (c) has a “staircase
trap” on the right side, forcing the agent to perform continuous jumps to escape and complete the level. We include
videos of the playthroughs in the supplemental material.
7.7 Limitations and Future Work
Automatic environment generation is a rapidly growing research area with a wide range of applications, including
designing video game levels [246, 263, 187], training and testing autonomous agents [236, 45, 282, 60, 220], and
[Figure 7.4 heatmap omitted: archive axes are the number of sky tiles and the number of jumps. Annotated example levels: (a) 40 sky tiles, 40 jumps; (b) 1 sky tile, 3 jumps; (c) 100 sky tiles, 50 jumps; (d) 140 sky tiles, 6 jumps.]
Figure 7.4: Archive and levels generated by DSAGE in the Mario domain. Each level shows the path Mario takes,
starting on the left of the level and finishing on the right.
discovering failures in human-robot interaction [85, 89]. We introduce the DSAGE algorithm, which efficiently
generates a diverse collection of environments via deep surrogate models of agent behavior.
Our paper has several limitations. First, occupancy grid prediction does not encode temporal information about the
agent. While this prediction allows us to avoid the compounding error problem of model-based RL [289], forgoing
temporal information makes it harder to predict some behaviors, such as the number of jumps in Mario. We will explore
this trade-off in future work.
Furthermore, we have studied 2D domains where a single ground-truth evaluation lasts between a few seconds and
a few minutes. We are excited about the use of surrogate models to predict the performance of agents in more complex
domains with expensive, high-fidelity simulations [288].
Chapter 8
Quality Diversity Scenario Generation for Complex Human-Robot
Interaction
8.1 Introduction
* As the complexity of robotic systems that interact with people increases, it becomes impossible for designers and
end-users to anticipate how a robot will act in different environments and with different users. For instance, consider a
robotic arm collaborating with a user on a package labeling task, where a user attaches a label while the robot presses a
stamp with the goal of completing the task as fast as possible (Fig. 8.1). The robot’s motion depends on which object
the user selects to label, how the user moves towards that object, and how all objects are arranged in the environment.
Thus, evaluating the system requires testing it with a diverse range of user behaviors and object arrangements.
While user studies are essential for understanding how users will interact with a robot, they are limited in the number
of environments and user behaviors they can cover. Algorithmically generating scenarios with simulated robot and
human behaviors in an “Oz of Wizard” paradigm [260] can complement user studies by finding failures and elucidating
a holistic view of the strengths and limitations of a robotic system’s behavior.
Previous work [85, 90] has formulated algorithmic scenario generation as a quality diversity (QD) problem and
demonstrated the effectiveness of QD algorithms in generating diverse collections of scenarios in a shared control
teleoperation domain. In that domain, a user teleoperates a robotic arm with a joystick interface, while the robot
observes the joystick inputs to infer the user’s goal and assist the user in reaching their goal. However, these interactions
*Work led by Varun Bhatt at the ICAROS Lab.
Figure 8.1: Example scenario in a collaborative package labeling task found by our proposed surrogate assisted scenario
generation framework. The presence of the two objects behind the robot results in its expected cost-minimizing policy
to move towards the object in the front, resulting in a conflict with the user who is reaching the object at the same time.
only last a few seconds in contrast to collaborative, sequential tasks that last much longer. For instance, completing the
package labeling task (Fig. 8.1) can take several minutes, making their evaluation expensive. This limits the applicability
of QD algorithms, which require a large number of evaluations [97].
Our key insight is that we can train deep neural networks as surrogate models to predict human-robot interaction outcomes and integrate them into the scenario generation process. In addition to making scenario evaluations inexpensive,
deep neural networks are end-to-end differentiable, which allows us to integrate state-of-the-art differentiable quality
diversity (DQD) algorithms [87] to efficiently discover scenarios that the surrogate model predicts are challenging with
diverse behavior.
We make the following contributions: (1) We introduce using deep neural networks as surrogate models to predict
human-robot interaction outcomes, such as time to task completion, maximum robot path length, or total waiting
time; (2) We integrate surrogate models with differentiable quality diversity (DQD) algorithms that leverage gradient
information backpropagated through the surrogate model. (3) We show, in the shared control teleoperation domain of
previous work [85] and in a shared workspace collaboration domain [222], that surrogate assisted scenario generation
results in significant benefits in terms of sample efficiency. It also achieves a significant reduction in computational
time in the collaboration domain, where evaluations are particularly expensive.
8.2 Problem Statement
We model the problem of generating a diverse and challenging dataset of human-robot interaction scenarios as a quality
diversity (QD) problem and adopt the QD definition from prior work [87].
We assume a scenario parameterized by θ ∈ R^n. The scenario parameters could be object positions and types in the
environment, human model parameters, or latent inputs to a generative model of environments, which is converted to a
scenario via a function G(θ). The objective function f : R^n → R assesses the quality of a scenario θ. Because we
wish to find challenging scenarios, higher quality implies worse team performance, e.g., longer task completion time.
We further assume a set of user-defined measure functions, m_i : R^n → R, or as a vector function m : R^n → R^k,
that quantify aspects of the scenario that we wish to diversify, e.g., distance between objects, noise in user inputs, or
human and robot path length. The range of m forms a measure space S = m(R^n), which we assume is tessellated into
M cells, forming an archive.
The QD objective is to maximize the QD-Score [229]: max_{θ_i} Σ_{i=1}^{M} f(θ_i). Here θ_i refers to the scenario with
the highest quality in cell i of the archive. If there are no scenarios in a cell, f(θ_i) is assumed to be zero.
The differentiable quality diversity (DQD) problem formulation is a special case of QD where the objective function
f and measure functions m are first-order differentiable.
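As a concrete illustration of this formulation, the following minimal Python sketch maintains a tessellated archive and computes the QD-Score defined above. The objective and measure functions are placeholders for illustration, not those of our HRI domains.

import numpy as np

# Illustrative 2D measure space tessellated into a 10 x 10 grid of cells.
GRID_DIMS = (10, 10)
MEASURE_RANGES = ((0.0, 1.0), (0.0, 1.0))

archive = {}  # cell index -> (best objective value, scenario parameters)

def cell_index(measures):
    """Map a measure vector m(theta) to a cell of the tessellation."""
    idx = []
    for m, (lo, hi), dim in zip(measures, MEASURE_RANGES, GRID_DIMS):
        frac = np.clip((m - lo) / (hi - lo), 0.0, 1.0 - 1e-9)
        idx.append(int(frac * dim))
    return tuple(idx)

def add_to_archive(theta, f_value, measures):
    """Keep only the highest-quality scenario in each cell."""
    idx = cell_index(measures)
    if idx not in archive or f_value > archive[idx][0]:
        archive[idx] = (f_value, theta)

def qd_score():
    """Sum of the best objective values over occupied cells; empty cells contribute zero."""
    return sum(f for f, _ in archive.values())

# Example with placeholder objective and measure functions.
rng = np.random.default_rng(0)
for _ in range(1000):
    theta = rng.normal(size=5)                      # scenario parameters
    f_value = float(np.exp(-np.sum(theta ** 2)))    # placeholder (nonnegative) objective
    measures = 1.0 / (1.0 + np.exp(-theta[:2]))     # placeholder measures in (0, 1)
    add_to_archive(theta, f_value, measures)

print(f"occupied cells: {len(archive)}, QD-Score: {qd_score():.3f}")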
8.3 Background
8.3.1 Scenario Generation.
Algorithmic scenario generation has many applications, which include designing video game levels [113, 89, 70, 165,
259, 244, 243] and testing autonomous vehicles [9, 205, 2, 237, 102, 241, 94], motion planning algorithms [295], and
reinforcement learning agents [21, 59, 45, 236]. It has also been applied to create curricula for robot learning [195,
176, 78, 77] and to co-evolve agents and environments for agent generalizability [282, 283, 95, 26, 64, 63, 60, 150,
220]. Most relevant to our work is prior work [85, 90] in human-robot interaction that applied the MAP-Elites [202]
and CMA-ME [91] quality diversity (QD) algorithms to find robot failures in a shared control teleoperation domain. In
this work, we significantly improve sample and wall-clock efficiency by combining the state-of-the-art QD algorithms
CMA-MAE [86] and CMA-MAEGA [86] with surrogate models that predict human-robot interaction outcomes.
[Figure 8.2 block diagram. Inner loop (red arrows): sample gradient coefficients, branch scenarios via surrogate model gradients, generate scenarios and evaluate them on the surrogate model, predict system performance, measures, and occupancy for each scenario, update the surrogate archive, and ascend to maximize the QD objective. Outer loop (blue arrows): simulate scenarios from the surrogate archive, label ground-truth system performance, measures, and occupancy for each scenario, update the ground-truth archive, update the scenario dataset, and train the surrogate model on the scenario dataset. Generated scenarios pass through a MIP repair, and robot and human state occupancy prediction is self-supervised.]
Figure 8.2: An overview of our proposed differentiable surrogate assisted scenario generation (DSAS) algorithm for
HRI tasks. The algorithm runs in two stages: an inner loop to exploit a surrogate model of the human and the robot
behavior (red arrows) and an outer loop to evaluate candidate scenarios and add them to a dataset (blue arrows).
8.3.2 QD Algorithms.
QD algorithms, such as MAP-Elites [202] and CMA-MAE [91, 86], solve the QD problem defined in Section 8.2. They
have been used to generate diverse locomotion strategies [202, 267], video game levels [226], nano-materials [151],
and building layouts [99]. Certain prior QD algorithms [97, 119, 292, 21] have leveraged surrogate models based on
Gaussian processes or deep neural networks to guide the search. Most relevant to our work is the Deep Surrogate
Assisted Generation of Environments (DSAGE) [21] algorithm, which exploits a surrogate model with quality diversity
algorithms to generate environments. However, DSAGE has only been applied in single-agent grid-world game
domains. In contrast, this work addresses the much more complex task of human-robot interaction scenarios, which
requires the advances in the surrogate model, QD search, and scenario generation described in Section 8.4.
8.4 Surrogate Assisted Scenario Generation
Our method for algorithmically generating diverse collections of HRI scenarios builds upon recent work in generating
single-agent grid-world environments with surrogate models [21]. We briefly describe the advances necessary to scale
these techniques from single-agent grid-world game domains to the much more complex HRI domains.
8.4.1 Surrogate Models for Human-Robot Interaction.
We scale the surrogate model to predict outcomes of HRI scenarios that include both robot and human behavior and
environment parameters. First, we allow both the environment and the human model parameters as inputs to the
surrogate model. Second, we discretize the shared workspace and predict two occupancy grids, one for the human
and one for the robot. We then stack both predictions as inputs to a convolutional neural network, which predicts the
objective and measure functions.
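The following PyTorch sketch illustrates this structure; the layer sizes and grid resolution are illustrative assumptions rather than the exact architecture used in our experiments. The predicted human and robot occupancy grids form the two input channels of a convolutional network that regresses the objective and the k measures.

import torch
import torch.nn as nn

class OccupancySurrogate(nn.Module):
    """Illustrative surrogate head: stacked human/robot occupancy grids -> objective and k measures."""

    def __init__(self, num_measures=2):
        super().__init__()
        # Two input channels: predicted human occupancy and predicted robot occupancy.
        self.conv = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 1 + num_measures),  # objective f and measures m_1 ... m_k
        )

    def forward(self, human_occ, robot_occ):
        x = torch.stack([human_occ, robot_occ], dim=1)  # (batch, 2, H, W)
        out = self.head(self.conv(x))
        return out[:, 0], out[:, 1:]  # predicted objective, predicted measures

# Usage on a batch of 8 hypothetical 32 x 32 occupancy grids.
model = OccupancySurrogate()
f_hat, m_hat = model(torch.rand(8, 32, 32), torch.rand(8, 32, 32))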
8.4.2 Scenario Repair via Mixed Integer Programming.
In contrast to previous work [85] that considered a single workspace, scenarios in our work have disjoint workspaces,
with each workspace imposing constraints on object arrangement, such as boundary and collision avoidance constraints.
Simulating the scenario is impossible if these constraints are not satisfied. We thus adopt a generate-then-repair
approach [291, 88], where we generate unconstrained scenarios and pass them through a mixed integer program
(MIP). We propose a MIP formulation that solves for the minimum cost edit of object locations – the sum of object
displacement distances – that satisfies the constraints for a valid scenario.
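The sketch below, written with the open-source PuLP modeling library, illustrates the repair idea under simplifying assumptions: it handles only rectangular workspace-boundary constraints and measures displacement with a per-axis (L1) distance. The full formulation additionally encodes collision avoidance between objects and assignment to disjoint workspaces.

import pulp

def repair_objects(positions, workspace):
    """Minimum-total-displacement repair of object positions into a rectangular workspace.

    positions: list of (x, y) scenario-generated object locations (possibly invalid).
    workspace: (x_min, x_max, y_min, y_max) bounds of the valid region.
    """
    x_min, x_max, y_min, y_max = workspace
    prob = pulp.LpProblem("scenario_repair", pulp.LpMinimize)
    repaired, deltas = [], []
    for i, (px, py) in enumerate(positions):
        x = pulp.LpVariable(f"x_{i}", lowBound=x_min, upBound=x_max)
        y = pulp.LpVariable(f"y_{i}", lowBound=y_min, upBound=y_max)
        dx = pulp.LpVariable(f"dx_{i}", lowBound=0)
        dy = pulp.LpVariable(f"dy_{i}", lowBound=0)
        # dx >= |x - px| and dy >= |y - py| via two linear constraints each.
        prob += dx >= x - px
        prob += dx >= px - x
        prob += dy >= y - py
        prob += dy >= py - y
        repaired.append((x, y))
        deltas.extend([dx, dy])
    prob += pulp.lpSum(deltas)  # minimize the total object displacement
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(v[0].value(), v[1].value()) for v in repaired]

# Example: the out-of-bounds object is pushed back to the workspace edge.
print(repair_objects([(1.2, 0.5), (0.3, -0.2)], workspace=(0.0, 1.0, 0.0, 1.0)))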
8.4.3 Objective Regularization.
The generate-then-repair strategy repairs invalid scenarios by moving objects placed outside the workspace boundaries
to the edge of the workspaces. However, this does not incentivize the QD algorithm to search the workspace interiors,
which can result in the search diverging away from the workspace areas. To guide the search towards the workspace
interiors, we discount the objective function f of the QD formulation by the cost of the MIP repair.
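A minimal sketch of the resulting discounted objective; the repair-cost weighting is an illustrative assumption.

def regularized_objective(raw_objective, repair_cost, weight=1.0):
    """Discount the QD objective by the MIP repair cost so the search is rewarded for
    proposing scenarios that are already (nearly) valid inside the workspaces."""
    return raw_objective - weight * repair_cost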
8.4.4 DQD with Surrogate Models.
The DSAGE algorithm exploits the surrogate model with derivative-free QD algorithms. A key observation is that the
surrogate model is an end-to-end differentiable neural network. We can take advantage of this by exploiting the surrogate
model with differentiable quality diversity (DQD) algorithms [87], which leverage the gradients of the objective and
measure functions to accelerate QD optimization. Leveraging DQD also lets us scale to higher dimensional scenario
parameter spaces since the search is applied over the objective-measure space (k + 1 dimensions) instead of the scenario
parameter space (n dimensions).
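The sketch below shows how such gradients can be obtained by backpropagating through the surrogate with automatic differentiation; the stand-in surrogate network is purely illustrative.

import torch

def surrogate_gradients(surrogate, theta):
    """Backpropagate through a differentiable surrogate to obtain the objective and
    measure gradients consumed by a DQD algorithm (e.g., CMA-MAEGA).

    surrogate: callable mapping scenario parameters theta (shape (n,)) to a tuple
               (predicted objective scalar, predicted measures of shape (k,)).
    Returns grad_f of shape (n,) and grad_m of shape (k, n).
    """
    theta = theta.detach().clone().requires_grad_(True)
    f_hat, m_hat = surrogate(theta)
    grad_f = torch.autograd.grad(f_hat, theta, retain_graph=True)[0]
    grad_m = torch.stack([
        torch.autograd.grad(m_hat[j], theta, retain_graph=True)[0]
        for j in range(m_hat.shape[0])
    ])
    return grad_f, grad_m

# Example with a stand-in differentiable surrogate over 6 scenario parameters and 2 measures.
net = torch.nn.Sequential(torch.nn.Linear(6, 32), torch.nn.Tanh(), torch.nn.Linear(32, 3))

def toy_surrogate(theta):
    out = net(theta)
    return out[0], out[1:]

grad_f, grad_m = surrogate_gradients(toy_surrogate, torch.zeros(6))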
8.4.5 Algorithm.
Fig. 8.2 provides an overview of the complete algorithm, which consists of an outer loop (blue arrows) and an inner loop
(red arrows). In the inner loop, a QD algorithm searches for scenarios that are challenging and diverse according to the
surrogate model predictions. We repair the generated scenarios to ensure validity and evaluate each repaired scenario
to obtain ground truth objective and measure values. Based on these values, we add each scenario to a ground-truth
archive and a training dataset for the surrogate model. The surrogate model trains on this dataset, correcting prediction
errors exploited by the QD algorithm. After accumulating enough diverse data, the surrogate model starts making
accurate predictions, and the inner loop produces truly diverse and challenging scenarios. After multiple outer loop
iterations, the ground-truth archive accumulates diverse and challenging scenarios, testing the HRI system’s strengths
and limitations.
The key idea behind the proposed algorithm is that exploiting the surrogate model with QD algorithms produces
diverse – with respect to the surrogate model predictions – scenarios. Labeling these scenarios by evaluating them in a
simulator and using them to retrain the surrogate model in turn improves its predictions in subsequent iterations. Thus,
the surrogate model self-improves over time by training on the diverse data generated according to its predictions.
The proposed improvements result in two versions of our algorithm: (1) Surrogate Assisted Scenario Generation
(SAS), which employs a derivative-free QD algorithm, CMA-MAE, in the inner loop. (2) Differentiable Surrogate
Assisted Scenario Generation (DSAS, Fig. 8.2), which employs a DQD algorithm, CMA-MAEGA, in the inner loop.
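The following high-level Python sketch summarizes the two loops. All callables (simulate, train_surrogate, run_inner_qd_loop, repair) and the cell indexing are placeholders standing in for the components described above, not the actual implementation.

def surrogate_assisted_scenario_generation(num_outer_iters, init_scenarios, simulate,
                                           train_surrogate, run_inner_qd_loop, repair):
    """High-level sketch of the SAS/DSAS outer loop. Placeholder callables:
    simulate(theta) -> (f, measures) runs the expensive ground-truth evaluation,
    train_surrogate(dataset) -> model fits the surrogate on all labeled scenarios,
    run_inner_qd_loop(model) -> candidate scenarios exploits the surrogate with
    CMA-MAE or CMA-MAEGA, and repair(theta) -> theta_valid is the MIP repair."""
    dataset = [(theta, *simulate(theta)) for theta in init_scenarios]
    ground_truth_archive = {}
    for _ in range(num_outer_iters):
        model = train_surrogate(dataset)              # outer loop: (re)train the surrogate
        candidates = run_inner_qd_loop(model)         # inner loop: surrogate-archive scenarios
        for theta in candidates:
            theta = repair(theta)                     # ensure validity before simulation
            f, measures = simulate(theta)             # ground-truth labels
            dataset.append((theta, f, measures))
            cell = tuple(round(m, 2) for m in measures)   # placeholder tessellation of measure space
            if cell not in ground_truth_archive or f > ground_truth_archive[cell][0]:
                ground_truth_archive[cell] = (f, theta)
    return ground_truth_archive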
8.5 Domains
We consider two HRI domains from prior work: shared control teleoperation [148] and shared workspace collaboration [222] with a 6-DoF Gen2 Kinova JACO arm.
8.5.1 Shared Control Teleoperation.
In shared control teleoperation, a user provides low-dimensional joystick inputs towards a goal. To aid the user, the robot
attempts to infer the human goal and move towards it autonomously. An optimal policy would correctly infer the human
goal and reach the goal along the shortest path. The robot solves a POMDP with the user’s goal as a latent variable
and updates its belief based on the human input, assuming a noisily-optimal user [148]. With hindsight optimization
and first-order approximation, this results in the robot’s actions being a weighted average of the optimal path towards
each goal, where the weights are proportional to the respective goal probabilities. Scenario parameters include both the
environment, i.e., the coordinates of two goal objects, and the human actions, i.e., a set of human trajectory waypoints.
To search for failures, we set the objective to be the time taken to reach the correct goal. We diversify scenarios with
respect to the noise in human inputs (variation from the optimal path) and the scene clutter (distance between goals).
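A minimal sketch of the goal inference and action blending described above, under simplifying assumptions: a linear alignment likelihood for the noisily-optimal user and straight-line paths in place of the planner's optimal paths. The rationality coefficient is an illustrative value, not the exact model of [148].

import numpy as np

def update_goal_belief(goal_probs, goals, user_pos, user_input, beta=5.0):
    """Noisily-optimal user model: joystick inputs aligned with the direction to a goal
    make that goal more likely (softmax with rationality coefficient beta)."""
    scores = np.array([float(np.dot(user_input, g - user_pos)) for g in goals])
    likelihood = np.exp(beta * (scores - scores.max()))
    posterior = np.asarray(goal_probs) * likelihood
    return posterior / posterior.sum()

def blended_robot_action(robot_pos, goals, goal_probs):
    """Robot action as a weighted average of the directions of the (straight-line) paths
    to each goal, weighted by the current belief over goals."""
    directions = []
    for g in goals:
        d = np.asarray(g) - robot_pos
        norm = np.linalg.norm(d)
        directions.append(d / norm if norm > 1e-8 else np.zeros_like(d))
    return np.average(directions, axis=0, weights=goal_probs)

# Example: two candidate goals and a user input nudging towards the first goal.
goals = [np.array([0.5, 0.3]), np.array([-0.2, 0.6])]
belief = update_goal_belief([0.5, 0.5], goals, np.zeros(2), np.array([0.4, 0.2]))
action = blended_robot_action(np.zeros(2), goals, belief)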
8.5.2 Shared Workspace Collaboration.
In shared workspace collaboration, the human and the robot simultaneously execute a sequential task in a shared
workspace with disjoint regions [222, 147]. An optimal robot policy would correctly infer the human’s current goal and
move to a different goal along the shortest path while avoiding collisions with the human. As in the teleoperation task,
the robot models the human goal as a latent variable in a POMDP (but with human hand as its observation) and uses
hindsight optimization and first-order value function approximations to act in real time. The robot attempts to avoid the
goal intended by the human by selecting the nearest goal from a feasible set of goals different than the user’s, i.e., it
maps a human candidate goal to a different goal-to-go. The robot’s action is a weighted average of the optimal path
towards each goal-to-go, with weights proportional to the probability of the corresponding human goal.
We parameterize the scenario with three goal coordinates and set the objective to be the task completion time. We
select measures based on factors that we expect to affect the team performance: object arrangement, the accuracy
of inference of user’s goal, and the distance required to reach the goals. We choose two sets of measures: (1) The
minimum distance between goal objects and the maximum probability assigned by the robot to the wrong goal. (2) The
robot path length and the total time for which the human and robot have to wait when reaching the same goal.
In this domain, we model the human as solving a softmax MDP with the maximum entropy formulation [296].
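As a sketch, the resulting human action distribution takes a softmax form over the human's action values; the temperature below is an illustrative parameter rather than the value used in our experiments.

import numpy as np

def maxent_human_policy(q_values, temperature=1.0):
    """Softmax (maximum-entropy) action distribution over the human's action values.
    A lower temperature yields a more optimal human."""
    z = (np.asarray(q_values) - np.max(q_values)) / temperature
    probs = np.exp(z)
    return probs / probs.sum()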
8.6 Experiments
Independent Variables. Our two independent variables are the domain and the algorithm.
Our three domains are: (a) shared control teleoperation with distance between the goals and human variation as
measures; (b) shared workspace collaboration with minimum distance between the goals and maximum wrong goal
probability as measures (collaboration I); (c) shared workspace collaboration with robot path length and total wait time
as measures (collaboration II).
In each domain, we compare five different algorithms: (a) Random Search, where we uniformly sample scenarios
from the valid regions. (b) MAP-Elites [202], as adapted for scenario generation in previous work [85], with the
additional objective regularization described in Section 8.4. (c) CMA-MAE [86] with objective regularization. (d) SAS:
The proposed derivative-free version of our surrogate assisted scenario generation algorithm. We apply CMA-MAE as
[Figure 8.3 plots: QD-Score vs. number of evaluations (top row) and vs. wall-clock time in hours (bottom row) for (a) shared control teleoperation, (b) collaboration I, and (c) collaboration II.]
Figure 8.3: QD-Score attained in the three domains as a function of the number of evaluations (top) and the wall-clock
time (bottom). Algorithms with surrogate models have better sample efficiency but require more wall-clock time per
evaluation compared to other algorithms due to the overhead of model evaluations and model training. Plots show the
mean and standard error of the mean.
the derivative-free QD algorithm in the inner loop. (e) DSAS: The proposed differentiable surrogate assisted scenario
generation with the DQD algorithm CMA-MAEGA in the inner loop.
Dependent Variable: Following previous work [85], we set the QD-Score [229] as the dependent variable that
summarizes the quality and diversity of solutions. We compute the QD-Score at the end of 10,000 evaluations, averaged
over 10 trials of random search, MAP-Elites, CMA-MAE, and 5 trials of SAS and DSAS.
Hypotheses:
H1. We hypothesize that the surrogate assisted QD algorithms SAS and DSAS will outperform CMA-MAE, MAP-Elites,
and random search. We base this hypothesis on previous work, which has shown the benefit of integrating QD with
surrogate models in single-agent domains [292, 21].
H2. We hypothesize that DSAS will outperform SAS. We base this hypothesis on previous work, which has shown
that DQD algorithms perform significantly better than their derivative-free counterparts [87] when the objective and
measure gradients are available.
Analysis. A two-way ANOVA test showed a significant interaction effect (F(8.0, 105.0) = 305.79, p < 0.001).
Simple main effects analysis on each domain showed a significant effect of the algorithm on the QD-Score (p < 0.001).
Pairwise t-tests with Bonferroni corrections showed that SAS and DSAS performed significantly better than CMA-MAE,
[Figure 8.4 heatmaps over the measure space (human variation and distance between goals) for DSAS, SAS, CMA-MAE, MAP-Elites, and random search.]
Figure 8.4: Comparison of the final archive heatmaps in the shared control teleoperation domain.
MAP-Elites, and random search in the shared control teleoperation and collaboration I domains (p < 0.001). In the
collaboration II domain, they outperformed CMA-MAE (p < 0.001) and random search (p < 0.001), while there was
no significant difference with MAP-Elites. We attribute this to the fact that MAP-Elites can easily obtain diverse robot
path lengths by making small isotropic perturbations in the object positions. Fig. 8.3 shows the QD-Score as a function
of the number of evaluations. We see that both SAS and DSAS achieve a high QD-Score early in the search, indicating
high sample efficiency.
The comparison between SAS and DSAS showed mixed results, with SAS better in the collaboration I domain
(p < 0.001), DSAS better in the shared control teleoperation domain (p < 0.001), and no significance in the
collaboration II domain (p = 0.07). Previous work [87, 86] has shown DQD algorithms improving efficiency in very
high-dimensional search spaces by reducing the search from a high-dimensional solution space to a low-dimensional
objective-measure space. We conjecture that this explains the significant improvement in the shared control teleoperation
domain (9-dimensional solution space as opposed to 6-dimensional one in shared workspace collaboration) and we will
investigate higher-dimensional domains in future work.
Furthermore, in the shared workspace collaboration domains, where scenario evaluations last a couple of minutes
because of the larger task complexity, surrogate assistance showed wall-clock time efficiency (Fig. 8.3), unlike prior
work [21] that only showed sample efficiency improvements.
Fig. 8.4 shows heatmaps of the final archives in the shared control teleoperation domain. The heatmaps for
MAP-Elites and random search match the results from prior work [85]. These heatmaps show the advantage of
QD-Score as a comparison metric. A higher QD-Score implies that the archive is filled more completely and with higher-quality
solutions. For example, SAS and DSAS find failure scenarios in the lower right corner of the archive (nearly optimal
human and a large distance between the goal objects), whereas MAP-Elites fails to find failures in that region. Thus,
a designer would not have information about the HRI algorithm’s performance in some scenarios if the scenario
generation algorithm has a low QD-Score.
8.6.1 Real World Demo.
We wish to demonstrate that the generated failure scenarios are reproducible in the real world and are not just simulation
artifacts. Since the algorithms that we test have been shown to work robustly in practice [222, 148, 68, 147], and the
proposed approach discovers edge-case failures that are hard to find and rare in practice, a user study where users
can freely interact with the system would need to involve a very large number of subjects to observe these failures.
Furthermore, we would need to account for any safety concerns that arise from the unexpected robot behavior in the
corner cases. We view solving these challenges as beyond the scope of this paper, and here, we only show that these
failures will actually occur in the real world if users act in a certain way.
We recreate four example scenarios from the generated archives with a 6-DoF Gen2 Kinova JACO arm and by
having a user reproduce the motions of the simulated human. We track the human hand position with a Kinect v1 sensor
and the OpenNI package.
Incorrect robot motion because of delayed human goal inference (Fig. 8.5a): We select this scenario from the
archive generated by DSAS in the collaboration II domain. In this scenario, after the human finishes working on goal
G1 and the robot on G3, the robot is closer to G2 than its other remaining goal, G1. Based on the feasible goal set
formulation (Section 8.5), G2 becomes the goal-to-go for human candidate goals G1 and G3. Given that the combined
probability of the human going to either G1 or G3 is higher than the probability of the human going to G2, the robot
moves towards G2. However, once the robot realizes that the human is actually moving to G2 as well, the robot has to
move all the way back to goal G1, resulting in a significant delay.
Long robot motion with correct human goal inference (Fig. 8.5c): We additionally wish to find scenarios that
result in poor team performance that is not due to incorrect inference. We select a scenario from a SAS archive in the
collaboration I domain that had a low maximum wrong goal probability of 0.3. The poor performance here is caused by
the interaction between the robot’s policy and the object placement. As the robot moves between the two workspaces
following a straight line path, it reaches a configuration close to self-collision or to joint limits. This prompts the system
to re-plan and move the robot to a different configuration before continuing to move towards the goal. While re-planning
ensures task completion, it induces a significant delay compared to scenarios where the goals can be reached without
re-planning.
8.7 Discussion
8.7.1 Limitations.
Our approach scales surrogate assisted scenario generation from single-agent grid-world domains to complex human-robot interaction domains with continuous actions, environment dynamics, and object locations. However, our evaluation
domains consist of objects of the same type and simple human models. We note that SAS and DSAS are general
algorithms and we are excited about integrating them with more complex models of environments [88] and human
actions [149, 40], as well as leveraging high-fidelity human model simulators [290, 75] to improve realism. Additionally,
in our domains we were able to specify good test coverage with low-dimensional measure spaces. However, domains
where we wish to obtain good test coverage over a large number of attributes would require high-dimensional measure
spaces. Integrating centroidal Voronoi tessellation (CVT) based archives [274] or measure space dimensionality
reduction methods [51] will allow us to apply SAS and DSAS to such domains. Furthermore, our system does
not explain the reason behind the observed robot behavior in the generated scenarios, and future work will explore
integrating scenario generation with methods for failure explanation [57, 131]. Finally, while we focus on a single
human interacting with a single robot, we believe that our workspace occupancy-based approach for surrogate model
predictions can be extended to multi-human-robot team settings.
8.7.2 Implications.
We presented the SAS and DSAS scenario generation algorithms that accelerate QD scenario generation via surrogate
models. Results in a shared control teleoperation domain of previous work [85] and in a shared workspace collaboration
domain show significant improvements in search efficiency when generating diverse datasets of challenging scenarios.
For the first time in surrogate assisted scenario generation methods, we see improvements not only in sample
efficiency but also in wall-clock time in the shared workspace collaboration domain, where evaluations last a couple of
minutes. On the other hand, the additional computation in the inner loop of the surrogate assisted algorithms resulted in
(a) Scenario 1 (b) Scenario 2
(c) Scenario 3 (d) Scenario 4
Figure 8.5: Example scenarios recreated with a real robot. The purple line shows the simulated human path. Videos of
the simulated and recreated scenarios are included in the supplemental material.
more time required to match and exceed the performance of the baselines in the shared control teleoperation domain,
where scenario evaluations last only a few seconds. Thus, for running-time performance, we recommend surrogate
assisted methods in domains with expensive evaluations, in which the additional computation in the inner loop is offset
by the improvement in sample efficiency.
We additionally highlight an unexpected benefit of our system during development. When we tested the shared
workspace collaboration domain, SAS and DSAS discovered failure scenarios that exploited bugs in our implementation
which were subsequently fixed. For instance, some goal locations were reachable by the robot arm in the real world but
unreachable in simulation because of small errors in the robot’s URDF file, which prompted us to correct it.
Overall, we envision the proposed algorithms as a valuable tool to accelerate the development and testing of HRI
systems before user studies and deployment. We consider this an important step towards circumventing costly failures
and reducing the risk of human injuries, which is a critical milestone for widespread acceptance and use of HRI systems.
Chapter 9
Conclusion
9.1 Summary
In this dissertation, we focused on the problem of algorithmically generating scenarios to evaluate human robot
interaction algorithms. We argue that an adversarial attack that optimizes some objective function f to find a single failure
scenario is not enough to thoroughly evaluate a human robot interaction algorithm. Instead, this dissertation modeled
the problem as a quality diversity problem, optimizing an objective function f and measure functions m_i. This
allows our scenario generation system to find a diverse collection of failures to evaluate the human robot interaction
system.
In Chapter 2, we presented a case study in shared autonomy algorithms. We model the problem of finding diverse
failures as a quality diversity problem where we directly search over scenario parameters. In the case of shared
autonomy, our scenario parameters include objects on a table and waypoints in a human trajectory. We directly optimize
the scenario parameters with the quality diversity algorithm MAP-Elites [53, 202].
This initial scenario generation system showed the promise of quality diversity optimization as a way to test human
robot interaction algorithms. However, we noticed several limitations in the proposed system. (1) Being based on
genetic algorithms, MAP-Elites is not an efficient optimization algorithm and requires many evaluations to achieve
good results. (2) Scenarios in shared autonomy are relatively simple, as they consist only of a few objects on a table.
We need the system to scale to larger environments to cover the general case of humans interacting with robots. (3)
Complex scenarios have complex validity constraints. We need a method to constrain the search space of scenarios to a
space of valid scenarios. (4) The evaluations of shared autonomy take at most 10 seconds. The proposed algorithm
takes too many simulation evaluations to achieve good results. In order to scale to complex scenarios, we need to
improve the sample efficiency of evaluation. The middle chapters of the dissertation focus on each of these limitations.
Chapter 3 addresses limitation (1) by formalizing the problem of quality diversity optimization and deriving quality
diversity algorithms based on modern optimization algorithms. First, we show how the quality diversity problem can be
reformulated as a non-stationary single objective optimization problem by maximizing improvement over a changing
archive of solutions. This allowed us to adapt the state-of-the-art black-box optimization algorithm CMA-ES [123] to the
problem of quality diversity optimization. Next, we formalized the problem of differentiable quality diversity (DQD)
optimization, where the objective f and measures mi are first-order differentiable functions. We propose the first DQD
algorithms OMG-MEGA and CMA-MEGA based on gradient ascent. We showed the quality diversity problem can be
solved orders of magnitude more efficiently by leveraging gradient information. Finally, colleagues highlighted three
limitations of the proposed algorithms. We proposed the notion of soft archives, which adjust how quickly the discount
function f_A changes when we optimize to maximize the improvement f − f_A in the non-stationary optimization problem.
This simple change addresses all three limitations.
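A minimal sketch of the soft-archive improvement signal, where a per-cell threshold plays the role of the discount f_A and an archive learning rate controls how quickly it changes; the parameter values are illustrative.

def soft_archive_improvement(thresholds, cell, f, alpha=0.1):
    """Improvement signal f - f_A over a soft archive. The per-cell threshold acts as the
    discount f_A; with alpha = 1 this reduces to the usual replace-if-better archive,
    while smaller alpha makes f_A change more slowly."""
    f_A = thresholds.get(cell, 0.0)
    improvement = f - f_A
    if improvement > 0:
        thresholds[cell] = (1 - alpha) * f_A + alpha * f  # soft update of the discount
    return improvement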
Chapter 4 addresses issue (2) by scaling scenario generation to complex and realistic environments. We propose
generating Mario levels by training a generative model on a dataset of video game levels. Instead of searching over
environment parameters directly with quality diversity optimization, we propose latent space illumination, which
searches over the latent space of the generative model of environments. Our method outperforms latent space
optimization methods that optimize the latent space with CMA-ES, as well as random sampling of the latent space.
By searching over the latent space of a generative model, we constrain the search space of environments to those that
match a given dataset.
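As an illustration, latent space illumination can be implemented with a QD library such as pyribs by treating the latent vector as the search space. The sketch below uses placeholder generator and evaluation functions and assumes an API shaped like recent pyribs releases; exact signatures may differ across versions.

import numpy as np
from ribs.archives import GridArchive
from ribs.emitters import EvolutionStrategyEmitter
from ribs.schedulers import Scheduler

LATENT_DIM = 32  # dimensionality of the generator's latent space (illustrative)

def generator(z):
    # Placeholder for a pretrained generative model G(z) -> level.
    return z.reshape(4, 8)

def evaluate_level(level):
    # Placeholder evaluation returning an objective and two measures in (0, 1).
    f = float(np.exp(-np.abs(level).mean()))
    m = 1.0 / (1.0 + np.exp(-level.flatten()[:2]))
    return f, m

archive = GridArchive(solution_dim=LATENT_DIM, dims=[20, 20], ranges=[(0, 1), (0, 1)])
emitters = [EvolutionStrategyEmitter(archive, x0=np.zeros(LATENT_DIM), sigma0=0.5, ranker="imp")]
scheduler = Scheduler(archive, emitters)

for _ in range(100):
    latents = scheduler.ask()          # candidate latent vectors from the emitters
    objectives, measures = [], []
    for z in latents:
        f, m = evaluate_level(generator(z))
        objectives.append(f)
        measures.append(m)
    scheduler.tell(objectives, measures)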
Chapter 5 addresses issue (3) by introducing hard validity constraints to generative models. For example, when
producing a kitchen environment, we may want to guarantee that the kitchen is valid by making key objects reachable by
both the robot and human, making sure that the boundary is surrounded so either agent cannot escape the scenario, and
restricting the number of robots and humans in the environment. We show that such hard constraints can be modeled
as mixed integer programming, where the cost function for a repair is the minimum edit distance between an invalid
environment and a valid one. This allows us to constrain the output of a generative model, rather than changing the
generative model architecture to constrain the search space to the space of valid scenarios. Finally, we show that the
generative model can be made end-to-end differentiable, by backpropagating gradients through the MIP repair.
Chapter 6 shows that issues (1), (2), and (3) can be addressed jointly in a quality diversity scenario generation
system. As a case study, we generate kitchen environments for the video game Overcooked. We train a generative model
on a dataset of kitchens. We constrain kitchen environments to valid kitchen environments solvable by a human-robot
team via mixed integer programming. Finally, we evaluate the human-robot team on the kitchen environments, which
results in the team performance f and measures m_i that characterize diverse coordination behaviors. By optimizing the
latent space with our CMA-ME algorithm, we efficiently search for environments that create imbalance in coordination
between agents.
Chapter 7 addresses issue (4). We propose leveraging surrogate models to replace the expensive evaluation of
human robot interaction simulation. Our method optimizes the surrogate objective and measures via quality diversity
optimization. As the surrogate model is end-to-end differentiable, we can leverage DQD algorithms to optimize the
scenario parameters, rather than black-box quality diversity algorithms. We then label the solutions found by QD by
evaluating the scenarios in simulation. This labeled data helps correct the errors in the surrogate model. Over time
the surrogate model becomes more accurate in predicting the objectives and measures, allowing the QD algorithm
to discover diverse and high-quality scenarios. Our key insight is that quality diversity optimization allows us to obtain
diverse data to train a surrogate model, while a surrogate model can improve the sample efficiency of quality diversity
optimization.
Chapter 8 combines the insights from the middle chapters of the dissertation to evaluate more complex human robot
interaction. We propose evaluating a package labeling task where a human is writing addresses on packages, while the
robot stamps the packages. A surrogate model predicts the human-robot team's performance, i.e., the outcome of each
scenario. We constrain the environment with mixed integer programming to ensure
all objects are placed in valid workspaces on a table and do not overlap. We show that our more complex scenario
generation system is more efficient at scenario generation than the simple system proposed in Chapter 2, highlighting
the benefits of addressing issues 1-4 in quality diversity scenario generation.
Overall, we presented a general framework for quality diversity scenario generation. Our general framework can be
tailored to any human-robot interaction system if the scenario can be modeled as searchable parameters. While this
dissertation focused on quality diversity scenario generation, we derive state-of-the-art quality diversity algorithms for
both the black-box setting and domains where first-order gradient information is available.
9.2 Future Directions
In this dissertation, we proposed a general quality diversity scenario generation framework and improved upon that
system by highlighting four limitations in the initial framework. This section highlights additional limitations that need
to be addressed to improve quality diversity scenario generation. (1) Quality diversity algorithms still require discrete
approximations of the quality diversity problem. (2) Quality diversity algorithms fail to scale to high dimensional
measure spaces. (3) We must manually specify measure functions to test a human-robot interaction algorithm. (4)
We directly model human actions as waypoints. (5) Generative models in this dissertation are based on generative
adversarial networks, rather than current state-of-the-art diffusion models.
To address limitation (1), we must derive continuous quality diversity optimization algorithms that optimize a continuous
form of the quality diversity problem, rather than a discrete approximation. Prior work [175] suggests that continuous
counterparts exist for the discrete version of the problem. A continuous version of the differentiable quality diversity
problem would allow the CMA-ES component of the algorithm to be removed as we could compute the gradient
of maximum improvement in measure space analytically rather than via an approximation. This would allow for
end-to-end differentiable quality diversity algorithms.
Prior work has looked to address limitation (2) via methods like centroidal Voronoi tessellation [274]. However,
the curse of dimensionality prevents quality diversity algorithms from efficiently filling the measure space. Instead,
future algorithms should rely on generalization similar to what we observe in high dimensional generative models and
contrastive models like CLIP [231]. Generalization will allow for novel queries to be processed at test time and would
result in more useful quality diversity algorithms that scale to large dimensional measure spaces.
To address limitation (3), prior work has tried to learn the measure functions via techniques like dimensionality
reduction [51]. These approaches assume a low dimensional manifold of the high dimensional measure space exists
and can better model the quality diversity problem. However, in many quality diversity problems there is a mismatch
between the intrinsic diversity of the QD problem and the observed diversity from human robot interaction. Future
work should aim to learn measure functions in a self-supervised way from data. A promising research direction is to
develop a model like CLIP [231] that maps search space to measure space in an automated way. This would remove the
burden of manually specifying measures.
To address limitation (4), future work will extend the advancements in quality diversity reinforcement learning [226,
211, 267] to modelling human behavior. The current state-of-the-art QD-RL algorithm PPGA [16] is capable of solving
complex humanoid locomotion problems. Future work should leverage imitation learning techniques combined with
PPGA to learn human behaviors from data. QD-RL algorithms can learn a diverse collection of agents that could
replace directly modelling human actions. Then a scenario generation system could directly search over both models of
human behavior and the latent space of a generative model of environments. This will allow us to not only capture
the realism and diversity of human environments, but the realism and diversity of human behavior as well to more
thoroughly evaluate human-robot interaction systems.
Finally, current work relies on training generative models of environments via generative adversarial networks
(GAN). To address limitation (5), we must scale scenario generation systems to state-of-the-art diffusion models. Unlike
GANs, gradients cannot be easily backpropagated from the output of the network to the latent space. This makes
methods like latent space illumination via differentiable quality diversity impossible. Current methods apply black-box
QD optimization to diffusion models. Future work will investigate how gradient information could improve quality
diversity algorithms searching the latent space of diffusion models.
This dissertation is only the dawn of quality diversity scenario generation. As the field matures, scenario generation
algorithms will scale to larger scale scenarios, require less fine tuning to new problems, and evaluate systems more
efficiently than current algorithms. Future systems will automatically search realistic scenarios by leveraging data on
both human environments and human behavior, leading to safer robots in the world.
Bibliography
[1] Daniel Aarno, Staffan Ekvall, and Danica Kragic. “Adaptive virtual fixtures for machine-assisted teleoperation
tasks”. In: Proceedings of the 2005 IEEE international conference on robotics and automation. IEEE. 2005,
pp. 1139–1144.
[2] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek. “Generating Adversarial Driving Scenarios in
High-Fidelity Simulators”. In: 2019 International Conference on Robotics and Automation (ICRA). May 2019,
pp. 8271–8277. DOI: 10.1109/ICRA.2019.8793740.
[3] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. “Bidirectional relation between CMA
evolution strategies and natural evolution strategies”. In: International Conference on Parallel Problem Solving
from Nature. Springer. 2010, pp. 154–163.
[4] Rachid Alami, Aurélie Clodic, Vincent Montreuil, Emrah Akin Sisbot, and Raja Chatila. “Toward
Human-Aware Robot Task Planning.” In: AAAI spring symposium: to boldly go where no human-robot team
has gone before. 2006, pp. 39–46.
[5] Alberto Alvarez, Steve Dahlskog, Jose Font, and Julian Togelius. “Empowering Quality Diversity in Dungeon
Design with Interactive Constrained MAP-Elites”. In: June 2019.
[6] Brandon Amos and J Zico Kolter. “Optnet: Differentiable optimization as a layer in neural networks”. In:
Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org. 2017,
pp. 136–145.
[7] Dejanira Araiza-Illan, David Western, Anthony Pipe, and Kerstin Eder. “Coverage-Driven Verification—”. In:
Haifa Verification Conference. Springer. 2015, pp. 69–84.
[8] Dejanira Araiza-Illan, David Western, Anthony G Pipe, and Kerstin Eder. “Systematic and realistic testing in
simulation of control code for robots in collaborative human-robot interactions”. In: Annual Conference
Towards Autonomous Robotic Systems. Springer. 2016, pp. 20–32.
[9] James Arnold and Rob Alexander. “Testing Autonomous Robot Control Software Using Procedural Content
Generation”. In: Proceedings of the 32Nd International Conference on Computer Safety, Reliability, and
Security - Volume 8153. SAFECOMP 2013. Toulouse, France: Springer-Verlag New York, Inc., 2013,
pp. 33–44. ISBN: 978-3-642-40792-5. DOI: 10.1007/978-3-642-40793-2_4.
[10] Kai Arulkumaran, Antoine Cully, and Julian Togelius. “AlphaStar: An Evolutionary Computation Perspective”.
In: GECCO ’19: Proceedings of the Genetic and Evolutionary Computation Conference Companion. Ed. by
Manuel López-Ibáñez. Prague, Czech Republic: ACM, 2019. ISBN: 978-1-4503-6748-6.
[11] Anne Auger and Nikolaus Hansen. “A restart CMA evolution strategy with increasing population size”. In:
2005 IEEE congress on evolutionary computation. Vol. 2. IEEE. 2005, pp. 1769–1776.
[12] Maren Awiszus, Frederik Schubert, and Bodo Rosenhahn. “TOAD-GAN: coherent style level generation from a
single example”. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital
Entertainment. Vol. 16. 1. 2020, pp. 10–16.
[13] Chris L Baker, Joshua B Tenenbaum, and Rebecca R Saxe. “Goal inference as inverse planning”. In:
Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 29. 2007.
[14] Samuel Barrett, Noa Agmon, Noam Hazon, Sarit Kraus, and Peter Stone. “Communicating with Unknown
Teammates.” In: ECAI. 2014, pp. 45–50.
[15] Thomas Bartz-Beielstein. “A survey of model-based methods for global optimization”. In: Bioinspired
Optimization Methods and Their Applications (2016).
[16] Sumeet Batra, Bryon Tjanaka, Matthew Christopher Fontaine, Aleksei Petrenko, Stefanos Nikolaidis, and
Gaurav S. Sukhatme. “Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning”.
In: The Twelfth International Conference on Learning Representations. 2024.
[17] Robin Baumgarten. Infinite Super Mario AI. 2009. URL:
https://wobblylabs.com/projects/marioai.
[18] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. “The arcade learning environment: An
evaluation platform for general agents”. In: Journal of Artificial Intelligence Research 47 (2013), pp. 253–279.
[19] Dimitris Bertsimas and John Tsitsiklis. “Simulated annealing”. In: Statistical science 8.1 (1993), pp. 10–15.
[20] Aditya Bhatt, Scott Lee, Fernando de Mesentier Silva, Connor W. Watson, Julian Togelius, and Amy K. Hoover.
“Exploring the Hearthstone Deck Space”. In: Proceedings of the 13th International Conference on the
Foundations of Digital Games. ACM, 2018, p. 18.
[21] Varun Bhatt, Bryon Tjanaka, Matthew Fontaine, and Stefanos Nikolaidis. “Deep surrogate assisted generation
of environments”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 37762–37777.
[22] Rafał Biedrzycki. “Handling bound constraints in CMA-ES: An experimental study”. In: Swarm and
Evolutionary Computation 52 (2020), p. 100627.
[23] Zeungnam Bien, Myung-Jin Chung, Pyung-Hun Chang, Dong-Soo Kwon, Dae-Jin Kim, Jeong-Su Han,
Jae-Hean Kim, Do-Hyung Kim, Hyung-Soon Park, Sang-Hoon Kang, et al. “Integration of a rehabilitation
robotic system (KARES II) with human-friendly man-machine interaction units”. In: Autonomous robots 16.2
(2004), pp. 165–191.
[24] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. “Optimizing the latent space of
generative networks”. In: arXiv preprint arXiv:1707.05776 (2017).
[25] Philip Bontrager, Aditi Roy, Julian Togelius, Nasir Memon, and Arun Ross. “Deepmasterprints: Generating
masterprints for dictionary attacks via latent variable evolution”. In: 2018 IEEE 9th International Conference
on Biometrics Theory, Applications and Systems (BTAS). IEEE. 2018, pp. 1–9.
[26] David M Bossens and Danesh Tarapore. “QED: Using Quality-Environment-Diversity to Evolve Resilient
Robot Swarms”. In: IEEE Transactions on Evolutionary Computation (2020).
[27] Léon Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer, 2012,
pp. 421–436.
[28] Jonathan C Brant and Kenneth O Stanley. “Diversity preservation in minimal criterion coevolution through
resource limitation”. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 2020,
pp. 58–66.
[29] Jonathan C Brant and Kenneth O Stanley. “Minimal criterion coevolution: a new approach to open-ended
search”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2017, pp. 67–74.
[30] Alexander Broad, Todd Murphey, and Brenna Argall. “Operation and imitation under safety-aware shared
control”. In: International Workshop on the Algorithmic Foundations of Robotics. Springer. 2018, pp. 905–920.
[31] David Brookes, Akosua Busia, Clara Fannjiang, Kevin Murphy, and Jennifer Listgarten. “A view of estimation
of distribution algorithms through the lens of expectation-maximization”. In: Proceedings of the 2020 Genetic
and Evolutionary Computation Conference Companion. 2020, pp. 189–190.
[32] Noam Brown and Tuomas Sandholm. “Superhuman AI for Multiplayer Poker”. In: Science (2019).
[33] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender
classification”. In: Conference on fairness, accountability and transparency. PMLR. 2018, pp. 77–91.
[34] Tom Carlson and Yiannis Demiris. “Collaborative control for a robotic wheelchair: evaluation of performance,
attention, and workload”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42.3
(2012), pp. 876–888.
[35] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. “On the
utility of learning about humans for human-AI coordination”. In: Advances in Neural Information Processing
Systems. 2019, pp. 5174–5185.
[36] Leo Cazenille, Nicolas Bredeche, and Nathanael Aubert-Kato. “Exploring self-assembling behaviors in a
swarm of bio-micro-robots using surrogate-assisted map-elites-elites”. In: 2019 IEEE Symposium Series on
Computational Intelligence (SSCI). IEEE. 2019, pp. 238–246.
[37] Tathagata Chakraborti, Sarath Sreedharan, and Subbarao Kambhampati. “Human-aware planning revisited: A
tale of three models”. In: Proc. of the IJCAI/ECAI 2018 Workshop on EXplainable Artificial Intelligence (XAI).
That paper was also published in the Proc. of the ICAPS 2018 Workshop on
EXplainable AI Planning (XAIP). 2018, pp. 18–25.
[38] Konstantinos Chatzilygeroudis, Antoine Cully, Vassilis Vassiliades, and Jean-Baptiste Mouret.
“Quality-Diversity Optimization: a novel branch of stochastic optimization”. In: Black Box Optimization,
Machine Learning, and No-Free Lunch Theorems. Springer, 2021, pp. 109–135.
[39] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and
Ilya Sutskever. “Generative Pretraining from Pixels”. In: Proceedings of the 37th International Conference on
Machine Learning. 2020.
[40] Min Chen, Stefanos Nikolaidis, Harold Soh, David Hsu, and Siddhartha Srinivasa. “Trust-aware decision
making for human-robot collaboration: Model learning and planning”. In: ACM Transactions on Human-Robot
Interaction (THRI) 9.2 (2020), pp. 1–23.
[41] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic Gridworld Environment for OpenAI
Gym. https://github.com/maximecb/gym-minigrid. 2018.
[42] Jiyoung Choi. “Model checking for decision making behaviour of heterogeneous multi-agent autonomous
system”. PhD thesis. Cranfield University, 2013.
[43] Tae Jong Choi and Julian Togelius. “Self-referential quality diversity through differential MAP-Elites”. In:
Proceedings of the Genetic and Evolutionary Computation Conference. 2021, pp. 502–509.
[44] Yoeng-Jin Chu. “On the shortest arborescence of a directed graph”. In: Scientia Sinica 14 (1965),
pp. 1396–1400.
[45] Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. “Leveraging procedural generation to benchmark
reinforcement learning”. In: Proceedings of the International Conference on Machine Learning. 2020.
[46] Cédric Colas, Vashisht Madhavan, Joost Huizinga, and Jeff Clune. “Scaling map-elites to deep neuroevolution”.
In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 2020, pp. 67–75.
[47] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O. Stanley, and Jeff Clune.
“Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of
Novelty-Seeking Agents”. In: Proceedings of the 32Nd International Conference on Neural Information
Processing Systems. NIPS’18. Montréal, Canada: Curran Associates Inc., 2018, pp. 5032–5043. URL:
http://dl.acm.org/citation.cfm?id=3327345.3327410.
[48] Jacob W Crandall and Michael A Goodrich. “Characterizing efficiency of human robot interaction: A case
study of shared-control teleoperation”. In: IEEE/RSJ international conference on intelligent robots and systems.
Vol. 2. IEEE. 2002, pp. 1290–1295.
[49] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and
Edward Raff. “Vqgan-clip: Open domain image generation and editing with natural language guidance”. In:
European Conference on Computer Vision. Springer. 2022, pp. 88–105.
[50] Giuseppe Cuccu, Julian Togelius, and Philippe Cudré-Mauroux. “Playing atari with six neurons”. In:
Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International
Foundation for Autonomous Agents and Multiagent Systems. 2019, pp. 998–1006.
[51] Antoine Cully. “Autonomous Skill Discovery with Quality-Diversity and Unsupervised Descriptors”. In:
Proceedings of the Genetic and Evolutionary Computation Conference. GECCO ’19. ACM, 2019, pp. 81–89.
[52] Antoine Cully. “Multi-emitter MAP-elites: improving quality, diversity and data efficiency with heterogeneous
sets of emitters”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2021, pp. 84–92.
[53] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. “Robots that can adapt like animals”.
In: Nature 521.7553 (2015), p. 503.
[54] Antoine Cully and Yiannis Demiris. “Quality and diversity optimization: A unifying modular framework”. In:
IEEE Transactions on Evolutionary Computation 22.2 (2017), pp. 245–259.
[55] Antoine Cully and Jean-Baptiste Mouret. “Behavioral Repertoire Learning in Robotics”. In: Proceedings of the
15th Annual Conference on Genetic and Evolutionary Computation (GECCO ‘13). ACM, 2013, pp. 175–182.
[56] Antoine Cully and Jean-Baptiste Mouret. “Evolving a Behavioral Repertoire for a Walking Robot”. In:
Evolutionary Computation 24 (1 2016), pp. 59–88.
[57] Devleena Das, Siddhartha Banerjee, and Sonia Chernova. “Explainable AI for Robot Failures: Generating
Explanations That Improve User Assistance in Fault Recovery”. In: Proceedings of the ACM/IEEE
International Conference on Human-Robot Interaction. 2021. DOI: 10.1145/3434073.3444657.
[58] Cedric Decoster, Jean Seong Bjorn Choe, et al. Sabberstone. Accessed: 2019-11-01. 2019. URL:
https://github.com/HearthSim/SabberStone.
[59] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han,
Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. “ProcTHOR: Large-Scale Embodied AI Using
Procedural Generation”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 5982–5994.
[60] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre M. Bayen, Stuart Russell, Andrew Critch, and
Sergey Levine. “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”. In:
Advances in Neural Information Processing Systems 33. 2020. URL: https://proceedings.neurips.
cc/paper/2020/hash/985e9a46e10005356bbaf194249f6856-Abstract.html.
[61] Jyotirmoy Deshmukh, Marko Horvat, Xiaoqing Jin, Rupak Majumdar, and Vinayak S Prabhu. “Testing
cyber-physical systems through bayesian optimization”. In: ACM Transactions on Embedded Computing
Systems (TECS) 16.5s (2017), pp. 1–18.
[62] Jyotirmoy Deshmukh, Xiaoqing Jin, James Kapinski, and Oded Maler. “Stochastic local search for falsification
of hybrid systems”. In: International Symposium on Automated Technology for Verification and Analysis.
Springer. 2015, pp. 500–517.
[63] Aaron Dharna, Amy K Hoover, Julian Togelius, and Lisa Soros. “Transfer Dynamics in Emergent Evolutionary
Curricula”. In: IEEE Transactions on Games (2022).
[64] Aaron Dharna, Julian Togelius, and Lisa B Soros. “Co-generation of game levels and game-playing agents”. In:
Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. 2020.
[65] Rosen Diankov and James Kuffner. “Openrave: A planning architecture for autonomous robotics”. In: Robotics
Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-08-34 79 (2008).
[66] Carl Doersch. “Tutorial on variational autoencoders”. In: arXiv preprint arXiv:1606.05908 (2016).
[67] Finale Doshi and Nicholas Roy. “Efficient model learning for dialog management”. In: Proceedings of the
ACM/IEEE international conference on Human-robot interaction. ACM. 2007, pp. 65–72.
[68] A.D. Dragan and S.S. Srinivasa. “Formalizing Assistive Teleoperation”. In: Proc. Robotics: Science and
Systems Conference. 2012.
[69] Anca D Dragan and Siddhartha S Srinivasa. “A policy-blending formalism for shared control”. In: The
International Journal of Robotics Research 32.7 (2013), pp. 790–805.
[70] Sam Earle, Justin Snider, Matthew C Fontaine, Stefanos Nikolaidis, and Julian Togelius. “Illuminating diverse
neural cellular automata for level generation”. In: Proceedings of the Genetic and Evolutionary Computation
Conference. 2022, pp. 68–76.
[71] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. “First return, then explore”.
In: Nature 590.7847 (2021), pp. 580–586.
[72] A. E. Eiben, R. Hinterding, and Z. Michalewicz. “Parameter control in evolutionary algorithms”. In: IEEE
Transactions on Evolutionary Computation 3.2 (July 1999), pp. 124–141. DOI: 10.1109/4235.771166.
[73] Blizzard Entertainment. Hearthstone. https://playhearthstone.com/en-us/. Accessed:
2019-11-01.
[74] Sixense Entertainment. “Razer Hydra. 2011”. In: ().
[75] Zackory Erickson, Vamsee Gangaram, Ariel Kapusta, C Karen Liu, and Charles C Kemp. “Assistive gym: A
physics simulation framework for assistive robotics”. In: Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA). 2020.
[76] Andrew H Fagg, Michael Rosenstein, Robert Platt, and Roderic A Grupen. “Extracting user intent in mixed
initiative teleoperator control”. In: Proceedings of the American Institute of Aeronautics and Astronautics
Intelligent Systems Technical Conference. 2004.
[77] Kuan Fang, Toki Migimatsu, Ajay Mandlekar, Li Fei-Fei, and Jeannette Bohg. “Active task randomization:
Learning visuomotor skills for sequential manipulation by proposing feasible and novel tasks”. In: CoRR
(2022).
[78] Kuan Fang, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. “Discovering generalizable skills via automated
generation of diverse tasks”. In: Robotics: Science and Systems (2021).
[79] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. “Testing the manifold hypothesis”. In: Journal of
the American Mathematical Society 29.4 (2016), pp. 983–1049.
[80] Aaron Ferber, Bryan Wilder, Bistra Dilkina, and Milind Tambe. “MIPaaL: Mixed integer program as a layer”. In:
arXiv preprint arXiv:1907.05912 (2019).
[81] Stefano Fioravanzo and Giovanni Iacca. “Evaluating MAP-Elites on constrained optimization problems”. In:
Proceedings of the Genetic and Evolutionary Computation Conference. 2019, pp. 99–106.
[82] Stefano Fioravanzo and Giovanni Iacca. “Evaluating MAP-Elites on constrained optimization problems”. In:
arXiv preprint arXiv:1902.00703 (2019).
[83] Jaime F Fisac, Andrea Bajcsy, Sylvia L Herbert, David Fridovich-Keil, Steven Wang, Claire J Tomlin, and
Anca D Dragan. “Probabilistically safe robot planning with confidence-based human predictions”. In: arXiv
preprint arXiv:1806.00109 (2018).
[84] Manon Flageat and Antoine Cully. “Fast and stable MAP-Elites in noisy domains using deep grids”. In:
Artificial Life Conference Proceedings. MIT Press. 2020, pp. 273–282.
[85] Matthew Fontaine and Stefanos Nikolaidis. “A Quality Diversity Approach to Automatically Generating
Human-Robot Interaction Scenarios in Shared Autonomy”. In: Robotics: Science and Systems (2021).
[86] Matthew Fontaine and Stefanos Nikolaidis. “Covariance matrix adaptation map-annealing”. In: Proceedings of
the Genetic and Evolutionary Computation Conference. 2023, pp. 456–465.
[87] Matthew Fontaine and Stefanos Nikolaidis. “Differentiable quality diversity”. In: Advances in Neural
Information Processing Systems 34 (2021), pp. 10040–10052.
[88] Matthew C Fontaine, Ya-Chuan Hsu, Yulun Zhang, Bryon Tjanaka, and Stefanos Nikolaidis. “On the
importance of environments in human-robot coordination”. In: Robotics: Science and Systems (RSS) (2021).
[89] Matthew C Fontaine, Ruilin Liu, Ahmed Khalifa, Jignesh Modi, Julian Togelius, Amy K Hoover, and
Stefanos Nikolaidis. “Illuminating mario scenes in the latent space of a generative adversarial network”. In:
Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 7. 2021, pp. 5922–5930.
[90] Matthew C Fontaine and Stefanos Nikolaidis. “Evaluating Human–Robot Interaction Algorithms in Shared
Autonomy via Quality Diversity Scenario Generation”. In: ACM Transactions on Human-Robot Interaction
(THRI) 11.3 (2022), pp. 1–30.
[91] Matthew C Fontaine, Julian Togelius, Stefanos Nikolaidis, and Amy K Hoover. “Covariance matrix adaptation
for the rapid illumination of behavior space”. In: Proceedings of the 2020 genetic and evolutionary
computation conference. 2020, pp. 94–102.
[92] Matthew C. Fontaine, Scott Lee, L. B. Soros, Fernando de Mesentier Silva, Julian Togelius, and
Amy K. Hoover. “Mapping Hearthstone Deck Spaces through MAP-Elites with Sliding Boundaries”. In:
Proceedings of the Genetic and Evolutionary Computation Conference. GECCO ’19. Prague, Czech Republic:
ACM, 2019, pp. 161–169. ISBN: 978-1-4503-6111-8. DOI: 10.1145/3321707.3321794.
[93] Kevin Frans, Lisa Soros, and Olaf Witkowski. “Clipdraw: Exploring text-to-drawing synthesis through
language-image encoders”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 5207–5218.
[94] Daniel J Fremont, Tommaso Dreossi, Shromona Ghosh, Xiangyu Yue, Alberto L Sangiovanni-Vincentelli, and
Sanjit A Seshia. “Scenic: a language for scenario specification and scene generation”. In: Proceedings of the
40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2019.
[95] Thomas Gabor, Andreas Sedlmeier, Marie Kiermeier, Thomy Phan, Marcel Henrich, Monika Pichlmair,
Bernhard Kempter, Cornel Klein, Horst Sauer, Reiner Schmid, et al. “Scenario co-evolution for
reinforcement learning on a grid world smart factory domain”. In: Proceedings of the Genetic and Evolutionary
Computation Conference. 2019.
[96] Adam Gaier, Alexander Asteroth, and Jean-Baptiste Mouret. “Are Quality Diversity Algorithms Better at
Generating Stepping Stones than Objective-Based Search?” In: Proceedings of the Genetic and Evolutionary
Computation Conference Companion. ACM. 2019, pp. 115–116.
[97] Adam Gaier, Alexander Asteroth, and Jean-Baptiste Mouret. “Data-efficient design exploration through
surrogate-assisted illumination”. In: Evolutionary computation 26.3 (2018), pp. 381–410.
[98] Adam Gaier, Alexander Asteroth, and Jean-Baptiste Mouret. “Discovering representations for black-box
optimization”. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 2020,
pp. 103–111.
[99] Adam Gaier, James Stoddart, Lorenzo Villaggi, and Peter J Bentley. “T-DominO: Exploring Multiple Criteria
with Quality-Diversity and the Tournament Dominance Objective”. In: Proceedings of the 17th International
Conference on Parallel Problem Solving from Nature. 2022.
[100] Raluca D Gaina, Adrien Couëtoux, Dennis JNJ Soemers, Mark HM Winands, Tom Vodopivec,
Florian Kirchgeßner, Jialin Liu, Simon M Lucas, and Diego Perez-Liebana. “The 2016 two-player gvgai
competition”. In: IEEE Transactions on Games 10.2 (2017), pp. 209–220.
[101] Federico A Galatolo, Mario GCA Cimino, and Gigliola Vaglini. “Generating images from caption and vice
versa via CLIP-Guided Generative Latent Space Search”. In: arXiv preprint arXiv:2102.01645 (2021).
[102] Alessio Gambi, Marc Mueller, and Gordon Fraser. “Automatically Testing Self-driving Cars with Search-based
Procedural Content Generation”. In: Proceedings of the 28th ACM SIGSOFT International Symposium on
Software Testing and Analysis. ISSTA 2019. Beijing, China: ACM, 2019, pp. 318–328. ISBN:
978-1-4503-6224-5. DOI: 10.1145/3293882.3330566.
[103] Pablo Garcia-Sanchez, Alberto Tonda, Antonio J Fernandez-Leiva, and Carlos Cotta. “Optimizing hearthstone
agents using an evolutionary algorithm”. In: Knowledge-Based Systems 188 (2020), p. 105032.
[104] Edoardo Giacomello, Pier Luca Lanzi, and Daniele Loiacono. “DOOM level generation using generative
adversarial networks”. In: 2018 IEEE Games, Entertainment, Media Conference (GEM). IEEE. 2018,
pp. 316–323.
[105] Jeremy H Gillula, Gabriel M Hoffmann, Haomiao Huang, Michael P Vitus, and Claire J Tomlin. “Applications
of hybrid reachability analysis to robotic aerial vehicles”. In: The International Journal of Robotics Research
30.3 (2011), pp. 335–354.
[106] Tobias Glasmachers, Tom Schaul, Sun Yi, Daan Wierstra, and Jürgen Schmidhuber. “Exponential natural
evolution strategies”. In: Proceedings of the 12th annual conference on Genetic and evolutionary computation.
ACM. 2010, pp. 393–400.
[107] Andrew V. Goldberg and Robert E. Tarjan. “Efficient maximum flow algorithms”. In: Communications of the
ACM 57.8 (2014), pp. 82–89.
[108] Matthew Gombolay, Anna Bair, Cindy Huang, and Julie Shah. “Computational design of mixed-initiative
human–robot teaming that considers human factors: situational awareness, workload, and workflow
preferences”. In: The International journal of robotics research 36.5-7 (2017), pp. 597–617.
[109] Miguel González-Duque, Rasmus Berg Palm, David Ha, and Sebastian Risi. “Finding Game Levels with the
Right Difficulty in a Few Trials through Intelligent Trial-and-Error”. In: arXiv preprint arXiv:2005.07677
(2020).
[110] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. “Generative Adversarial Nets”. In: Advances in Neural Information
Processing Systems. 2014, pp. 2672–2680.
[111] Deepak Gopinath, Siddarth Jain, and Brenna D Argall. “Human-in-the-loop optimization of shared autonomy
in assistive robotics”. In: IEEE Robotics and Automation Letters 2.1 (2016), pp. 247–254.
[112] Ronald L Graham, Donald E Knuth, Oren Patashnik, and Stanley Liu. “Concrete mathematics: a foundation for
computer science”. In: Computers in Physics 3.5 (1989), pp. 106–107.
[113] Daniele Gravina, Ahmed Khalifa, Antonios Liapis, Julian Togelius, and Georgios N Yannakakis. “Procedural
content generation through quality diversity”. In: 2019 IEEE Conference on Games (CoG). IEEE. 2019,
pp. 1–8.
[114] Michael Cerny Green, Luvneesh Mugrai, Ahmed Khalifa, and Julian Togelius. “Mario level generation from
mechanics using scene stitching”. In: 2020 IEEE Conference on Games (CoG). IEEE. 2020, pp. 49–56.
[115] Luca Grillotti and Antoine Cully. “Relevance-guided unsupervised discovery of abilities with quality-diversity
algorithms”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2022, pp. 77–85.
[116] David Güera and Edward J Delp. “Deepfake video detection using recurrent neural networks”. In: 2018 15th
IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE. 2018, pp. 1–6.
[117] Matthew Guzdial and Mark O Riedl. “Game Level Generation from Gameplay Videos.” In: AIIDE. 2016,
pp. 44–50.
[118] Alexander Hagg, Sebastian Berns, Alexander Asteroth, Simon Colton, and Thomas Bäck. “Expressivity of
Parameterized and Data-Driven Representations in Quality Diversity Search”. In: Proceedings of the Genetic
and Evolutionary Computation Conference. GECCO ’21. Lille, France: Association for Computing Machinery,
2021, pp. 678–686. ISBN: 9781450383509. DOI: 10.1145/3449639.3459287.
[119] Alexander Hagg, Dominik Wilde, Alexander Asteroth, and Thomas Bäck. “Designing Air Flow with
Surrogate-assisted Phenotypic Niching”. In: International Conference on Parallel Problem Solving from Nature.
Springer. 2020, pp. 140–153.
[120] Andreas Hald, Jens Struckmann Hansen, Jeppe Kristensen, and Paolo Burelli. “Procedural Content Generation
of Puzzle Games using Conditional Generative Adversarial Networks”. In: International Conference on the
Foundations of Digital Games. 2020, pp. 1–9.
[121] Eric Hambro, Sharada P. Mohanty, Dmitrii Babaev, Minwoo Byeon, Dipam Chakraborty, Edward Grefenstette,
Minqi Jiang, et al. “Insights From the NeurIPS 2021 NetHack Challenge”. In: CoRR abs/2203.11889 (2022).
DOI: 10.48550/arXiv.2203.11889.
[122] Nikolaus Hansen. “Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed”. In:
Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference:
Late Breaking Papers. 2009, pp. 2389–2396.
[123] Nikolaus Hansen. “The CMA evolution strategy: A tutorial”. In: arXiv preprint arXiv:1604.00772 (2016).
[124] Nikolaus Hansen, Anne Auger, Raymond Ros, Steffen Finck, and Petr Pošík. “Comparing Results of 31
Algorithms from the Black-Box Optimization Benchmarking BBOB-2009”. In: July 2010, pp. 1689–1696. DOI:
10.1145/1830761.1830790.
[125] Nikolaus Hansen and Stefan Kern. “Evaluating the CMA evolution strategy on multimodal test functions”. In:
International Conference on Parallel Problem Solving from Nature. Springer. 2004, pp. 282–291.
[126] Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. “Reducing the time complexity of the
derandomized evolution strategy with covariance matrix adaptation (CMA-ES)”. In: Evolutionary computation
11.1 (2003), pp. 1–18.
[127] Nikolaus Hansen and Andreas Ostermeier. “Completely derandomized self-adaptation in evolution strategies”.
In: Evolutionary computation 9.2 (2001), pp. 159–195.
[128] Nikolaus Hansen and Andreas Ostermeier. “Convergence properties of evolution strategies with the
derandomized covariance matrix adaptation: The (μ/μI, λ)-CMA-ES”. In: Eufit 97 (1997), pp. 650–654.
[129] Peter E Hart, Nils J Nilsson, and Bertram Raphael. “A formal basis for the heuristic determination of minimum
cost paths”. In: IEEE transactions on Systems Science and Cybernetics 4.2 (1968), pp. 100–107.
[130] Kris Hauser. “Recognition, prediction, and planning for assisted teleoperation of freeform tasks”. In:
Autonomous Robots 35.4 (2013), pp. 241–254.
[131] Bradley Hayes and Julie A Shah. “Improving robot controller transparency through autonomous policy
explanation”. In: Proceedings of the ACM/IEEE international conference on human-robot interaction. 2017.
[132] Mark Hendrikx, Sebastiaan Meijer, Joeri Van Der Velden, and Alexandru Iosup. “Procedural content generation
for games: A survey”. In: ACM Transactions on Multimedia Computing, Communications, and Applications
(TOMM) 9.1 (2013), pp. 1–22.
[133] Laura V Herlant, Rachel M Holladay, and Siddhartha S Srinivasa. “Assistive teleoperation of robot
arms via automatic time-optimal mode switching”. In: The Eleventh ACM/IEEE International Conference on
Human Robot Interaction. IEEE Press. 2016, pp. 35–42.
[134] Ronald C Hofer, Edward Ramirez, and Scott H Smith. In: (1998).
[135] Guy Hoffman. “Evaluating fluency in human–robot collaboration”. In: IEEE Transactions on Human-Machine
Systems 49.3 (2019), pp. 209–218.
[136] Amy K Hoover, Julian Togelius, Scott Lee, and Fernando de Mesentier Silva. “The Many AI Challenges of
Hearthstone”. In: KI-Künstliche Intelligenz (2019), pp. 1–11.
[137] Ian D Horswill and Leif Foged. “Fast procedural level population with playability constraints”. In: Eighth
Artificial Intelligence and Interactive Digital Entertainment Conference. 2012.
[138] HSReplay. HSReplay. https://hsreplay.net/. Accessed: 2019-11-01.
[139] IBM. IBM ILOG CPLEX Optimization Studio V12.10.0. 2019.
[140] Ali Jahanian, Lucy Chai, and Phillip Isola. “On the “steerability” of generative adversarial networks”. In:
International Conference on Learning Representations (ICLR). 2020.
[141] Ali Jahanian, Lucy Chai, and Phillip Isola. “On the “steerability” of generative adversarial networks”. In: arXiv
preprint arXiv:1907.07171 (2019).
[142] Rishabh Jain, Aaron Isaksen, Christoffer Holmgård, and Julian Togelius. “Autoencoders for level generation,
repair, and recognition”. In: Proceedings of the ICCC Workshop on Computational Creativity and Games. 2016.
[143] Siddarth Jain and Brenna Argall. “Probabilistic human intent recognition for shared autonomy in assistive
robotics”. In: ACM Transactions on Human-Robot Interaction (THRI) 9.1 (2019), pp. 1–23.
[144] Siddarth Jain and Brenna Argall. “Recursive Bayesian human intent recognition in shared-control robotics”. In:
2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2018,
pp. 3905–3912.
[145] Siddarth Jain, Ali Farshchiansadegh, Alexander Broad, Farnaz Abdollahi, Ferdinando Mussa-Ivaldi, and
Brenna Argall. “Assistive robotic manipulation through shared autonomy and a body-machine interface”. In:
2015 IEEE international conference on rehabilitation robotics (ICORR). IEEE. 2015, pp. 526–531.
[146] Nick Jakobi. “Evolutionary Robotics and the Radical Envelope-of-Noise Hypothesis”. In: Adaptive Behavior
(1998). DOI: 10.1177/105971239700600205.
[147] Shervin Javdani, Henny Admoni, Stefania Pellegrinelli, Siddhartha S Srinivasa, and J Andrew Bagnell. “Shared
autonomy via hindsight optimization for teleoperation and teaming”. In: The International Journal of Robotics
Research (2018), p. 0278364918776060.
[148] Shervin Javdani, Siddhartha S. Srinivasa, and J. Andrew Bagnell. “Shared Autonomy via Hindsight
Optimization”. In: Robotics: Science and Systems XI. 2015. DOI: 10.15607/RSS.2015.XI.032.
[149] Hong Jun Jeon, Dylan P Losey, and Dorsa Sadigh. “Shared Autonomy with Learned Latent Actions”. In: arXiv
preprint arXiv:2005.03210 (2020).
[150] Minqi Jiang, Michael Dennis, Jack Parker-Holder, Jakob N. Foerster, Edward Grefenstette, and
Tim Rocktäschel. “Replay-Guided Adversarial Environment Design”. In: Advances in Neural Information
Processing Systems 34. 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/
0e915db6326b6fb6a3c56546980a8c93-Abstract.html.
[151] Yibin Jiang, Daniel Salley, Abhishek Sharma, Graham Keenan, Margaret Mullin, and Leroy Cronin. “An
artificial intelligence enabled chemical synthesis robot for exploration and optimization of nanomaterials”. In:
Science Advances (2022).
[152] Yaochu Jin and Jürgen Branke. “Evolutionary optimization in uncertain environments-a survey”. In: IEEE
Transactions on evolutionary computation 9.3 (2005), pp. 303–317.
[153] Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous univariate distributions,
volume 2. Vol. 289. John Wiley & Sons, 1995.
[154] Niels Justesen, Sebastian Risi, and Jean-Baptiste Mouret. “MAP-Elites for noisy domains by adaptive
sampling”. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion. 2019,
pp. 121–122.
[155] James Kapinski, Jyotirmoy V Deshmukh, Xiaoqing Jin, Hisahiro Ito, and Ken Butts. “Simulation-based
approaches for verification of embedded control systems: An overview of traditional and advanced modeling,
testing, and verification techniques”. In: IEEE Control Systems Magazine 36.6 (2016), pp. 45–64.
[156] Daniel Karavolos, Antonios Liapis, and Georgios N. Yannakakis. “A Multifaceted Surrogate Model for
Search-Based Procedural Content Generation”. In: IEEE Transactions on Games (2021).
[157] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive Growing of GANs for Improved
Quality, Stability, and Variation”. In: International Conference on Learning Representations (ICLR). 2018.
[158] Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial
networks”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019,
pp. 4401–4410.
[159] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. “Analyzing and
Improving the Image Quality of StyleGAN”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). June 2020.
[160] Isaac Karth and Adam M Smith. “Addressing the fundamental tension of PCGML with discriminative learning”.
In: Proceedings of the 14th International Conference on the Foundations of Digital Games. 2019, pp. 1–9.
[161] Isaac Karth and Adam M Smith. “WaveFunctionCollapse is constraint solving in the wild”. In: Proceedings of
the 12th International Conference on the Foundations of Digital Games. 2017, pp. 1–10.
[162] Leon Keller, Daniel Tanneberg, Svenja Stark, and Jan Peters. “Model-Based Quality-Diversity Search for
Efficient Robot Learning”. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS) (2020), pp. 9675–9680.
[163] Paul Kent, Adam Gaier, Jean-Baptiste Mouret, and Juergen Branke. “Bayesian Optimisation for Quality
Diversity Search with coupled descriptor functions”. In: IEEE Transactions on Evolutionary Computation
(2024).
[164] Ahmed Khalifa, Michael Cerny Green, Gabriella Barros, and Julian Togelius. “Intentional computational level
design”. In: Proceedings of The Genetic and Evolutionary Computation Conference. 2019, pp. 796–803.
[165] Ahmed Khalifa, Scott Lee, Andy Nealen, and Julian Togelius. “Talakat: Bullet Hell Generation Through
Constrained Map-elites”. In: Proceedings of the Genetic and Evolutionary Computation Conference. GECCO
’18. Kyoto, Japan: ACM, 2018, pp. 1047–1054. ISBN: 978-1-4503-5618-3. DOI:
10.1145/3205455.3205470.
[166] Steven Orla Kimbrough, Gary J Koehler, Ming Lu, and David Harlan Wood. “On a feasible–infeasible
two-population (fi-2pop) genetic algorithm for constrained optimization: Distance tracing and no free lunch”.
In: European Journal of Operational Research 190.2 (2008), pp. 310–327.
[167] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: 3rd International
Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings. 2015.
[168] Jonathan Kofman, Xianghai Wu, Timothy J Luu, and Siddharth Verma. “Teleoperation of a robot manipulator
using a vision-based human-robot interface”. In: IEEE Transactions on Industrial Electronics 52.5 (2005),
pp. 1206–1219.
[169] Hema S Koppula and Ashutosh Saxena. “Anticipating human activities using object affordances for reactive
robotic response”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.1 (2016), pp. 14–29.
[170] Thibault Kruse, Amit Kumar Pandey, Rachid Alami, and Alexandra Kirsch. “Human-aware robot navigation: A
survey”. In: Robotics and Autonomous Systems 61.12 (2013), pp. 1726–1743.
[171] Personal Robotics Lab. https://github.com/personalrobotics/ada/. 2017.
[172] Personal Robotics Lab.
https://github.com/personalrobotics/ada_assistance_policy/. 2017.
[173] Personal Robotics Lab. https://github.com/personalrobotics/prpy/. 2017.
[174] Thanh Mung Lam, Harmen Wigert Boschloo, Max Mulder, and Marinus M Van Paassen. “Artificial force field
for haptic feedback in UAV teleoperation”. In: IEEE Transactions on Systems, Man, and Cybernetics-Part A:
Systems and Humans 39.6 (2009), pp. 1316–1330.
[175] David H Lee, Anishalakshmi Palaparthi, Matthew C Fontaine, Bryon Tjanaka, and Stefanos Nikolaidis.
“Density Descent for Diversity Optimization”. In: Proceedings of the Genetic and Evolutionary Computation
Conference. 2024, pp. 674–682.
[176] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. “Learning quadrupedal
locomotion over challenging terrain”. In: Science robotics (2020).
[177] Joel Lehman and Kenneth O Stanley. “Revising the evolutionary computation abstraction: minimal criteria
novelty search”. In: Proceedings of the 12th annual conference on Genetic and evolutionary computation. 2010,
pp. 103–110.
[178] Joel Lehman and Kenneth O. Stanley. “Abandoning Objectives: Evolution through the Search for Novelty
Alone”. In: Evolutionary Computation (2011).
[179] Joel Lehman and Kenneth O. Stanley. “Evolving a Diversity of Virtual Creatures through Novelty Search and
Local Competition”. In: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation
(GECCO ‘11). ACM, 2011, pp. 211–218.
[180] Joel Lehman and Kenneth O. Stanley. “Exploiting Open-Endedness to Solve Problems through the Search for
Novelty”. In: Proceedings of the Eleventh International Conference on Artificial Life (Alife XI). 2008,
pp. 329–336.
[181] Ming Li and Allison M Okamura. “Recognition of operator motions for real-time
assistance using virtual fixtures”. In: 11th Symposium on Haptic Interfaces for Virtual Environment and
Teleoperator Systems, 2003. HAPTICS 2003. Proceedings. IEEE. 2003, pp. 125–131.
[182] Qinan Li, Weidong Chen, and Jingchuan Wang. “Dynamic shared control for human-wheelchair
cooperation”. In: 2011 IEEE International Conference on Robotics and Automation. IEEE. 2011,
pp. 4278–4283.
[183] Fei Liang. Tempo Rogue. https://www.hearthstonetopdecks.com/decks/tempo-roguerise-of-shadows-1-legend-etc/. Accessed: 2019-11-01.
[184] Antonios Liapis, Georgios N Yannakakis, and Julian Togelius. “Constrained novelty search: A study on game
content generation”. In: Evolutionary computation 23.1 (2015), pp. 101–129.
[185] Bryan Lim, Luca Grillotti, Lorenzo Bernasconi, and Antoine Cully. “Dynamics-aware quality-diversity for
efficient learning of skill repertoires”. In: 2022 International Conference on Robotics and Automation (ICRA).
IEEE. 2022, pp. 5360–5366.
[186] Michael L Littman, Anthony R Cassandra, and Leslie Pack Kaelbling. “Learning policies for partially
observable environments: Scaling up”. In: Machine Learning Proceedings 1995. Elsevier, 1995, pp. 362–370.
[187] Suyun Liu and Luis Nunes Vicente. “The stochastic multi-gradient algorithm for multi-objective optimization
and its application to supervised machine learning”. In: Annals of Operations Research (2021), pp. 1–30.
[188] Dylan P Losey, Krishnan Srinivasan, Ajay Mandlekar, Animesh Garg, and Dorsa Sadigh. “Controlling assistive
robots with learned latent actions”. In: 2020 IEEE International Conference on Robotics and Automation
(ICRA). IEEE. 2020, pp. 378–384.
[189] Simon M Lucas and Vanessa Volz. “Tile pattern KL-divergence for analysing and evolving game levels”. In:
Proceedings of the Genetic and Evolutionary Computation Conference. 2019, pp. 170–178.
[190] Jim Mainprice, E Akin Sisbot, Léonard Jaillet, Juan Cortés, Rachid Alami, and Thierry Siméon. “Planning
human-aware motions using a sampling-based costmap planner”. In: 2011 IEEE International Conference on
Robotics and Automation. IEEE. 2011, pp. 5012–5017.
[191] Glenn A. Martin. “Automatic Scenario Generation using Procedural Modeling Techniques”. PhD thesis.
University of Central Florida, 2012.
[192] Glenn A. Martin, Charles E. Hughes, Sae Schatz, and Denise Nicholson. “The Use of Functional L-systems for
Scenario Generation in Serious Games”. In: Proceedings of the 2010 Workshop on Procedural Content
Generation in Games. PCGames ’10. Monterey, California: ACM, 2010, 6:1–6:5. ISBN:
978-1-4503-0023-0. DOI: 10.1145/1814256.1814262.
[193] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein Generative Adversarial Networks”. In:
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia. 2017.
[194] Jean-Eudes Marvie, Julien Perret, and Kadi Bouatouch. “FL-system : A Functional L-system for procedural
geometric modeling”. In: The Visual Computer 21 (June 2005), pp. 329–339. DOI:
10.1007/s00371-005-0289-z.
[195] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. “Active domain
randomization”. In: Conference on Robot Learning. PMLR. 2020, pp. 1162–1176.
[196] Karl Meinke and Peter Nycander. “Learning-based testing of distributed microservice architectures:
Correctness and fault injection”. In: SEFM 2015 Collocated Workshops. Springer. 2015, pp. 3–10.
[197] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. “PULSE: Self-supervised photo
upsampling via latent space exploration of generative models”. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. 2020, pp. 2437–2445.
[198] Fernando de Mesentier Silva, Rodrigo Canaan, Scott Lee, Matthew C Fontaine, Julian Togelius, and
Amy K Hoover. “Evolving the hearthstone meta”. In: 2019 IEEE Conference on Games (CoG). IEEE. 2019,
pp. 1–8.
[199] Thomas M. Moerland, Joost Broekens, and Catholijn M. Jonker. “Model-based Reinforcement Learning: A
Survey”. In: CoRR abs/2006.16712 (2020). URL: https://arxiv.org/abs/2006.16712.
[200] Matej Moravcik, Martin Schmid, Neil Burch, Viliam Lisy, Dustin Morrill, Nolan Bard, Trevor Davis,
Kevin Waugh, Michael Johanson, and Michael H. Bowling. “DeepStack: Expert-Level Artificial Intelligence in
No-Limit Poker”. In: Science (2017).
[201] Douglas Morrison, Peter Corke, and Jurgen Leitner. “EGAD! an Evolved Grasping Analysis Dataset for
diversity and reproducibility in robotic manipulation”. In: IEEE Robotics and Automation Letters (2020).
195
[202] Jean-Baptiste Mouret and Jeff Clune. “Illuminating search spaces by mapping elites”. In: arXiv preprint
arXiv:1504.04909 (2015).
[203] Jean-Baptiste Mouret and Glenn Maguire. “Quality diversity for multi-task optimization”. In: Proceedings of
the 2020 Genetic and Evolutionary Computation Conference. 2020, pp. 121–129.
[204] Katharina Muelling, Arun Venkatraman, Jean-Sebastien Valois, John E Downey, Jeffrey Weiss,
Shervin Javdani, Martial Hebert, Andrew B Schwartz, Jennifer L Collinger, and J Andrew Bagnell. “Autonomy
infused teleoperation with application to brain computer interface controlled manipulation”. In: Autonomous
Robots 41 (2017), pp. 1401–1422.
[205] Galen E. Mullins, Paul G. Stankiewicz, R. Chad Hawthorne, and Satyandra K. Gupta. “Adaptive generation of
challenging scenarios for testing and evaluation of autonomous vehicles”. In: Journal of Systems and Software
137 (2018), pp. 197–215. ISSN: 0164-1212. DOI:
https://doi.org/10.1016/j.jss.2017.10.031.
[206] Richard M Murray, Zexiang Li, and S Shankar Sastry. A mathematical introduction to robotic manipulation.
CRC press, 2017.
[207] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. “Policy Invariance Under Reward Transformations: Theory
and Application to Reward Shaping”. In: Proceedings of the Sixteenth International Conference on Machine
Learning. ICML ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 278–287. ISBN:
1-55860-612-2. URL: http://dl.acm.org/citation.cfm?id=645528.657613.
[208] Anh Nguyen, Jason Yosinski, and Jeff Clune. “Deep neural networks are easily fooled: High confidence
predictions for unrecognizable images”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2015, pp. 427–436.
[209] S. Nikolaidis, Y. Zhu, D. Hsu, and S.S. Srinivasa. “Human-Robot Mutual Adaptation in Shared Autonomy”. In:
Proc. Human Robot Interaction Conference. 2017.
[210] Stefanos Nikolaidis, Ramya Ramakrishnan, Keren Gu, and Julie Shah. “Efficient model learning from
joint-action demonstrations for human-robot collaborative tasks”. In: 2015 10th ACM/IEEE International
Conference on Human-Robot Interaction (HRI). IEEE. 2015, pp. 189–196.
[211] Olle Nilsson and Antoine Cully. “Policy gradient assisted MAP-Elites”. In: Proceedings of the Genetic and
Evolutionary Computation Conference (2021).
[212] Jørgen Nordmoen, Kai Olav Ellefsen, and Kyrre Glette. “Combining MAP-Elites and Incremental Evolution to
Generate Gaits for a Mammalian Quadruped Robot”. In: Mar. 2018, pp. 719–733. ISBN: 978-3-319-77537-1.
DOI: 10.1007/978-3-319-77538-8_48.
[213] Jørgen Nordmoen, Eivind Samuelsen, Kai Olav Ellefsen, and Kyrre Glette. “Dynamic Mutation in MAP-Elites
for Robotic Repertoire Generation”. In: Artificial Life Conference Proceedings. MIT Press, 2018, pp. 598–605.
DOI: 10.1162/isal_a_00110.
[214] Matthew O’Brien, Ronald C Arkin, Dagan Harrington, Damian Lyons, and Shu Jiang. “Automatic verification
of autonomous robot missions”. In: International Conference on Simulation, Modeling, and Programming for
Autonomous Robots. Springer. 2014, pp. 462–473.
[215] Overcooked. Ghost Town Games, 2016. URL:
https://store.steampowered.com/app/448510/Overcooked/.
[216] Overcooked 2. Ghost Town Games, 2018. URL:
https://store.steampowered.com/app/728880/Overcooked_2/.
[217] Giuseppe Paolo, Alexandre Coninx, Stéphane Doncieux, and Alban Laflaquière. “Sparse reward exploration via
novelty search and emitters”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2021,
pp. 154–162.
[218] Christos H Papadimitriou and John N Tsitsiklis. “The complexity of Markov decision processes”. In:
Mathematics of operations research 12.3 (1987), pp. 441–450.
[219] Raja Parasuraman, Thomas B Sheridan, and Christopher D Wickens. “Situation awareness, mental workload,
and trust in automation: Viable, empirically supported cognitive engineering constructs”. In: Journal of
cognitive engineering and decision making 2.2 (2008), pp. 140–160.
[220] Jack Parker-Holder, Minqi Jiang, Michael Dennis, Mikayel Samvelyan, Jakob Foerster, Edward Grefenstette,
and Tim Rocktäschel. “Evolving curricula with regret-based environment design”. In: International Conference
on Machine Learning. PMLR. 2022, pp. 17473–17498.
[221] Jack Parker-Holder, Aldo Pacchiano, Krzysztof M Choromanski, and Stephen J Roberts. “Effective diversity in
population based reinforcement learning”. In: Advances in Neural Information Processing Systems 33 (2020),
pp. 18050–18062.
[222] Stefania Pellegrinelli, Henny Admoni, Shervin Javdani, and Siddhartha S. Srinivasa. “Human-robot shared
workspace collaboration via hindsight optimization”. In: Proceedings of the International Conference on
Intelligent Robots and Systems, IROS. 2016. DOI: 10.1109/IROS.2016.7759147.
[223] Victor Perez. Generating Images from Prompts using CLIP and StyleGAN. MIT License. 2021. URL:
https://towardsdatascience.com/.
[224] Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D Gaina, Julian Togelius, and Simon M Lucas.
“General Video Game AI: A Multitrack Framework for Evaluating Agents, Games, and Content Generation
Algorithms”. In: IEEE Transactions on Games 11.3 (2019), pp. 195–214.
[225] Diego Perez-Liebana, Spyridon Samothrakis, Julian Togelius, Tom Schaul, and Simon M Lucas. “General
video game ai: Competition, challenges and opportunities”. In: Thirtieth AAAI Conference on Artificial
Intelligence. 2016.
[226] Thomas Pierrot, Valentin Macé, Felix Chalumeau, Arthur Flajolet, Geoffrey Cideron, Karim Beguir,
Antoine Cully, Olivier Sigaud, and Nicolas Perrin-Gilbert. “Diversity policy gradient for sample efficient
quality-diversity optimization”. In: Proceedings of the Genetic and Evolutionary Computation Conference.
2022, pp. 1075–1083.
[227] Kuang Ping and Luo Dingli. “Conditional Convolutional Generative Adversarial Networks Based Interactive
Procedural Game Map Generation”. In: Future of Information and Communication Conference. Springer. 2020,
pp. 400–419.
[228] Mike Preuss. “Improved Topological Niching for Real-valued Global Optimization”. In: Proceedings of the
2012 European Conference on Applications of Evolutionary Computation. EvoApplications’12.
Málaga, Spain: Springer-Verlag, 2012, pp. 386–395. ISBN: 978-3-642-29177-7. DOI:
10.1007/978-3-642-29178-4_39.
[229] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. “Quality diversity: A new frontier for evolutionary
computation”. In: Frontiers in Robotics and AI 3 (2016), p. 40.
[230] Justin K. Pugh, L. B. Soros, Paul A. Szerlip, and Kenneth O. Stanley. “Confronting the Challenge of Quality
Diversity”. In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation.
GECCO ’15. Madrid, Spain: ACM, 2015, pp. 967–974. ISBN: 978-1-4503-3472-3. DOI:
10.1145/2739480.2754664.
[231] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. “Learning Transferable
Visual Models From Natural Language Supervision”. In: Proceedings of the 38th International Conference on
Machine Learning ICML 2021. PMLR. 2021, pp. 8748–8763.
[232] Nemanja Rakicevic, Antoine Cully, and Petar Kormushev. “Policy Manifold Search: Exploring the Manifold
Hypothesis for Diversity-Based Neuroevolution”. In: Proceedings of the Genetic and Evolutionary
Computation Conference. GECCO ’21. Lille, France: Association for Computing Machinery, 2021,
pp. 901–909. ISBN: 9781450383509. DOI: 10.1145/3449639.3459320.
[233] Pratyusha Rakshit, Amit Konar, and Swagatam Das. “Noisy evolutionary optimization algorithms–a
comprehensive survey”. In: Swarm and Evolutionary Computation 33 (2017), pp. 18–45.
[234] Siddharth Reddy, Anca D Dragan, and Sergey Levine. “Shared autonomy via deep reinforcement learning”. In:
arXiv preprint arXiv:1802.01744 (2018).
[235] Jing Ren, Rajni V Patel, Kenneth A McIsaac, Gerard Guiraudon, and Terry M Peters. “Dynamic 3-D virtual
fixtures for minimally invasive beating heart procedures”. In: IEEE transactions on medical imaging 27.8
(2008), pp. 1061–1070.
[236] Sebastian Risi and Julian Togelius. “Increasing generality in machine learning through procedural content
generation”. In: Nature Machine Intelligence (2020).
[237] E. Rocklage, H. Kraft, A. Karatas, and J. Seewig. “Automated scenario generation for regression testing of
autonomous vehicles”. In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems
(ITSC). Oct. 2017, pp. 476–483. DOI: 10.1109/ITSC.2017.8317919.
[238] Michael Ruchte and Josif Grabocka. “Efficient Multi-Objective Optimization for Deep Learning”. In: arXiv
preprint arXiv:2103.13392 (2021).
[239] Stuart Russell and Peter Norvig. “Artificial intelligence: a modern approach”. In: (2002).
[240] Fereshteh Sadeghi and Sergey Levine. “CAD2RL: Real Single-Image Flight Without a Single Real Image”. In:
Proceedings of Robotics: Science and Systems XIII. 2017. DOI: 10.15607/RSS.2017.XIII.034.
[241] Dorsa Sadigh, S Shankar Sastry, and Sanjit A Seshia. “Verifying robustness of human-aware autonomous cars”.
In: IFAC-PapersOnLine (2019).
[242] Dorsa Sadigh, S Shankar Sastry, Sanjit A Seshia, and Anca Dragan. “Information gathering actions over human
internal state”. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.
2016, pp. 66–73.
[243] Anurag Sarkar and Seth Cooper. “Generating and Blending Game Levels via Quality-Diversity in the Latent
Space of a Variational Autoencoder”. In: arXiv preprint arXiv:2102.12463 (2021).
[244] Jacob Schrum, Vanessa Volz, and Sebastian Risi. “CPPN2GAN: Combining compositional pattern producing
networks and gans for large-scale pattern generation”. In: Proceedings of the 2020 Genetic and Evolutionary
Computation Conference. 2020, pp. 139–147.
[245] Konstantinos Sfikas, Antonios Liapis, and Georgios N. Yannakakis. “Monte Carlo Elites: Quality-Diversity
Selection as a Multi-Armed Bandit Problem”. In: Proceedings of the Genetic and Evolutionary Computation
Conference. GECCO ’21. Lille, France: Association for Computing Machinery, 2021, pp. 180–188. ISBN:
9781450383509. DOI: 10.1145/3449639.3459321.
[246] Noor Shaker, Julian Togelius, and Mark J. Nelson. Procedural Content Generation in Games: A Textbook and
an Overview of Current Research. Springer, 2016.
[247] Ofer M Shir and Thomas Bäck. “Niche radius adaptation in the cma-es niching algorithm”. In: International
Conference on Parallel Problem Solving from Nature. Springer. 2006, pp. 142–151.
[248] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,
Julian Schrittwieser, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature
(2016). DOI: 10.1038/nature16961.
[249] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez,
Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. “A general reinforcement learning
algorithm that masters chess, shogi, and Go through self-play”. In: Science (2018).
[250] Ruben M. Smelik, Tim Tutenel, Rafael Bidarra, and Bedrich Benes. “A Survey on Procedural Modelling for
Virtual Worlds”. In: Comput. Graph. Forum 33.6 (Sept. 2014), pp. 31–50. ISSN: 0167-7055. DOI:
10.1111/cgf.12276.
[251] Adam M Smith and Michael Mateas. “Answer set programming for procedural content generation: A design
space approach”. In: IEEE Transactions on Computational Intelligence and AI in Games 3.3 (2011),
pp. 187–200.
[252] Davy Smith, Laurissa Tokarchuk, and Geraint Wiggins. “Rapid phenotypic landscape exploration through
hierarchical spatial partitioning”. In: International conference on parallel problem solving from nature.
Springer. 2016, pp. 911–920.
[253] Gillian Smith, Jim Whitehead, and Michael Mateas. “Tanagra: Reactive planning and constraint solving for
mixed-initiative level design”. In: IEEE Transactions on computational intelligence and AI in games 3.3 (2011),
pp. 201–215.
[254] Sam Snodgrass and Santiago Ontañón. “Player movement models for platformer game level generation”. In:
Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press. 2017,
pp. 757–763.
[255] Sam Snodgrass and Santiago Ontañón. “Experiments in map generation using Markov chains.” In: FDG. 2014.
[256] Nathan Sorenson and Philippe Pasquier. “Towards a generic framework for automated video game level
creation”. In: European conference on the applications of evolutionary computation. Springer. 2010,
pp. 131–140.
[257] Kenneth O Stanley. “Compositional Pattern Producing Networks: A Novel Abstraction of Development”. In:
Genetic programming and evolvable machines 8.2 (2007), pp. 131–162.
[258] Kenneth O Stanley, Nick Cheney, and LB Soros. “How the strictness of the minimal criterion impacts
open-ended evolution”. In: ALIFE 2016, the Fifteenth International Conference on the Synthesis and
Simulation of Living Systems. MIT Press. 2016, pp. 208–215.
[259] Kirby Steckel and Jacob Schrum. “Illuminating the Space of Beatable Lode Runner Levels Produced by
Various Generative Adversarial Networks”. In: Proceedings of the Genetic and Evolutionary Computation
Conference Companion. GECCO ’21. Lille, France: Association for Computing Machinery, 2021, pp. 111–112.
ISBN: 9781450383516. DOI: 10.1145/3449726.3459440.
[260] Aaron Steinfeld, Odest Chadwicke Jenkins, and Brian Scassellati. “The oz of wizard: simulating the human for
interaction research”. In: Proceedings of the 4th ACM/IEEE international conference on Human robot
interaction. 2009, pp. 101–108.
[261] Nathan Sturtevant, Nicolas Decroocq, Aaron Tripodi, and Matthew Guzdial. “The unexpected consequence of
incremental design changes”. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive
Digital Entertainment. 2020.
[262] Adam Summerville and Michael Mateas. “Super mario as a string: Platformer level generation via lstms”. In:
arXiv preprint arXiv:1603.00930 (2016).
[263] Adam Summerville, Sam Snodgrass, Matthew Guzdial, Christoffer Holmgård, Amy K Hoover, Aaron Isaksen,
Andy Nealen, and Julian Togelius. “Procedural content generation via machine learning (pcgml)”. In:
IEEE Transactions on Games 10.3 (2018), pp. 257–270.
[264] Maciej Świechowski, Tomasz Tajmajer, and Andrzej Janusz. “Improving Hearthstone AI by Combining MCTS
and Supervised Learning Algorithms”. In: 2018 IEEE Conference on Computational Intelligence and Games
(CIG). IEEE, 2018, pp. 1–8.
[265] Andrea Thomaz, Guy Hoffman, and Maya Cakmak. “Computational human-robot interaction”. In: Foundations
and Trends in Robotics 4.2-3 (2016), pp. 105–223.
[266] Bryon Tjanaka, Matthew C Fontaine, David H Lee, Yulun Zhang, Nivedit Reddy Balam, Nathaniel Dennler,
Sujay S Garlanka, Nikitas Dimitri Klapsis, and Stefanos Nikolaidis. “pyribs: A bare-bones python library for
quality diversity optimization”. In: Proceedings of the Genetic and Evolutionary Computation Conference.
2023, pp. 220–229.
[267] Bryon Tjanaka, Matthew C Fontaine, Julian Togelius, and Stefanos Nikolaidis. “Approximating gradients for
differentiable quality diversity in reinforcement learning”. In: Proceedings of the Genetic and Evolutionary
Computation Conference. 2022, pp. 1102–1111.
[268] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. “Domain
randomization for transferring deep neural networks from simulation to the real world”. In: Proceedings of the
IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS. 2017. DOI:
10.1109/IROS.2017.8202133.
[269] Julian Togelius, Sergey Karakovskiy, and Robin Baumgarten. “The 2009 Mario AI Competition”. In:
Proceedings of the IEEE Congress on Evolutionary Computation, CEC. 2010. DOI:
10.1109/CEC.2010.5586133.
[270] Julian Togelius, Georgios N Yannakakis, Kenneth O Stanley, and Cameron Browne. “Search-based procedural
content generation: A taxonomy and survey”. In: IEEE Transactions on Computational Intelligence and AI in
Games 3.3 (2011), pp. 172–186.
[271] Ruben Rodriguez Torrado, Ahmed Khalifa, Michael Cerny Green, Niels Justesen, Sebastian Risi, and
Julian Togelius. “Bootstrapping Conditional GANs for Video Game Level Generation”. In: IEEE Conference
on Games. 2020.
[272] Pete Trautman. “Assistive planning in complex, dynamic environments: a probabilistic approach”. In: 2015
IEEE International Conference on Systems, Man, and Cybernetics. IEEE. 2015, pp. 3072–3078.
[273] Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger.
“Deep feature interpolation for image content changes”. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. 2017, pp. 7064–7073.
[274] Vassilis Vassiliades, Konstantinos Chatzilygeroudis, and Jean-Baptiste Mouret. “Using Centroidal Voronoi
Tessellations to Scale Up the Multidimensional Archive of Phenotypic Elites Algorithm”. In: IEEE
Transactions on Evolutionary Computation 22.4 (2018), pp. 623–630. DOI:
10.1109/TEVC.2017.2735550.
[275] Vassilis Vassiliades and Jean-Baptiste Mouret. “Discovering the Elite Hypervolume by Leveraging Interspecies
Correlation”. In: Proceedings of the Genetic and Evolutionary Computation Conference. 2018, pp. 149–156.
[276] Eduardo Veras, Karan Khokar, Redwan Alqasemi, and Rajiv Dubey. “Scaled telerobotic control of a
manipulator in real time with laser assistance for ADL tasks”. In: Journal of the Franklin Institute
349.7 (2012), pp. 2268–2280.
[277] Roman Vershynin. Random Vectors in High Dimensions, pp. 38–69. Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press, 2018.
[278] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung,
David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. “Grandmaster Level in StarCraft II Using
Multi-Agent Reinforcement Learning”. In: Nature (2019).
[279] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung,
David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. “Grandmaster level in StarCraft II using
multi-agent reinforcement learning”. In: Nature 575.11 (2019), pp. 350–354.
[280] Vanessa Volz, Jacob Schrum, Jialin Liu, Simon M Lucas, Adam Smith, and Sebastian Risi. “Evolving mario
levels in the latent space of a deep convolutional generative adversarial network”. In: Proceedings of the
Genetic and Evolutionary Computation Conference. 2018, pp. 221–228.
[281] Rose E Wang, Sarah A Wu, James A Evans, Joshua B Tenenbaum, David C Parkes, and Max Kleiman-Weiner.
“Too many cooks: Bayesian inference for coordinating multi-agent collaboration”. In: arXiv e-prints (2020),
arXiv–2003.
[282] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. “POET: open-ended coevolution of environments
and their optimized solutions”. In: Proceedings of the Genetic and Evolutionary Computation Conference.
GECCO ’19. Prague, Czech Republic: Association for Computing Machinery, 2019, pp. 142–151. ISBN:
9781450361118. DOI: 10.1145/3321707.3321799.
[283] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeffrey Clune, and Kenneth Stanley. “Enhanced
poet: Open-ended reinforcement learning through unbounded invention of learning challenges and their
solutions”. In: International conference on machine learning. PMLR. 2020, pp. 9940–9951.
[284] Po-Wei Wang, Priya L Donti, Bryan Wilder, and Zico Kolter. “SATNet: Bridging deep learning and logical
reasoning using a differentiable satisfiability solver”. In: (2019).
[285] Zhikun Wang, Katharina Mülling, Marc Peter Deisenroth, Heni Ben Amor, David Vogt, Bernhard Schölkopf,
and Jan Peters. “Probabilistic movement modeling for intention inference in human-robot interaction”. In: The
International Journal of Robotics Research 32.7 (2013), pp. 841–858.
[286] Tom White. “Sampling generative networks”. In: arXiv preprint arXiv:1609.04468 (2016).
[287] Bryan Wilder, Bistra Dilkina, and Milind Tambe. “Melding the data-decisions pipeline: Decision-focused
learning for combinatorial optimization”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 33. 2019, pp. 1658–1665.
[288] Peter R Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Subramanian,
Thomas J Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Florian Fuchs, et al. “Outracing
champion Gran Turismo drivers with deep reinforcement learning”. In: Nature (2022).
[289] Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, and Martin Muller. “Learning to Combat ¨
Compounding-Error in Model-Based Reinforcement Learning”. In: CoRR abs/1912.11206 (2019). URL:
http://arxiv.org/abs/1912.11206.
[290] Ruolin Ye, Wenqiang Xu, Haoyuan Fu, Rajat Kumar Jenamani, Vy Nguyen, Cewu Lu,
Katherine Dimitropoulou, and Tapomayukh Bhattacharjee. “RCare World: A Human-centric Simulation World
for Caregiving Robots”. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). 2022.
[291] Hejia Zhang, Matthew Fontaine, Amy Hoover, Julian Togelius, Bistra Dilkina, and Stefanos Nikolaidis. “Video
game level repair via mixed integer linear programming”. In: Proceedings of the AAAI Conference on Artificial
Intelligence and Interactive Digital Entertainment. Vol. 16. 1. 2020, pp. 151–158.
[292] Yulun Zhang, Matthew C Fontaine, Amy K Hoover, and Stefanos Nikolaidis. “Deep surrogate assisted
map-elites for automated hearthstone deckbuilding”. In: Proceedings of the Genetic and Evolutionary
Computation Conference. 2022, pp. 158–167.
[293] Yulun Zhang, Bryon Tjanaka, Matthew C. Fontaine, and Stefanos Nikolaidis. “Illuminating the Latent Space of
an MNIST GAN”. In: pyribs.org (2021). MIT License. URL:
https://docs.pyribs.org/en/stable/tutorials/lsi_mnist.html.
[294] Qianchuan Zhao, Bruce H Krogh, and Paul Hubbard. “Generating test inputs for embedded control systems”.
In: IEEE Control Systems Magazine 23.4 (2003), pp. 49–57.
[295] Yilun Zhou, Serena Booth, Nadia Figueroa, and Julie Shah. “RoCUS: Robot Controller Understanding via
Sampling”. In: Proceedings of the Conference on Robot Learning. 2021. URL:
https://proceedings.mlr.press/v164/zhou22a.html.
[296] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. “Maximum Entropy Inverse
Reinforcement Learning.” In: Proc. AAAI Conference on Artificial Intelligence. 2008, pp. 1433–1438.
[297] Alexander Zook, Stephen Lee-Urban, Mark O Riedl, Heather K Holden, Robert A Sottilare, and
Keith W Brawner. “Automated scenario generation: toward tailored and optimized military training in virtual
environments”. In: Proceedings of the international conference on the foundations of digital games. 2012,
pp. 164–171.
Abstract
As robots become more complex and enter our daily lives, the interaction between humans and robots will also become more complex. Designers of robotic systems will increasingly struggle to anticipate how a robot will act in different environments and with different users. The de facto standard for evaluating human-robot interaction has been human subjects experiments. However, under typical time and resource constraints, research labs can run at most hundreds of trials when evaluating a new system, limiting the variety of interaction scenarios each experiment can cover. As a complement to human subjects experiments, this dissertation proposes algorithmically generating scenarios to evaluate human-robot interaction systems, where a scenario consists of an environment and a simulated human. An environment consists of the initial object locations in a scene and the initial configuration of the robot, while simulated humans are represented as parameterized agents that produce actions humans may take. The HRI field evaluates algorithms as closed-loop systems, where human behavior affects the robot's actions and vice versa. For generality, my proposed systems treat the robotic system as a black box and search over the scenario parameters to automatically find novel failures. By thoroughly evaluating such algorithms in simulation, researchers can scale to more complex HRI algorithms, and industry will be able to better understand the capabilities of proposed HRI systems.
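To make this search loop concrete, the sketch below shows one way a quality diversity algorithm can search scenario parameters while treating the robotic system as a black box. It is a minimal illustration rather than the exact pipeline developed in this dissertation: it assumes the pyribs library's GridArchive, EvolutionStrategyEmitter, and Scheduler API, and the simulate() function, the scenario dimensionality, and the two scenario measures are hypothetical placeholders.

import numpy as np
from ribs.archives import GridArchive
from ribs.emitters import EvolutionStrategyEmitter
from ribs.schedulers import Scheduler

SCENARIO_DIM = 10  # number of scenario parameters (hypothetical)

def simulate(scenario):
    # Hypothetical black-box evaluation: run the HRI system on one scenario and
    # return (objective, measures). The objective could be, e.g., time to task
    # completion (higher means a harder scenario for the robot), and the two
    # measures could characterize the scenario, e.g., human rationality and clutter.
    objective = float(np.linalg.norm(scenario))  # placeholder computation
    measures = scenario[:2]                      # placeholder measures
    return objective, measures

# Archive over a 2D measure space; each cell keeps the best scenario found there.
archive = GridArchive(solution_dim=SCENARIO_DIM, dims=[25, 25],
                      ranges=[(-1.0, 1.0), (-1.0, 1.0)])
emitters = [EvolutionStrategyEmitter(archive, x0=np.zeros(SCENARIO_DIM),
                                     sigma0=0.2, batch_size=30)]
scheduler = Scheduler(archive, emitters)

for _ in range(100):  # standard QD ask/tell loop
    scenarios = scheduler.ask()
    objectives, measures = zip(*(simulate(s) for s in scenarios))
    scheduler.tell(objectives, measures)

print(archive.stats.num_elites, "diverse scenarios found")

Under these assumptions, each occupied archive cell holds a qualitatively different scenario, so a designer can inspect failures that vary along the chosen measures rather than only the single worst case.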