SAMPLE-EFFICIENT AND ROBUST NEUROSYMBOLIC
LEARNING FROM DEMONSTRATIONS
by
Aniruddh Gopinath Puranic
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Aniruddh Gopinath Puranic
Dedication
I dedicate this thesis to my parents and my special family members.
Acknowledgements
The completion of this dissertation was made possible through the invaluable support and encouragement
of my collaborators, mentors, friends, and family. I am indebted to my advisors Jyotirmoy V. Deshmukh and
Stefanos Nikolaidis for their profound impact on my learning and achievements. Their emphasis on technical precision, experimental prowess, and effective communication has profoundly influenced the quality
of my work. I am sincerely thankful for their unwavering commitment to nurturing my development as a
researcher, and I consider myself incredibly fortunate to have had them guide me. Additionally, I express
my heartfelt gratitude to my committee members, Gaurav Sukhatme, Stephen Tu, Rahul Jain, Mukund
Raghothaman, Somil Bansal and Julie Shah (MIT), for their invaluable insights and feedback, which have
significantly contributed to shaping the content of my thesis. I would also like to thank my collaborators:
Dr. Andrew Hung for giving me the incredible opportunity to work with the da Vinci Surgical System at
the USC Keck School of Medicine; and Lane Desborough for his insightful discussions and mentorship in
my early work on developing safe and robust controllers for automated insulin delivery systems. I wish
to extend my appreciation to all the members of CPS-VIDA and ICAROS who have actively participated
in discussions related to my research endeavors. Last but not least, I express my heartfelt gratitude to my
parents for their unwavering support.
I would like to thank various organizations for providing my advisor with the funding for my research, including the National Science Foundation through the following grants: NSF CAREER award
(SHF-2048094), CNS-1932620, CNS-2039087, FMitF-1837131, CCF-SHF-1932620, and the Toyota Research
Institute North America through the USC Center for Autonomy and AI.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Contributions and Organization
Chapter 2: Concepts and Mathematical Representations
2.1 Markov Decision Processes
2.2 Signal Temporal Logic
2.2.1 Quantitative Semantics
2.3 Directed Acyclic Graphs
Chapter 3: Temporal Logic-Guided Reward Inference
3.1 LfD-STL Framework
3.2 Specification Ranking
3.3 Reward Inference Mechanism
3.3.1 Reward Assignments in Discrete Environments
3.3.1.1 Special Cases
3.3.1.2 Global Reward Function
3.3.2 Reward Assignments in Continuous Environments
3.4 Learning Policies from Inferred Rewards
3.5 Experiments
3.5.1 Discrete-Space Environments
3.5.2 Continuous-Space Environments
3.5.3 Discussion and Comparisons
3.6 Summary
3.7 Bibliographic Notes
Chapter 4: Learning Performance Graphs from Demonstrations
4.1 PeGLearn Framework
4.1.1 PeGLearn Algorithm Overview
4.1.2 Generating local graphs
4.1.3 Aggregation of local graphs
4.1.4 Conversion/Reduction to weighted DAG
4.2 Experiments
4.3 Summary
4.4 Bibliographic Notes
Chapter 5: Apprenticeship Learning with STL
5.1 AL-STL Framework
5.1.1 Performance Graphs as the Optimization Objective
5.1.2 Framework and Algorithm
5.1.2.1 Frontier Update Strategies
5.1.2.2 Policy Improvement Analysis
5.1.2.3 Effect of Affine Transformations to Rewards
5.2 Experiments
5.3 Summary
5.4 Bibliographic Notes
Chapter 6: Conclusions and Future Work
6.1 Future Work
Bibliography
List of Tables
2.1 Quantitative Semantics of STL
3.1 Quantitative comparisons between MCE-IRL and our method for different environments.
4.1 D4PG hyperparameters for the MiR100 safe mobile robot navigation environment.
5.1 Hyperparameters for all tasks used to evaluate AL-STL.
List of Figures
2.1 Demonstrations in a grid-world.
2.2 Weights on nodes in a DAG.
3.1 Ideal framework for integrating LfD and STL to infer reward functions and robot policy.
3.2 PyGame user-interface.
3.3 Results for 5 × 5 grid-world. (a)-(c): Left figures represent learned rewards. Right figures show the grid-world with start state (light blue), goal (dark blue), obstacles (red) and demonstration/policy (green). (e) MCE-IRL rewards with over 40 optimal demonstrations.
3.4 Results for 5 × 5 grid with 20% stochasticity: Inferred rewards are shown in left figures. Right figures show the grid-world with start state (light blue), goal (dark blue), obstacles (red) and demonstration/policy (green). (e) shows the rewards extracted by MCE-IRL using 300 optimal demonstrations.
3.5 Results for 4 × 4 FrozenLake on train and test maps.
3.6 Results for 8 × 8 FrozenLake. Left subfigures represent the reward and the right subfigures show the environment and policy.
3.7 (a) and (b): The left figures represent the simulator reward (1 at goal and 0 elsewhere) while the right figures show the rewards based on STL specification. (c) Rewards inferred from demonstrations. Note: In all the figures, the axes represent the cell numbers corresponding to the grid size.
3.8 Statistics indicating the exploration rate of each algorithm as well as rewards accumulated in each training episode.
3.9 Comparisons of LfD+STL with hand-crafted rewards+Q-Learning for OpenAI Gym environments. (a) and (b) pertain to FrozenLake and (c) pertains to Mountain Car.
3.10 Results for 7 × 7 sequential goal grid-world.
3.11 The car always starts in the top-left corner and the task is to navigate to the goal (in yellow) while possibly avoiding potholes or obstacles (purple). A sample demonstration is shown by the green trajectory. A possible ground truth reward function is +10 anywhere in the yellow region and -5 in the purple region.
3.12 (a)-(d): Robustness of each demonstration w.r.t. the STL specifications. Blue trajectories indicate positive robustness and red indicate negative. (e): Final rewards based on cumulative robustness and demonstration ranking. (f): Reward approximation using neural networks. The yellow-shaded region represents the workspace of the agent, i.e., it is not allowed to leave that region.
3.13 Results showing effects of hyperparameters on cumulative rewards.
3.14 Results showing effects of hyperparameters on predicted rewards.
3.15 Comparing rewards with ground truth and state-of-the-art MCE-IRL.
3.16 Ground-truth (GT) rewards and rewards extracted by ME-IRL and MCE-IRL, respectively, each using 300 optimal demonstrations.
4.1 Overview of PeGLearn algorithm.
4.2 Example local graph for a demonstration.
4.3 Example global graph from 2 demonstrations.
4.4 Demonstrations collected for the 2D car simulator.
4.5 Results for the 2-D autonomous driving simulator. Baseline rewards are from user-defined DAG.
4.6 Results for the MiR100 navigation environment. (a) Obstacles and boundary walls are shown in red point clouds; green sphere is the goal.
4.7 Teleoperation demonstrations in CARLA.
4.8 A frame from one of the survey videos.
4.9 DAGs for the CARLA simulator experiment.
4.10 Comparison of specification orderings between humans and PeGLearn.
4.11 DAGs for the Knot-Tying task. (a)-(c) DAG for each level of expertise: Experts, Intermediates and Novices respectively. (d) DAG for all surgeons, without discriminating expertise levels.
5.1 AL-STL Framework with Performance-Graph Advantage.
5.2 Overview of the robot simulation environments. The task in (d) uses the Nvidia Isaac simulator.
5.3 AL-STL results for the 4x4 and 8x8 FrozenLake environments.
5.4 Summary of training and evaluations for the pose-reaching tasks.
5.5 Summary of training and evaluations for the Cube-Placing task.
5.6 Summary of training and evaluations for the Door-Opening task.
5.7 Summary of training and evaluations for the Safe Mobile Navigation task.
5.8 Summary of training and evaluations for the Safe FreightFranka Cabinet Drawer task.
Abstract
Learning-from-demonstrations (LfD) is a popular paradigm to obtain effective robot control policies for
complex tasks via reinforcement learning (RL) without the need to explicitly design reward functions.
However, it is susceptible to imperfections in demonstrations and also raises concerns of safety and interpretability in the learned control policies. To address these issues, this thesis develops a neurosymbolic
learning framework which is a hybrid method that integrates neural network-based learning with symbolic (e.g., rule, logic, graph) reasoning to leverage the strengths of both approaches. Specifically, this
framework uses Signal Temporal Logic (STL) to express high-level robotic tasks and its quantitative semantics to evaluate and rank the quality of demonstrations. Temporal logic-based specifications enable the
creation of non-Markovian rewards, and are capable of defining interesting causal dependencies between
tasks such as sequential task specifications. This dissertation first presents the LfD-STL framework that
learns from even suboptimal/imperfect demonstrations and STL specifications to infer reward functions;
these reward functions can then be used by reinforcement learning algorithms to obtain control policies.
Experimental evaluations on several diverse sets of environments show that the additional information in
the form of formally-specified task objectives allows the framework to outperform prior state-of-the-art
LfD methods.
Many real-world robotic tasks consist of multiple objectives (specifications), some of which may be
inherently competitive, thus prompting the need for deliberate trade-offs. This dissertation then further
extends the LfD-STL framework by developing a metric - the performance graph - which is a directed graph
that utilizes the quality of demonstrations to provide intuitive explanations about the performance and
trade-offs of demonstrated behaviors. This performance graph also offers concise insights into the learning
process of the RL agent, thereby enhancing interpretability, as corroborated by a user study. Finally, the
thesis discusses how the performance graphs can be used as an optimization objective to guide RL agents
to potentially learn policies that perform better than the (imperfect) demonstrators via apprenticeship
learning (AL). The theoretical machinery developed for the AL-STL framework examines the guarantees
on safety and performance of RL agents.
Chapter 1
Introduction
Conventionally, robots have been developed and programmed for single, specific purposes, as in manufacturing, warehouses, etc. Programming such robots requires domain expertise to understand the capabilities of the robot, the task, and the environment. Such robots do not generalize to other tasks, i.e., they need to be reprogrammed for newer tasks. The rapid evolution of Artificial Intelligence (AI) aims to provide such generalization, and robots such as autonomous vehicles, assistive robots, and medical/surgical robots are increasingly becoming commonplace. Another contribution of AI is to alleviate the difficulty of programming robots for individual tasks and extracting control policies, which would otherwise be tedious.
In human-robot interaction (HRI), one of the emerging AI methods to design control policies for robots
is the paradigm of learning-from-demonstrations (LfD) [11, 104]. In LfD, a robot learns by observing a
human or another robot perform a task - either by inferring a direct mapping from observations to actions
or inferring the human intentions/behaviors. Demonstrations are usually provided in 3 ways: (i) kinesthetic
teaching - the human hand-holds the robot, and hence its actuators, through a task, (ii) teleoperation - the
human controls a robot via remote controller (e.g., a joystick) to perform a task, and (iii) visual learning -
human demonstrations are recorded directly using cameras (e.g., YouTube videos) and/or motion capture
sensors. The robot is required to learn the correspondence between human actions and the robot’s own
body and actuators, and finally infer intended behaviors.
The LfD algorithms can be broadly classified into two main categories based on the characteristic of
the demonstrator they model as follows:
1. Mimicking actions: The methods in this category are also referred to as imitation learning (IL). The
goal here is to directly mimic the actions of the demonstrator or teacher, i.e., the method learns a
direct mapping from states to actions. Behavior cloning via supervised learning [110], DAgger [101],
adversarial imitation learning [52], etc. are some popular algorithms in this group.
2. Mimicking intent: These methods model the overall intent or goal of the demonstrators via reward
functions, from which a policy that maps states to actions is inferred using reinforcement learning
(RL). The reward functions can either be: (i) explicitly defined and demonstrations will facilitate
efficient initialization of the RL policy [83], or (ii) inferred via inverse reinforcement learning (IRL)
[86, 123], from which a control policy can be extracted by apprenticeship learning (AL) [2].
At its core, LfD provides a mechanism to indirectly provide specifications on the expected behaviors of
a robot, and to learn a control policy from these specifications. LfD can also address the issue of designing rewards for multiple objectives, as demonstrations show how to perform trade-offs among the objectives.
However, there are methodological limitations to the prevalent LfD paradigm:
1. Incompleteness: A demonstration is an inherently incomplete and implicit specification of the
robot behavior in a specific fixed initial configuration or in the presence of a single disturbance
profile (i.e., stochasticity in the environment or the agent's actions). The control policy that is inferred from
a demonstration may thus perform unsafe or undesirable actions when the initial configuration or
disturbance profile is different [57]. Thus, learning from demonstrations lacks robustness.
2. Quality: Not all demonstrations are equal - some demonstrations are a better indicator of the desired
behavior than others [53], and the quality of a demonstration often depends on the expertise of
the user providing the demonstration [99]. There is also a lack of metrics to evaluate the quality of
demonstrations on tasks [90, 57].
3. Ambiguity: Demonstrations have no way of explicitly specifying safety conditions for the robot,
and safely providing a demonstration itself takes a lot of skill [57, 99]. Moreover, there may be many
optimal demonstrations, each trying to optimize a particular objective (e.g., user preference).
4. Scarcity: Generally, human demonstrations are expensive to obtain and time-consuming, let alone
optimal ones. Many state-of-the-art LfD techniques infer a reward function by either treating suboptimal demonstrations as outliers in the presence of a majority of optimal ones or simply discarding them.
5. Non-trivial reward design: An important problem to address when designing and training RL
agents is the design of state-based reward functions [109] as a means to incorporate knowledge of
the goal and the environment model. As reward functions are mostly handcrafted and tuned, poorly
designed reward functions can lead to reward hacking and the RL algorithm learning a policy that
produces undesirable or unsafe behaviors, or simply to a task that remains incomplete [69, 8]. Generally, manual reward design requires domain expertise, and it is not trivial to recover rewards [99].
In general, neural network-based learning is data inefficient, and lacks logical reasoning and transparency due to the convoluted network architectures. On the other hand, symbolic AI [102] comprising
rules, logic, graphs, etc., has been previously used mainly (i) for explicit representation of knowledge to
enable reasoning, deduction and manipulation of task information, (ii) to facilitate explainability and transparency of decision-making algorithms, and (iii) to improve data efficiency of such algorithms. However,
in contrast to neural network learning, symbolic AI lacks the ability to process large amounts of unstructured and noisy data, and extract complex patterns in the data. Furthermore, symbolic learning methods
are computationally expensive as they typically involve exhaustive search techniques to converge to a
solution.
There has been a recent surge in developments in the area of neurosymbolic AI [63, 44, 16, 78] to
address the limitations of both neural and symbolic learning methods. Neurosymbolic AI is a sub-area
of AI that integrates the mechanisms of neural network-based learning and symbolic reasoning, thereby
inheriting the advantages of its individual components and alleviating their deficiencies. In RL-driven
robotics, high-level task descriptions using logical specifications (rather than manually defined rewards),
and approaches that use task descriptions expressed in temporal logic are gaining popularity [45, 73, 72,
7, 114]. Drawing inspiration from these works and to overcome the limitations of LfD, we develop a
neurosymbolic framework that unifies the LfD paradigm with temporal logic to learn reward functions and
robot control policies. In this framework, the user provides demonstrations and also defines the partial
task specifications in a mathematically precise and unambiguous formal language. Optionally, the user
can also provide (relative) priorities of specifications to which the rewards should be biased. We use the
formalism of Signal Temporal Logic (STL) [77] as the specification language of choice, but our framework is
flexible to allow other kinds of formalism. STL has seen growing interest in its use for designing, reasoning about,
and verifying cyber-physical system applications [33, 118, 15]. Additionally, we focus on demonstrations
that are obtained via kinesthetic teaching or teleoperation. The demonstrations implicitly convey intended
behaviors of the user and robot, i.e., demonstrations can be interpreted as partial specifications for the
robot behavior, as well as a representation of the partial (possibly sub-optimal or incorrect) control policy.
On the other hand, temporal logic specifications represent high-level mission objectives for the robot, but
do not indicate how to achieve the objectives. They can also encode information about the environment
such as rules to be obeyed by the agent. In general, the STL specifications tell the agent “what to do”,
while demonstrations and the rewards obtained by evaluating the quality of demonstrations tell the agent
“how to do it”. Essentially a formula in STL is evaluated over a temporal behavior of the system (e.g. a
multi-dimensional signal consisting of the robot’s position, joint angles, angular velocities, linear velocity
etc.). STL allows Boolean satisfaction semantics: a behavior satisfies a given formula or violates it. A more
useful feature of STL in the context of our work is its quantitative semantics that define how robustly a
signal satisfies a formula or define a signed distance of the signal to the set of signals satisfying the given
formula [29, 38].
Learning accurate cost or reward functions requires certain assumptions about the task and environment (e.g., distance metrics used in the environment space, safe distance for navigation, etc.) [99], and
STL is one of the ways of defining properties of tasks and environments. We use STL specifications for
two distinct purposes: (i) to evaluate and automatically rank demonstrations based on their fitness w.r.t.
the specifications, and (ii) to infer rewards to be used in an RL procedure used to train the control policy.
We remark that STL does not define the entire reward function, but only some parts or sub-spaces of it.
Furthermore, as STL is a class of symbolic AI, synthesizing control policies directly from specifications is
computationally very expensive [27] and hence our framework uses demonstrations to address this issue.
We present a novel way of estimating the quality of a demonstration over a set of specifications by representing the specifications in a directed acyclic graph to encode the relative priorities among them. The
key insight of this work is that even partial STL specifications enable a mechanism to automatically evaluate and rank demonstrations, thereby inferring rewards to be used in an RL setting and leading to robust control policies. The ultimate objective of this work is to provide a framework for a flexible, structured reward function formulation.
1.1 Contributions and Organization
We first present the theoretical machinery in Chapter 2, that formalizes the mathematical concepts and
definitions used in the remainder of this thesis. Specifically, we establish the tools for modeling the interaction between the learner agent and the environment, and encoding the environment-related tasks.
In Chapter 3, we propose a framework for LfD using STL specifications to infer rewards without the
necessity for optimal or perfect demonstrations, that can be used by the robot to find a policy using appropriate off-the-shelf RL algorithms with slight modifications. The framework uses manually-ranked
specifications and evaluates demonstrations based on their robustness w.r.t. those task specifications. A
cumulative robustness is inferred for each demonstration, based on which they are ranked. Incremental
fractional rewards are assigned to the regions of observed states which form the reward function. Finally,
RL is performed on this reward function to extract a control policy that is consistent with the temporal
logic specifications.
Chapter 4 presents a scalable extension to the LfD-STL framework from Chapter 3 that can automatically infer user performance and preferences via performance graphs. These graphs also model conflicting
task specifications that explain the trade-offs in the demonstrated behaviors. Through a user study, we
show that this new algorithm, PeGLearn, can enhance the transparency of robot behaviors.
As the final theoretic contribution, in Chapter 5, we develop a mechanism that uses the performance
graphs from Chapter 4 as optimization objectives to guide the RL agent to seek behaviors that can extrapolate beyond the demonstrations. This mechanism allows reasoning about the performance of the RL agent
and convergence to the optimal reward function and policy.
In all chapters, through numerous diverse sets of robotic simulation experiments, we show that our
methods can learn from only a handful of even suboptimal (imperfect or non-expert) demonstrations and
high-level specifications, while ensuring satisfaction of the logic-based task specifications, thus leveraging
the benefits of neurosymbolic AI. Finally, Chapter 6 summarizes the contributions of this dissertation and
discusses future research directions.
Chapter 2
Concepts and Mathematical Representations
In this chapter, we will describe the terminologies and mathematical concepts that will be used in the
remainder of the dissertation.
2.1 Markov Decision Processes
Reward/cost functions form the central concept of Reinforcement Learning (RL). This section will provide
a background of RL and its mathematical definition. An RL agent learns by trial-and-error by continuously
interacting with its surrounding environment, and is assumed to obey the Markov property. The Markov
property is a property of a transition system wherein the future evolution does not depend on its history.
In other words, the next state in the environment only depends on the RL agent’s current state. This
interaction is modeled via Markov Decision Processes.
Definition 2.1.1 (Markov Decision Process). A Markov Decision Process (MDP) is a tuple M = ⟨S, A, T(·, ·, ·), R(·)⟩, where
• S ⊂ R^k is the state space of the system.
• A ⊂ R^l is the action space of the system; the actions can be discrete or real-valued (continuous), and can be finite or infinite.
• T : S × A × S → [0, 1] is the transition function (a probability distribution). It maps a transition (st, at, st+1) to a probability, i.e., T(st, at, st+1) = Pr(st+1 | st, at).
• R is a reward function that typically maps a state s ∈ S, a state-action pair (s, a) ∈ S × A, or a transition (st, at, st+1) ∈ S × A × S to R.
An MDP has an optional discount factor denoted by γ. In RL, the goal of the learning algorithm is to find a policy π : S × A → [0, 1] that maximizes the total (discounted) reward from performing actions on an MDP, i.e., the objective is to maximize Σ_{t=0}^{∞} γ^t rt, where rt = R(st) or R(st, at, st+1). Throughout this dissertation, we assume full observation of the state space for agents operating in known environments.
Definition 2.1.2 (Trajectory or Episode Rollout). A trajectory in an MDP M is a sequence of state-action pairs of finite length L ∈ N obtained by following some policy π from an initial state s0, i.e., a trajectory τ = ⟨(si, π(si))⟩ for i = 0, . . . , L − 1, where si ∈ S and ai ∈ A.

In our LfD setting, the demonstrations are collected on the robot itself (e.g., via teleoperation or kinesthetic teaching) and so, the observations are elements of the MDP. Hence, we interchangeably refer to trajectories or rollouts as demonstrations. For intuition, we use the term demonstration to refer to a rollout provided to the RL agent as input and represent it by ξ.
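To make the notation concrete, the following minimal Python sketch (our own illustration, not code from the thesis) represents a demonstration as a finite sequence of state-action pairs, together with the discounted return that the RL objective maximizes; the grid-cell state type and string-valued actions are hypothetical choices.

```python
from typing import List, Tuple

State = Tuple[int, int]      # e.g., grid-cell coordinates (hypothetical choice)
Action = str                 # e.g., one of "up", "down", "left", "right"
# A demonstration xi is a rollout: a finite list of (state, action) pairs.
Demonstration = List[Tuple[State, Action]]

def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Total discounted reward sum_t gamma^t * r_t of a finite rollout."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```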
2.2 Signal Temporal Logic
STL is a real-time logic, generally interpreted over a dense-time domain for signals that take values in a continuous metric space (such as R^m).

Definition 2.2.1 (Discrete-Time Signals). Let T = {t0, t1, . . .} be a finite or infinite set of time-points, where ∀i, ti ∈ R≥0. For a compact set D, a discrete-time signal x is a function from T to D.

In our work, we restrict our attention to sets D that are compact subsets of R^m for some positive integer m.

For a trajectory, the basic primitive in STL is a signal predicate µ that is a formula of the form f(x(t)) > 0, where x(t) is the tuple (s, a) of the trajectory x at time t, and f is a function from the signal domain D = (S × A) to R. STL formulas are then defined recursively using Boolean combinations of sub-formulas, or by applying an interval-restricted temporal operator to a sub-formula. The syntax of STL is formally defined as in Equation 2.1.
φ ::= µ | ¬φ | φ ∧ φ | GIφ | FIφ | φUIφ (2.1)
Here, I = [a, b] denotes an arbitrary time-interval, where a, b ∈ R≥0. The semantics of STL are
defined over a discrete-time signal x defined over some time-domain T. The Boolean satisfaction of a
signal predicate is simply True (⊤) if the predicate is satisfied and False (⊥) if it is not, the semantics
for the propositional logic operators ¬, ∧ (and thus ∨, →) follow the obvious semantics. The temporal
operators model the following behavior:
• At any time t, GI (φ) says that φ must hold for all samples in t + I.
• At any time t, FI (φ) says that φ must hold at least once for samples in t + I.
• At any time t, φUIψ says that ψ must hold at some time t′ in t + I, and in [t, t′), φ must hold at all times.
Figure 2.1: Demonstrations in a grid-world.
Example 1. Consider a 6 × 6 grid environment and two demonstrations shown in green (ξg) and yellow
(ξy), in Figure 2.1. Each cell in the grid is represented by a tuple (x, y) indicating its coordinates. The possible
actions in each cell are {↑, ↓, ←, →}. The red cells are regions to be avoided, and a policy is required to start at (0,1) and end at the brown cell (4,5). Consider the specifications: φ1 := F[0,9](dist(x(t), (4, 5)) < 1),
which indicates that the taxi-cab (Manhattan or L1-norm) distance between the current position and goal is
less than 1, and φ2 := G[0,9](distred(x(t)) ≥ 1) where distred is the taxi-cab distance between a cell and its
nearest red cell. φ1 says “reach the brown cell within 9 time-steps” and φ2 represents the phrase “maintain a
distance of at least 1 from the nearest red cell at all times”, which simply translates to “always avoid the red
cells”. Here we consider the signal x to represent only the states of ξg. We see that φ1 is satisfied since the brown
cell (4, 5) occurs in the trajectory within 9 time-steps. Computing distred, we see that the trajectory intersects with red cells, and hence φ2 is not satisfied since there exists a time-step at which the cells coincide. Similarly, we can see that ξy satisfies both requirements since the goal state occurs in its trajectory and its distred is always greater than 1.
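The Boolean checks in Example 1 can be written down directly. The sketch below is our own illustration (with a hypothetical layout of red cells, not the one in Figure 2.1): it evaluates φ1 and φ2 over a trajectory given as a list of grid cells, using the taxi-cab distance.

```python
GOAL = (4, 5)
RED_CELLS = [(2, 2), (3, 3)]          # hypothetical positions of the red cells

def taxicab(a, b):
    """L1 (Manhattan) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def satisfies_phi1(traj, horizon=9):
    """F_[0,9]: the goal is reached at some step within the horizon."""
    return any(taxicab(s, GOAL) < 1 for s in traj[: horizon + 1])

def satisfies_phi2(traj, horizon=9):
    """G_[0,9]: the distance to the nearest red cell is always at least 1."""
    return all(min(taxicab(s, r) for r in RED_CELLS) >= 1
               for s in traj[: horizon + 1])
```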
Table 2.1: Quantitative Semantics of STL

φ              ρ(φ, x, t)
true / false   ⊤ / ⊥
µ              f(x(t))
¬φ             −ρ(φ, x, t)
φ1 ∧ φ2        ⊗(ρ(φ1, x, t), ρ(φ2, x, t))
φ1 ∨ φ2        ⊕(ρ(φ1, x, t), ρ(φ2, x, t))
GI(φ)          ⊗τ∈t+I (ρ(φ, x, τ))
FI(φ)          ⊕τ∈t+I (ρ(φ, x, τ))
φUIψ           ⊕τ1∈t+I (⊗(ρ(ψ, x, τ1), ⊗τ2∈[t,τ1) (ρ(φ, x, τ2))))
2.2.1 Quantitative Semantics
Given an algebraic structure (⊕, ⊗, ⊤, ⊥), we define the quantitative semantics for an arbitrary signal x
against an STL formula φ at time t as in Table 2.1. The quantitative semantics of STL define the robustness
function ρ. A signal satisfies an STL formula φ if it is satisfied at time t = 0. Intuitively, the quantitative
semantics of STL represent the numerical distance of “how far” a signal is away from the signal predicate.
For a given requirement φ, a trajectory τ that satisfies it is represented as τ |= φ and one that does not, is
represented as τ ̸|= φ. In addition to the Boolean satisfaction semantics for STL, various researchers have
proposed quantitative semantics for STL, [38, 59, 1, 100, 107, 6] that compute the degree of satisfaction
(or robust satisfaction values) of STL properties by traces generated from a system. In this work, we use
the following interpretations of the STL quantitative semantics: ⊤ = +∞, ⊥ = −∞, ⊕ = max, and ⊗ = min, as per the original definitions of robust satisfaction proposed in [38, 34].
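As an illustration of these semantics (a sketch of our own, not the tooling used in the thesis), the robustness of a predicate µ := f(x(t)) > 0 and of GI(µ) and FI(µ) over a discrete-time signal reduces to pointwise evaluation, a min, and a max over the shifted interval; the interval is assumed to use integer sample indices and to lie within the signal.

```python
def rho_predicate(f, x, t):
    """Robustness of the predicate f(x(t)) > 0 at time t."""
    return f(x[t])

def rho_globally(f, x, t, interval):
    """Robustness of G_I(f > 0): min of f over the samples in t + I."""
    a, b = interval
    return min(f(x[tau]) for tau in range(t + a, min(t + b, len(x) - 1) + 1))

def rho_eventually(f, x, t, interval):
    """Robustness of F_I(f > 0): max of f over the samples in t + I."""
    a, b = interval
    return max(f(x[tau]) for tau in range(t + a, min(t + b, len(x) - 1) + 1))
```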
Under these semantics, a demonstration that satisfies a specification has a non-negative robustness (score) for that specification, and a demonstration that violates it has a negative robustness
(score). We use STL in our work because it offers a rich set of quantitative semantics that are suitable for
formal analysis and reasoning of systems. The requirements defined with STL are grounded w.r.t. the actual description of the tasks/objectives. Furthermore, STL allows designers or users to specify constraints
that evolve over time and define causal dependencies among tasks. Their semantics allow for the definition
of non-Markovian rewards and the accurate evaluation of trajectories and policies for RL.
In our setting, a task can consist of multiple specifications. However, the robustness of each specification may lie on different scales. Consider for example, a driving scenario, where one specification concerns
the speed of the vehicle, while another concerns the steering angle. Since the measurement scale of speed
is significantly larger than angle (e.g., 60 mph vs. 1.6°), the robustness of the corresponding specifications
also differs significantly. Furthermore, if the maximum robustness a car can achieve is 60 and 1.6 for the
respective specifications, then computing a linear combination involving them would induce bias towards
the speed specification. To avoid this bias, the robustness ranges need to be normalized. Some common
normalization or smoothing techniques are surveyed in [48, 30]. In our work, assuming that the robustness
bounds/limits of the specifications are known or determined beforehand, we use the hyperbolic function tanh
or piece-wise linear functions to bound the robustness values. Specifically, we normalize/scale the robustness values to be bounded in the interval [−∆, ∆], where ∆ can be tuned according to the specification
and environment (task) characteristics.
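As a small sketch of this normalization step (our own code; the per-specification scale is an assumed input rather than something the thesis prescribes), a tanh squashing onto [−∆, ∆] can be written as follows.

```python
import math

def normalize_robustness(rho: float, scale: float, delta: float = 1.0) -> float:
    """Map a raw robustness value onto [-delta, delta] using tanh.

    `scale` is a rough bound on the magnitude of the robustness for this
    specification, assumed known or estimated beforehand.
    """
    return delta * math.tanh(rho / scale)
```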
2.3 Directed Acyclic Graphs
We now introduce the concept of DAGs that will be used widely throughout this dissertation.
Definition 2.3.1 (Directed Acyclic Graph). A directed acyclic graph or DAG is an ordered pair
G = (V, E) where V is a set of elements called vertices or nodes, and E is a set of ordered pairs of
vertices called edges or arcs. An edge e = (u, v) is directed from a vertex u to another vertex v.
A DAG has several main properties:
1. Path: A path p(x, y) or x ❀ y, in G is a set of vertices starting from vertex x and ending at vertex
y by following the directed edges from x. For example, in Figure 2.2, φ1 → φ2 → φ5 is a path.
2. Node value: Each vertex v ∈ V is associated with a real number - value of the vertex, represented
by ν(v).
3. Node weight: Each vertex v ∈ V is associated with a second real number - weight of the vertex,
represented by w(v).
4. Edge weight: Each edge (u, v) ∈ E is associated with a real number - weight of the edge and is
represented by w(u, v). Notice the difference in the number of arguments in the notations of vertex
and edge weights.
5. Ancestor: The ancestors of a vertex v are the set of all vertices in G that have a path to v. Formally, ancestor(v) = {u | p(u, v), u ∈ V }. For example, in Figure 2.2, the ancestors of φ5 are the set {φ1, φ2}, while the ancestor sets of φ1 and φ4 are empty (a small code sketch follows Figure 2.2).
Figure 2.2: Weights on nodes in a DAG. The example DAG has edges φ1 → φ2, φ1 → φ3, and φ2 → φ5 (with edge weights δ12, δ13, δ25), and node weights w(φ1) = 5 − 0 = 5, w(φ2) = 5 − 1 = 4, w(φ3) = 5 − 1 = 4, w(φ4) = 5 − 0 = 5, w(φ5) = 5 − 2 = 3.
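The sketch below (our own, with an adjacency structure chosen to mirror Figure 2.2) shows one simple way to represent such a DAG and compute the ancestor set used later in Equation 3.1.

```python
# Edge u -> v means u is prioritized over v (layout mirrors Figure 2.2).
graph = {
    "phi1": ["phi2", "phi3"],
    "phi2": ["phi5"],
    "phi3": [],
    "phi4": [],
    "phi5": [],
}

def ancestors(dag, node):
    """All vertices that have a directed path to `node`."""
    result = set()
    changed = True
    while changed:
        changed = False
        for u, succs in dag.items():
            if u not in result and any(v == node or v in result for v in succs):
                result.add(u)
                changed = True
    return result

# ancestors(graph, "phi5") == {"phi1", "phi2"}; phi1 and phi4 have no ancestors.
```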
Chapter 3
Temporal Logic-Guided Reward Inference
In this chapter, we seek to formalize learning reward functions from demonstrations and formal task specifications. In Section 3.1, we formally define the problem of extracting rewards and control policies from
demonstrations and high-level task specifications. From Section 3.2 to Section 3.4, we propose and provide
detailed analysis of a framework for the reward inference procedure that is applicable to both discrete
and continuous-space environments, while accounting for the uncertainties in the state and action spaces.
Finally, we evaluate our framework on experiments ranging from discrete to continuous spaces and also
on deterministic and stochastic environments in Section 3.5.
3.1 LfD-STL Framework
In this section, we will first formalize the problem of inferring rewards from demonstrations. We then
propose a framework that aims to solve this problem and provide detailed analysis of the framework’s
components.
Definition 3.1.1 (The Reward-Inference Problem). Given a reward-free MDP M\{R} with possibly unknown transition dynamics, a finite set of high-level task specifications in STL Φ = {φ1, φ2, · · · , φn} and a finite dataset of demonstrations Ξ = {ξ1, ξ2, · · · , ξm}, where each demonstration is defined as in Def. 2.1.2, the goal is to infer a reward function R for M such that the resulting robot policy π, obtained by a suitable RL algorithm, satisfies all the requirements of Φ. (The ideal procedure would involve verification, but we just empirically check for consistency w.r.t. the specifications.)
We propose the LfD-STL framework (Figure 3.1) for learning reward functions from demonstrations
and STL specifications. The high-level objectives or tasks to be performed are first expressed via STL by
conversion from their natural language descriptions. As the tasks may consist of multiple specifications
Φ, the demonstrators prioritize these specifications, discussed in Section 3.2, to obtain a ranking of the
specifications. Demonstrations are collected via teleoperation or kinesthetic teaching to form a dataset Ξ.
The framework thus uses Φ and Ξ to infer a reward function R, which is elaborated in Section 3.3. An
appropriate reinforcement learning algorithm is used on the inferred reward function R to obtain a policy
that is consistent with Φ.

Figure 3.1: Ideal framework for integrating LfD and STL to infer reward functions and robot policy.
The assumptions we use in this work are that: (i) the task can be achieved, even if suboptimally or
imperfectly, under the given specifications Φ, i.e., at least one demonstration satisfying all requirements in
Φ is provided, (ii) the MDP is fully observable, and (iii) the specifications in Φ are accurately aligned with
the task objectives. Furthermore, we only consider the states of a demonstration as our signal and discard
the actions associated with those states when evaluating a specification.
As we can see, there are 3 main stages of computation in the framework. They are briefly described below (a high-level code sketch follows the list) and will be thoroughly examined in their respective sections to follow.
1. Specification priorities or rankings: In this stage or module, the various requirements are assigned
weights that correspond to the degree of importance of the corresponding requirement. This helps in shaping or biasing the rewards and hence the learned policy towards higher priority specifications.
Specification weights encode task priorities, user preferences, performance and skill-based metrics,
human annotations such as Likert ratings, etc.
2. Reward inference: This module uses the specification weights to learn cumulative rewards for each
demonstration accordingly. The demonstrations are then ranked by their cumulative rewards, which
are later scaled by their ranks. These rank-scaled individual cumulative rewards are then combined
into a single reward function for the agent.
3. Policy learning via RL: The final stage of the framework is to use the inferred reward function to
perform RL using appropriate algorithms to extract control policies that satisfy the specifications.
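The following skeleton is a sketch of our own (every helper here is a trivial placeholder, not the thesis implementation); it only summarizes how the three stages fit together, with the real procedures developed in Sections 3.2-3.4.

```python
def rank_specifications(specs):
    return {phi: 1.0 for phi in specs}                # stub: uniform weights

def infer_rewards(demos, specs, weights):
    return lambda state: 0.0                          # stub: zero reward everywhere

def run_rl(env, reward_fn):
    return lambda state: None                         # stub: trivial policy

def lfd_stl(specs, demos, env):
    weights = rank_specifications(specs)              # Stage 1 (Section 3.2)
    reward_fn = infer_rewards(demos, specs, weights)  # Stage 2 (Section 3.3)
    return run_rl(env, reward_fn)                     # Stage 3 (Section 3.4)
```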
3.2 Specification Ranking
As indicated earlier, an MDP can consist of several specifications, each defining a sub-goal or task. Demonstrators may have different preferences over the task specifications, i.e., demonstrators may prioritize
certain subgoals over the others while completing the overall task. A motivating example is in driving,
wherein, different drivers aim to reach their destinations while prioritizing particular driving rules. A
safe/passive driver maintains their average speed well below the limit and a greater distance to the lead
vehicle, while not changing lanes frequently. An aggressive driver on the other hand, can exceed the
speed limit, overtake/change lanes frequently, etc. While the overall goals of these drivers are the same,
the passive driver’s behavior would likely result in a longer time to reach their destination, compared to
the aggressive driver. Hence, we see that some trade-offs are induced due to conflicting specifications.
This trade-off/preference in turn produces dependencies among the specifications.
In AI, a popular representation of such dependencies is via directed graphs [102], particularly acyclic
ones as defined in Section 2.3. Thus, we make use of directed acyclic graphs to encode specification dependencies. The specification set Φ is represented by a DAG G = (V, E), wherein, each node v ∈ V
corresponds to a task specification φ ∈ Φ. Each directed edge indicates that its origin (source) node is prioritized over its end (target) node. In the field of formal methods and verification, specifications
are typically categorized into two types:
1. Hard specifications, which correspond to the safety and mission-critical aspects of the main task.
The specifications are usually of the form G(φ) and require the system (i.e., learner agent) to satisfy
them at all times. Hard specifications correspond to the safety properties of the system. Examples of
this are: a robot should always operate/remain within its operational workspace, the joint velocities of a
robot must always be within a specific range [va, vb], etc. Certain bounded liveness properties of the
task can also fall into this category, which indicate that the agent must eventually keep performing
‘good’ behaviors. A patrolling robot that periodically keeps visiting a region is an example of the
liveness requirement.
2. Soft specifications usually correspond to the optimality attributes of a system such as performance,
efficiency, etc. For a goal-reaching task, reaching the goal in the fewest steps (least time) is an
example of a soft specification.
Typically, hard specifications are required to be prioritized over the soft specifications. Thus, we require a weighting scheme that captures this priority/dependency. In this chapter, we assume that a DAG
representing the specifications is provided beforehand. The priorities can be obtained via surveys, questionnaires, etc. In the next chapter (Chapter 4), we relax this assumption and automatically infer the DAG
directly from demonstrations. Given the STL specifications in the set Φ, we now express specification priorities as weights. We denote the hard specifications as ΦH ⊂ Φ and soft specifications as ΦS = Φ\ΦH.
The ordering/sorting of specifications is not strict, as it is just for notational convenience. The DAG representing Φ typically has a path from every node in ΦH to every node in ΦS. The weight on each node in
G is computed using Equation 3.1 and an example is shown in Figure 2.2.
w(φ) = |Φ| − |ancestor(φ)| (3.1)
This equation represents the relative importance of each specification based on the number of dependencies that need to be satisfied. These computed weights are normalized via linear scaling (e.g., by a norm) or an exponential function such as softmax to give higher importance to "harder" specifications. We thus obtain a weight vector for all specifications Φ as wΦ = [w(φ1), w(φ2), · · · , w(φn)]^T. For a specification φ ∈ Φ and a demonstration ξ ∈ Ξ defined as in Def. 2.1.2, the value ρ(φ, ξ, t) represents how well the demonstration satisfied the given specification from time t, which is the quality of the demonstration. To evaluate the entire trajectory, the robustness is defined at t = 0, i.e., ρ(φ, ξ, 0), and is implicitly denoted by ρ(φ, ξ). For a demonstration ξ, we have an array of evaluations over Φ, given by ρ̂ξ = [ρ(φ1, ξ), · · · , ρ(φn, ξ)]^T. Each demonstration is then assigned a cumulative robustness/fitness value rξ based on these weights, given by Equation 3.2.

rξ = Σ_{i=1}^{n} w(φi) · ρ(φi, ξ) =⇒ rξ = wΦ^T · ρ̂ξ (3.2)
Note that the robustness values are normalized as discussed in Section 2.2, making it appropriate to
linearly combine robustness values of specifications since they are on similar scales.
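As a concrete sketch of Equations 3.1 and 3.2 (our own code; softmax is just one of the normalization options mentioned above, and the ancestor counts are assumed to be computed from the DAG as in Section 2.3):

```python
import math

def spec_weights(ancestor_counts):
    """ancestor_counts[phi] = |ancestor(phi)| in the specification DAG."""
    n = len(ancestor_counts)
    raw = {phi: n - c for phi, c in ancestor_counts.items()}     # Eq. 3.1
    z = sum(math.exp(v) for v in raw.values())
    return {phi: math.exp(v) / z for phi, v in raw.items()}      # softmax scaling

def cumulative_robustness(weights, rho):
    """Eq. 3.2: r_xi = w_Phi^T . rho_hat_xi, with rho[phi] = rho(phi, xi, 0)."""
    return sum(w * rho[phi] for phi, w in weights.items())

# Example with the DAG of Figure 2.2 (ancestor counts 0, 1, 1, 0, 2):
w = spec_weights({"phi1": 0, "phi2": 1, "phi3": 1, "phi4": 0, "phi5": 2})
```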
3.3 Reward Inference Mechanism
We now require a mechanism that can combine each demonstration’s cumulative robustness/fitness into
a global reward function for the learner agent. To proceed, we first categorize demonstrations based on
whether they satisfy or violate the specification set Φ. Thus, we define the two categories of demonstrations as follows.
Definition 3.3.1 (Good Demonstrations). A demonstration is labeled good if it satisfies the specifications Φ = ΦH ∪ ΦS. Every state-action pair occurring in the sequence satisfies all STL requirements. A good demonstration, denoted by ξ+, does not have to be optimal (i.e., have high robustness values) for each specification; rather, it just should not violate any specification.

Definition 3.3.2 (Bad Demonstrations). A demonstration is considered bad if it violates any hard specification of ΦH. A bad demonstration, denoted by ξ−, consists of at least one state or state-action pair that violates a hard specification ψ, i.e., ∃j such that (sj, aj) ̸|= ψ.
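In code, this labeling amounts to checking the sign of the (normalized) robustness values; the sketch below is our own reading of Definitions 3.3.1 and 3.3.2, with robustness ≥ 0 taken as satisfaction.

```python
def is_bad(rho_hard):
    """Def. 3.3.2: bad if any hard specification is violated (negative robustness)."""
    return any(r < 0 for r in rho_hard.values())

def is_good(rho_all):
    """Def. 3.3.1: good if every specification (hard and soft) is satisfied."""
    return all(r >= 0 for r in rho_all.values())
```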
The reward assignment process is adapted to the type of demonstration. Since demonstrations can
be noisy, the reward inference algorithm needs to account for this noise/stochasticity in the environment dynamics, thus producing stochastic reward functions. Rationally, one would expect an agent to
perform a given task correctly by following the good demonstrations and hence the rewards would be
based on such demonstrations. We first construct the DAG-based arrangement of specifications, obtain
the weights/priorities and also the cumulative robustness for demonstrations as described in Section 3.2.
Given a demonstration ξ = ⟨(si, ai)⟩ for i = 1, . . . , L and its final DAG robustness rξ, we derive a procedure to estimate the "true" reward r̂ξ of the demonstration as if the transitions were deterministic. In other words,
rˆξ = rξ is the reward that the agent would maximize if it were in a deterministic environment. When the
environment is stochastic, rˆξ should increase along the demonstrations to prevent the agent from moving
away from the states observed in such demonstrations, i.e., the rewards for a demonstration behave as
attractors because they persuade the agent to follow the good demonstration as much as possible. Hence,
as the environment uncertainty increases, rˆξ also increases. Here, we consider the states and actions as
they are observed in a demonstration ξ. The agent starts in state s1 and executes the corresponding action
a1 as seen in ξ. Assuming Markovian nature of the environment’s stochastic dynamics, for subsequent
state-action tuples in ξ we have,
Pr(sL | τ = (si)_{i=1}^{L−1}, aL−1) = ∏_{l=1}^{L−1} Pr(sl+1 | sl, al) (3.3)

where each ai is the action indicated in the demonstration and τ is the (partial) trajectory/demonstration till a particular state. (In the above derivation, τ = (si) implicitly means τ = (si, ai) since it represents a rollout (demonstration); the ai notation was dropped for simplicity.) Hence, the true reward r̂ξ can now be expressed as follows:

Pr(sL | τ = (si)_{i=1}^{L−1}, aL−1) · r̂ξ = rξ =⇒ r̂ξ = rξ / ∏_{l=1}^{L−1} Pr(sl+1 ∈ ξ | (sl, al) ∈ ξ) (3.4)
This equation reflects that r̂ξ increases as uncertainty increases, i.e., as Pr(s′ | s, a) → 0, in the environment. In order to account for the stochasticity, we define Z(s, as) as the set of all states that are reachable from a given state s in one step (since it is an MDP) by performing all actions other than its corresponding action as appearing in a demonstration. We first present the reward assignments for environments with discrete spaces and later discuss its extension to more realistic (continuous) spaces.
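A direct sketch of Equations 3.3 and 3.4 (our own code; `transition_prob` is an assumed callable giving Pr(s′ | s, a), and every demonstrated transition is assumed to have non-zero probability):

```python
def true_reward(r_xi, demo, transition_prob):
    """Scale the DAG reward r_xi by the inverse probability of the demonstrated
    transition sequence (Eq. 3.4), using the Markov factorization of Eq. 3.3."""
    prob = 1.0
    for (s, a), (s_next, _) in zip(demo[:-1], demo[1:]):
        prob *= transition_prob(s, a, s_next)
    return r_xi / prob
```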
3.3.1 Reward Assignments in Discrete Environments
In the discrete spaces, the states and actions are unique, and the rewards for all states are initially assigned
to 0. The reward assignment to observations varies according to the type of demonstration under consideration. We first discuss the extraction of local rewards which correspond to the reward function for
individual demonstrations. Later, we discuss how to combine each demonstration’s local reward into a
global reward function for the RL agent.
Good Demonstrations. For good demonstrations, the reward is assigned to every state in the demonstration. For all state-action pairs occurring in a demonstration ξ+, rξ+(sl) describes the reward assigned to state sl ∈ ξ+. The reward function is given by Equation 3.5.

rξ+(sl) = Pr(sl | sl−1, al−1) · (l/L) · r̂ξ+,   ∀ sl−1, sl, al−1 ∈ ξ+
rξ+(s′) = Pr(s′ | sl−1, a) · rξ+(sl),   s′ ∈ Z(sl−1, al−1) − {sl}; a ∈ A\{al−1} (3.5)

where l ∈ [1, L]. When l = 1 (initial or base case), Pr(s1 | s0, a0) represents the probability of the agent starting in the same state as the demonstrations, and (s0, a0) is introduced for notational convenience.
All other states are assigned a reward of zero. Good demonstrations have strictly non-negative rewards
as they obey all specifications. The rewards in such demonstrations behave as attractors or potential fields
to persuade the agent to follow the good demonstrations as much as possible. The shape of this reward
function resembles a Gaussian distribution.
Bad Demonstrations. Logically, for bad demonstrations, instead of assigning rewards to each state of the demonstration, the reward is only assigned to the states or state-action pairs violating the specifications. Intuitively, this penalizes the bad states while ignoring the others, since the good states may be part of another demonstration or of the learned robot policy that satisfies all requirements. A bad demonstration ξ− will have negative robustness/fitness values and hence a negative cumulative reward that is amplified as the true reward r̂ξ as per Equation 3.4. Let sbad ⊆ ξ− be the states at which a violation of a hard specification ψ occurs, i.e., sbad = {sj | (sj, aj) ̸|= ψ}; then the reward assignment is as shown in Equation 3.6.

rξ−(sl) = Pr(sl | sl−1, al−1) · r̂ξ−,   if sl ∈ sbad
rξ−(s′) = Pr(s′ | sl−1, a) · rξ−(sl),   s′ ∈ Z(sl−1, al−1); a ∈ A\{al−1} (3.6)
The rewards in such demonstrations behave as repellers to deflect the agent from bad states. Once
again, all other states are assigned a reward of zero. The shape of this reward function resembles an
inverted Gaussian distribution around the bad states.
We now show that the reward assignment mechanism is consistent with the temporal logic task specifications. For a demonstration ξ, the induced reward rξ(s) is the reward induced by demonstration ξ for any state s ∈ S, computed via Equation 3.5 and Equation 3.6. Let rξ(s′) ≺ rξ(s) denote that rξ(s) < rξ(s′) if ξ is a bad demonstration and rξ(s) > rξ(s′) if ξ is a good demonstration, for s ∈ ξ and s′ ∉ ξ.
Lemma 3.3.1. For any demonstration ξ, ∀ sl ∈ ξ, rξ(s′) ≺ rξ(sl).
Proof Sketch. The sum of transition probabilities in a state over all actions is 1. Hence, the product of 2 of these probabilities (as for rξ(s′) in Equation 3.5 and Equation 3.6) is less than either of them and is a positive quantity. Therefore, in good demonstrations, the neighbor states s′ have lower rewards than the observed state sl, which influences the agent to not prefer states not seen in good demonstrations and reflects the possibility that the neighbors are bad states. For bad demonstrations, the neighbors s′ have higher rewards than the bad states but are still negative in value, which influences the agent to move away from bad states and reflects the chance that these neighbors could be good states.
3.3.1.1 Special Cases
In this section, we derive the reward formulation for deterministic and stochastic environments used in
the experimental evaluations.
Case 1: Deterministic. In the case of deterministic transitions, the agent follows the selected action (i.e., Pr(s′ | s, a) = 1) while all other actions have probability 0. As a result, the probability of transitioning to the neighbor states in one step via the other actions is 0. By Equation 3.3, r̂ξ = rξ. The rewards for each type of demonstration are as follows:

• Good demonstration:
rξ+(sl) = (l/L) · r̂ξ+,   ∀ sl, al ∈ ξ+; l ∈ [1, L] (3.7)

• Bad demonstration:
rξ−(sl) = r̂ξ− if sl ∈ sbad, and 0 otherwise (3.8)
Case 2: Uniform stochasticity. Let $p \in [0, 1)$ denote the uncertainty of the environment: the agent executes a selected action $a \in A$ with probability $Pr(s' \mid s, a) = 1 - p$ and, due to uncertainty, randomly follows one of the remaining $N = |A \setminus \{a\}|$ actions uniformly, i.e., with probability $p/(N-1)$. The sum of probabilities of all transitions or actions is 1. Thus, for a demonstration $\xi$, the agent follows $\xi$ with probability $(1-p)^{L-1}$, by Equation 3.3. Substituting this in Equation 3.4, the true reward is:
$$
\hat{r}_\xi = \frac{r_\xi}{(1-p)^{L-1}}
\tag{3.9}
$$
With regard to the "attractor-repeller" intuition of rewards stated earlier, as the uncertainty $p$ increases, $\hat{r}_\xi$ also increases, influencing the agent to follow along the demonstrations. For each type of demonstration, the rewards are described below:
Good demonstrations
$$
\begin{aligned}
r_{\xi^+}(s_l) &= (1-p) \cdot \frac{l}{L} \cdot \hat{r}_{\xi^+}, && \forall s_l \in \xi^+ \\
r_{\xi^+}(s') &= \frac{p}{N-1} \cdot (1-p) \cdot \frac{l}{L} \cdot \hat{r}_{\xi^+}, && s' \in Z(s_{l-1}, a_{l-1}) \setminus \{s_l\};\ a \in A \setminus \{a_{l-1}\}
\end{aligned}
\tag{3.10}
$$
where $l \in [1, L]$. For the initial state, $Pr(s_1 \mid s_0, a_0)$ could be $1 - p$, or simply 1 if the agent is known to always start from that state. From the above equations and Lemma 3.3.1, $r_{\xi^+}(s')$ is guaranteed to be lower than $r_{\xi^+}(s_l)$ since $0 < p < 1 \implies 0 < p/(N-1) < 1/(N-1) < 1$. By applying simple inequality rules, we can show that $(1-p) \cdot p/(N-1) < (1-p)$, which guarantees that the reward is propagated in a decreasing manner to neighboring states not seen in the demonstrations.
Bad demonstrations
$$
\begin{aligned}
r_{\xi^-}(s_l) &= (1-p) \cdot \hat{r}_{\xi^-}, && \text{if } s_l \in s_{bad} \\
r_{\xi^-}(s') &= \frac{p}{N-1} \cdot (1-p) \cdot \hat{r}_{\xi^-}, && s' \in Z(s_{l-1}, a_{l-1});\ a \in A \setminus \{a_{l-1}\}
\end{aligned}
\tag{3.11}
$$
A similar guarantee holds here for the reachable states as well. The rewards in all other states are zero. We use this model for all our stochastic discrete-environment experiments.
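To make these assignment rules concrete, the following is a minimal Python sketch of Equations 3.10 and 3.11 for a grid-world; the helper names, the 4-connected neighborhood standing in for the reachable set Z, and the reading that the residual probability p is split over the other |A| - 1 actions are assumptions of the sketch, not part of the framework itself:

import numpy as np

def grid_neighbors(cell, shape):
    # Approximation of the 1-step reachable set Z(s, a): 4-connected neighbors (illustrative)
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < shape[0] and 0 <= j < shape[1]]

def assign_rewards(demo, r_hat, p, n_actions, shape, good=True, bad_states=()):
    # Reward assignment under uniform stochasticity p (Equations 3.10 and 3.11).
    R = np.zeros(shape)
    L = len(demo)
    for l, state in enumerate(demo, start=1):
        if good:
            r_obs = (1 - p) * (l / L) * r_hat      # reward grows toward the goal
        elif state in bad_states:
            r_obs = (1 - p) * r_hat                # r_hat < 0 for a bad demonstration
        else:
            continue                               # all other states stay at zero
        R[state] += r_obs
        for nbr in grid_neighbors(state, shape):   # reachable states not on the demonstration
            # residual probability p shared by the other actions (the p/(N-1) factor)
            R[nbr] += (p / (n_actions - 1)) * r_obs
    return R

# e.g., one good demonstration on a 5 x 5 grid with 20% stochasticity and 4 actions
good_demo = [(4, 0), (3, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3), (0, 4)]
R_good = assign_rewards(good_demo, r_hat=1.0, p=0.2, n_actions=4, shape=(5, 5))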
The case of $p = 1$, or $p \to 1$, corresponds to an environment that is completely non-deterministic or adversarial (i.e., the agent never transitions to the desired state or performs the chosen action). In this case, by computing the limits, we can see that the rewards for all states tend to $+\infty$ for good demonstrations or to $-\infty$ for bad demonstrations. However, such scenarios are unrealistic in our setting, since we assume that the intent of the demonstrations is to teach the agent to learn rational behaviors and achieve the task. In such a hypothetical scenario, the demonstrator may adapt and provide adversarial actions so that the agent performs the originally intended behavior. While we plan to investigate such adversary-influenced demonstrations in future work, they are currently beyond the scope of this thesis.
We emphasize that our approach is generic to any $Pr(s' \mid s, a) \in [0, 1)$ and to non-uniform transition probabilities. The cases described above show that our reward mechanism is complete for stochastic environments and non-adversarial agents. The probabilistic rewarding scheme can, however, assign positive rewards to bad states in the case of good demonstrations (and negative rewards to good states in the case of bad demonstrations), leading to a reward discrepancy. This is compensated for when the temporal logic-guided RL algorithm (described in Section 3.4) uses the robustness of the partially learned policy w.r.t. the hard specifications during learning to detect and rectify any violations. Alternatively, providing more demonstrations would also overcome discrepancies in the rewards, but this is not required.
3.3.1.2 Global Reward Function
So far, we have described the method to extract local rewards (i.e., a reward function for each demonstration independently). We now require a mechanism to combine all local rewards succinctly into a single function on which RL can be performed. Let $R_\xi$ represent the reward induced by a demonstration $\xi$ over all states in the environment state space. Once the states in each demonstration have been assigned rewards, the next objective is to rank the demonstrations and combine all the rewards from the demonstrations into a cumulative reward that the learner agent will use for finding the desired policy. The demonstrations are sorted by their cumulative fitness or robustness values to obtain rankings. The learner reward is initialized to zero for all the states in the environment. The resulting reward for the learner is given as $\sum_{j=1}^{m} \mathrm{rank}(\xi_j) \cdot R_{\xi_j}$ and then normalized. This combination affects only the states that appear around the demonstrations, and the intuition is that preference is given to higher-ranked demonstrations. By the definition of robustness and its use in the reward inference, the "better" the demonstration, the higher the reward. In other words, the rewards are non-decreasing as we move from bad demonstrations to good demonstrations. Hence, good demonstrations will strictly have higher reward
values and are ranked higher than bad demonstrations. Additionally, this reward can be inferred for one
environment and directly transferred to another similar environment.
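As an illustration, this rank-weighted combination could be implemented as in the following sketch (numpy only; the normalization at the end is one possible choice and the variable names are illustrative):

import numpy as np

def combine_local_rewards(local_rewards, fitness):
    # Weight each per-demonstration reward grid R_xi by the rank of its demonstration
    # (rank 1 = lowest cumulative fitness, rank m = highest) and sum them up.
    order = np.argsort(fitness)
    ranks = np.empty(len(fitness))
    ranks[order] = np.arange(1, len(fitness) + 1)
    R = np.zeros_like(local_rewards[0])
    for rank, R_xi in zip(ranks, local_rewards):
        R += rank * R_xi                      # sum_j rank(xi_j) * R_{xi_j}
    scale = np.max(np.abs(R))
    return R / scale if scale > 0 else R      # normalize (one possible choice)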
3.3.2 Reward Assignments in Continuous Environments
For continuous state spaces, defining rewards only for states encountered in a demonstration is very restrictive: due to the continuous nature of the state and/or action spaces and numerical accuracy errors, the observed demonstrations will very rarely contain exactly the same state and/or action values. Additionally, providing demonstrations in this space is already subject to uncertainties. For continuous spaces, we first compute the demonstration rewards from DAG-specifications and assign rewards to the demonstration states as described in the previous section. We then rank the demonstrations and scale the assigned rewards by the corresponding demonstration ranks. The next step is to show how rewards from different demonstrations are generalized and combined over the state space. Since the states in the demonstrations are rarely the same, simply performing a rank-based weighted sum of state rewards, as in the discrete case, would be tedious due to the large state space. To address this, we collect the rank-scaled state rewards in a dataset and perform regression. For each demonstration, we have a collection of tuples of the form (state, reward) or (state, action, reward), and we can then parameterize the rewards as $r(s, \theta)$ or $r(s, a, \theta)$, respectively. Finally, we organize these points in a dataset that is used to learn a function approximation $f_\theta : S \to \mathbb{R}$ or $f_\theta : S \times A \to \mathbb{R}$. Function approximations can be learned via common regression techniques such as Gaussian processes or neural networks (NN), e.g., feed-forward deep NNs or convolutional NNs, that take as input the features of a state or state-action pair and output a single scalar reward. This method can also be used in large discrete state spaces, since the rewards can be sparse and many states have a reward of zero by default.

From the reward definitions in the previous sections, it is fairly straightforward to extend the state-based rewards to state-action-based rewards, as in the case of stochastic policies with a discrete action space.
For discrete actions, it is straightforward to compute the reachable set. For continuous actions, in order to compute the reachable set from a given observed state with bounded time and actions, we model each observed state using a (multivariate) Gaussian distribution and then generate samples around it. These samples correspond to the reachable set, and we can compute the probability of each sample belonging to that distribution via the Mahalanobis distance [28], which gives us the transition probabilities. Specifically, instead of using each of the tuples in their raw form as training data, we represent them as samples of a (multivariate) Gaussian distribution with mean $s$ or $(s, a)$ and a scaled identity covariance matrix representing the noise in the observations. We then generate a fixed number of samples from the distribution of each observed state to represent the reachable set. For each of the generated samples, we can estimate the probability of that sample belonging to the distribution of the observed state, which serves as the transition function used to assign rewards as described earlier.
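A minimal sketch of this construction, assuming states are numpy vectors, a diagonal covariance σ²I, and a scikit-learn regressor for the function approximation; the normalized Gaussian likelihood below is used as a stand-in for the Mahalanobis-distance-based transition probability:

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.neural_network import MLPRegressor

def build_reward_dataset(demo_states, demo_rewards, sigma=0.03, eta=20, seed=0):
    # For each observed (state, reward) pair, draw eta samples from N(s, sigma^2 I)
    # as a stand-in for the reachable set, and weight the reward by each sample's
    # normalized likelihood under that Gaussian (a proxy for the transition probability).
    rng = np.random.default_rng(seed)
    dim = len(demo_states[0])
    cov = sigma ** 2 * np.eye(dim)
    X, y = [], []
    for s, r in zip(demo_states, demo_rewards):
        dist = multivariate_normal(mean=s, cov=cov)
        samples = rng.multivariate_normal(s, cov, size=eta)
        weights = dist.pdf(samples) / dist.pdf(s)          # in (0, 1]
        X.append(samples); y.append(weights * r)
        X.append(np.asarray(s).reshape(1, -1)); y.append(np.array([r]))  # keep the observed state
    return np.vstack(X), np.concatenate(y)

# fit a reward approximation f_theta : S -> R (the regressor choice is illustrative)
states = [np.array([1.0, 2.0]), np.array([3.0, 2.5])]
rewards = [0.5, 1.0]
X, y = build_reward_dataset(states, rewards)
f_theta = MLPRegressor(hidden_layer_sizes=(100, 200), max_iter=2000).fit(X, y)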
3.4 Learning Policies from Inferred Rewards
In order to learn a policy from the inferred rewards, we can use any of the existing relevant RL algorithms with just two modifications that accelerate training: (i) reward observation step: during each step of an episode, we record the candidate trajectory of the agent (i.e., the current trajectory rolled out so far) and evaluate it w.r.t. all the hard specifications $\Phi_H$. The sum of the robustness values of the candidate trajectory for each hard specification is added to the observed reward. This behaves similarly to potential-based reward shaping [85]. In the case when a bad demonstration is ranked higher than a good demonstration, the algorithm takes this into account and compensates for the misranking in this step; and (ii) episode termination step/condition: in addition to the environment-based termination (e.g., goal reached), we also terminate the episode when the candidate trajectory violates any hard specification. These two modifications lead to faster and possibly safer learning/exploration. This is especially helpful
when agents interact with the environment to learn and the cost of learning unsafe behaviors is high (e.g.,
the robot can get damaged, or may harm humans).
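The two modifications can be wrapped around any episodic RL loop; the sketch below assumes a Gym-style environment (reset/step) and a robustness(phi, trajectory) monitor such as the one provided by an STL tool, neither of which is spelled out here:

def run_episode(env, policy, hard_specs, robustness):
    # One training episode with (i) robustness-shaped rewards and
    # (ii) early termination when any hard specification is violated.
    state = env.reset()
    trajectory, transitions, done = [], [], False
    while not done:
        action = policy.act(state)
        next_state, env_reward, done, _ = env.step(action)
        trajectory.append((state, action))
        # (i) add the robustness of the candidate trajectory w.r.t. all hard specs
        shaped_reward = env_reward + sum(robustness(phi, trajectory) for phi in hard_specs)
        # (ii) terminate the episode on any hard-specification violation
        if any(robustness(phi, trajectory) < 0 for phi in hard_specs):
            done = True
        transitions.append((state, action, shaped_reward, next_state, done))
        state = next_state
    return transitions        # fed to the underlying RL update (e.g., Q-Learning)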
This modified specification-guided RL algorithm, denoted $RL_{STL}$, can be extended to MDPs with multiple sequential objectives. $RL_{STL}$ incorporates RL with STL monitoring-in-the-loop for safer exploration and learning from imperfect demonstrations. In order to learn a policy for multiple objectives, consider a set of goal states $Goals = \{g_1, g_2, \cdots, g_k\}$, where $k$ is the number of objectives or goals. Some specifications can require the robot to achieve the goals in a particular sequential order, while others may require the robot to achieve the goals without any preference on the order. In the case of arbitrary ordering, the number of ways to achieve this is $k!$, hence all permutations of the goals are stored in a set. (Partial ordering helps reduce complexity: if a particular ordering is required, this enumeration step can be replaced by the desired order, and the complexity reduces from $k!$ to 1.) For each permutation or ordering of the goals $\hat{o} = \langle g_1, g_2, \cdots, g_k \rangle$, a policy is extracted that follows the order $\pi_{\hat{o}} : \text{start} \xrightarrow{RL_{STL}} g_1 \xrightarrow{RL_{STL}} g_2 \xrightarrow{RL_{STL}} \cdots \xrightarrow{RL_{STL}} g_k$. Each of the final concatenated policies $\pi_{\hat{o}}$ is recorded and stored in a dataset represented by $\Pi$. At this stage, the policies in $\Pi$ all satisfy the hard requirements $\Phi_H$ and hence all are valid/feasible trajectories. Finally, the policy with maximum robustness w.r.t. the soft requirements $\Phi_S$ is chosen, which imitates the user preferences. This procedure is formalized in Algorithm 1.
3.5 Experiments
For our experiments, we first validate our framework on discrete and deterministic environments, and
later move on to stochastic and continuous domains.
3.5.1 Discrete-Space Environments
Single-Goal Grid-World. We created a grid-world environment $E$ consisting of a set of states $S = \{start, goals, obstacles\}$. Various map sizes were used, ranging from $4 \times 4$ to $15 \times 15$.
Algorithm 1: Learning multi-objective robot policy from inferred rewards
Input: Ξ := set of demonstrations; Φ := set of specifications
Result: Learns a multi-objective robot policy from inferred rewards
begin
    O ← PermutationSet(Goals)          // generates all permutations of the goal or objective states
    R ← reward function from LfD-STL
    Π ← ∅
    for ô ∈ O do                       // ô is an ordering of goals ⟨g1, g2, ..., gk⟩
        π_ô ← Run RL_STL using (R, Init, g1)
        for i ← 1 to |Goals| − 1 do
            π_ô ← π_ô + Run RL_STL using (R, g_i, g_{i+1})
        Π ← Π ∪ {π_ô}                  // the resulting policies satisfy all hard requirements
    π* ← argmax_{π ∈ Π} Σ_{φ ∈ Φ_S} ρ(φ, π, t)   // policy that maximizes robustness of all soft requirements
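The control flow of Algorithm 1 could be realized roughly as in the sketch below; rl_stl (the specification-guided RL routine above) and robustness (the STL monitor) are assumed rather than defined, and a multi-goal policy is represented simply as the ordered list of its segment policies:

from itertools import permutations

def learn_multi_goal_policy(goals, init, reward_fn, soft_specs, rl_stl, robustness):
    # Sketch of Algorithm 1: enumerate goal orderings, learn one policy segment per
    # consecutive pair of goals with rl_stl, and keep the ordering whose concatenated
    # policy maximizes the robustness of the soft specifications (hard specifications
    # hold by construction of rl_stl).
    candidates = []
    for ordering in permutations(goals):          # k! orderings; a single one if the order is fixed
        segments = [rl_stl(reward_fn, init, ordering[0])]
        segments += [rl_stl(reward_fn, g0, g1) for g0, g1 in zip(ordering, ordering[1:])]
        candidates.append(segments)               # a multi-goal policy = its ordered segments
    return max(candidates,
               key=lambda pi: sum(robustness(phi, pi) for phi in soft_specs))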
The obstacles were assigned randomly. The distance metric used for this environment is the Manhattan distance (i.e., the $L^1$ norm), and the STL specifications for this task are defined as follows:
1. Avoid obstacles at all times (hard requirement): $\varphi_1 := G_{[0,T]}(d_{obs}[t] \geq 1)$, where $T$ is the length of a demonstration and $d_{obs}$ is the minimum distance of the robot from the obstacles computed at each step $t$.
2. Eventually, the robot reaches the goal state (soft requirement): $\varphi_2 := F_{[0,T]}(d_{goal}[t] < 1)$, where $d_{goal}$ is the distance of the robot from the goal computed at each step. $\varphi_2$ depends on $\varphi_1$.
3. Reach the goal as fast as possible (soft requirement): $\varphi_3 := F_{[0,T]}(t \leq T_{goal})$, where $T_{goal}$ is the upper bound on the time required to reach the goal, computed by running a breadth-first search from the start to the goal state, since the shortest policy must take at least $T_{goal}$ steps to reach the goal. $\varphi_3$ depends on both $\varphi_1$ and $\varphi_2$ in the DAG. (A sketch of how such robustness values can be computed is shown after this list.)
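In the experiments these formulas are evaluated with the Breach toolbox; purely for illustration, the min/max quantitative semantics of G and F on a discrete trajectory could be computed directly as below (the grid layout and distance helpers are assumptions of the example):

def rho_always_ge(signal, threshold):
    # Robustness of G_[0,T](signal[t] >= threshold): worst-case margin over the trace.
    return min(v - threshold for v in signal)

def rho_eventually_lt(signal, threshold):
    # Robustness of F_[0,T](signal[t] < threshold): best-case margin over the trace.
    return max(threshold - v for v in signal)

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def evaluate_demo(demo, obstacles, goal):
    d_obs = [min(manhattan(s, o) for o in obstacles) for s in demo]
    d_goal = [manhattan(s, goal) for s in demo]
    rho1 = rho_always_ge(d_obs, 1)       # phi_1: always keep distance >= 1 from obstacles
    rho2 = rho_eventually_lt(d_goal, 1)  # phi_2: eventually reach the goal
    return rho1, rho2

# example on a 5 x 5 grid
demo = [(4, 0), (3, 0), (2, 0), (1, 0), (0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]
print(evaluate_demo(demo, obstacles=[(2, 2), (1, 3)], goal=(0, 4)))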
Figure 3.2: PyGame user interface. (a) Grid-world game setup. (b) Example user demonstration (in green).
STL specifications are defined and evaluated using the MATLAB toolbox Breach [32]. A grid-world point-and-click game was created using the PyGame package that shows the locations of the start, obstacles and goals. The users provide demonstrations in the PyGame GUI by clicking on their desired states, with the task of reaching the goal state from the start without hitting any obstacles. A screenshot of the grid-world created using PyGame is shown in Figure 3.2 along with a sample demonstration in green. The task is to select or click on cells starting from the dark blue cell (bottom-left) and ending in the light blue cell (top-right). The red cells represent "avoid" regions or obstacles. Due to the stochasticity, of which the users are unaware, a clicked state may not always end up at the desired location. The user then proceeds to click from that unintended state until they quit or reach the goal.
Deterministic Transitions. This is the case when p = 0. For the 5 × 5 map, we used m = 2 demonstrations (1 good and 1 bad) from a single user. The demonstrations and resulting robot policy
are shown in Figure 3.3. In the good demonstration, the reward is assigned to every state appearing
in the demonstration while other states are kept at zero. The rewards increase from start state to the
goal so as to guide the robot towards the goal. In the bad demonstration, one of the states coincides
with an obstacle and only that state is penalized. The final robot reward is a linear combination of the
demonstration rewards. The blue heatmap figures represent the rewards learned from the demonstrations
(darker colors represent higher rewards). Since hitting a red obstacle is penalized heavily by the hard
requirement compared to other states, the rewards in the other safe states and goal state appear similar
in value due to the scaling difference. For grid sizes 7 × 7 and 10 × 10, similar results were observed
and each grid had m = 4 demonstrations (2 good, 1 bad and 1 incomplete). The RL algorithm used for this environment is Q-Learning [109]. The number of episodes used for training ranged from 3000 to 10000, depending on the complexity (grid size, number and locations of obstacles) of the grid-world. The discount factor γ was set to 0.99 and an ϵ-greedy action-selection strategy was used with ϵ = 0.4. The learning rate used in the experiments was α = 0.1.
Stochastic Dynamics. This is the case when 0 < p < 1. Four (m = 4) demonstrations from a single user were collected, of which 2 are shown in Figure 3.4, along with the resulting robot policy under a 20% stochastic environment. We obtained similar results for the other, larger grid sizes considered in
this experiment. We used Double Q-Learning [51], which is appropriate for stochastic settings, with the
modifications to the algorithm at 2 steps (reward update and termination) as described in Section 3.4. The
number of episodes varied according to the environment complexity (grid size, number and locations of
obstacles) of the grid-world. The discount factor γ was set to 0.8 and an ϵ-greedy strategy with decaying ϵ was used. A learning rate of α = 0.1 was found to work reasonably well. Over N = 100 trials,
with 20% environment uncertainty, the policy was found to reach the goal on average about 81% of the
time with the learned rewards.
OpenAI Gym FrozenLake. The proposed method was tested on the OpenAI Gym [19] FrozenLake environment with both 4 × 4 and 8 × 8 grid sizes. We generated m = 4 demonstrations by solving the environment using Q-Learning with different hyperparameters to generate different policies. We also modified the FrozenLake grid to relocate the holes, while the goal location remained the same.
Figure 3.3: Results for the 5 × 5 grid-world. (a) Demo 1 (good), (b) Demo 2 (bad), (c) robot policy, (d) ground-truth reward, (e) MCE-IRL reward. In (a)-(c), the left figures represent learned rewards; the right figures show the grid-world with start state (light blue), goal (dark blue), obstacles (red) and demonstration/policy (green). (e) shows MCE-IRL rewards with over 40 optimal demonstrations.
Figure 3.4: Results for the 5 × 5 grid with 20% stochasticity. (a) Demo 1 (optimal and good), (b) Demo 2 (bad), (c) learned reward and robot policy, (d) ground-truth reward function, (e) MCE-IRL with 300 optimal demonstrations. Inferred rewards are shown in the left figures; the right figures show the grid-world with start state (light blue), goal (dark blue), obstacles (red) and demonstration/policy (green).
The specifications used are similar to those of the single-goal grid-world experiment and are direct representations of the problem statement. The results in Figure 3.5 show the robot policy when demonstrations were provided on one map, but the agent had to use that information and explore on an unseen map; the left figure of each sub-figure represents the learned rewards, and the right figure shows the grid-world with start state (light blue), goal (dark blue), obstacles or holes (red) and demonstration/policy (green). The robot is finally tested on a different map. Figure 3.8 compares the exploration space between our method and standard Q-Learning with hand-crafted rewards for the 4 × 4 grid.
Figure 3.5: Results for 4 × 4 FrozenLake on train and test maps. (a) Demo 1, (b) Demo 2, (c) robot policy.
Similar results were obtained for the 8 × 8 FrozenLake (see Figure 3.6). A total of 5 demonstrations (4 good and 1 incomplete) were provided on a particular map. The agent then had to explore and learn a policy on 3 different maps using only the rewards from the map on which demonstrations were provided. The obstacles were moved about in each of the test/unseen maps, and we see that the agent was able to successfully learn a policy to reach the goal. Comparisons for the 8 × 8 grid version are shown in Figure 3.9a and Figure 3.9b, where we see that our method is able to narrow down the exploration space under the same hyperparameter settings.
OpenAI Gym Mountain Car (discretized). We used the discretized version of the Mountain Car environment. We first abstracted the continuous observation space into a 50 × 50 grid and generated m = 2 optimal demonstrations using a Q-Learning algorithm with preset hyperparameters (Figure 3.7). We used only one requirement, based on the problem definition:
Figure 3.6: Results for 8 × 8 FrozenLake. (a) Demo 1, (b) Demo 2, (c)-(e) robot policy on test maps 1-3. Left subfigures represent the reward and right subfigures show the environment and policy.
$\varphi := F_{[0,T]}(d_{flag}[t] \leq 0)$, where $d_{flag}$ is the Manhattan distance between the car and the goal flag position at time $t$. The comparison with Q-Learning on hand-crafted rewards is summarized in Figure 3.9c.
Though there is more variance in the average steps involving our method, we observe that the worst-case
average of our algorithm is still better than the best-case average of standard RL. Other grid sizes used for
the experiments were 75 × 75 and 100 × 100 to show the scalability of our approach. In all the OpenAI
Gym environments, we compared our method to standard Q-Learning with hand-crafted rewards, based
on the number of exploration steps performed by the algorithm in each training episode.
Figure 3.7: (a) Demo 1, (b) Demo 2, (c) final robot reward. In (a) and (b), the left figures represent the simulator reward (1 at the goal and 0 elsewhere) while the right figures show the rewards based on the STL specification. (c) Rewards inferred from demonstrations. Note: in all the figures, the axes represent the cell numbers corresponding to the grid size.
Figure 3.8: Statistics indicating the exploration rate of each algorithm as well as the rewards accumulated in each training episode: (a) standard Q-Learning on the 4 × 4 grid, (b) LfD+STL on the 4 × 4 grid.
Figure 3.9: Comparisons of LfD+STL with hand-crafted rewards + Q-Learning for OpenAI Gym environments: (a) standard Q-Learning and (b) LfD+STL pertain to FrozenLake, and (c) pertains to Mountain Car.
Multi-Goal Grid-World. In this setup, we created a grid-world having k = 2 goals in a deterministic environment. The specifications used are as follows:
1. Avoid obstacles at all times (hard requirement): $\varphi_1 := G_{[0,T]}(d_{obs}[t] \geq 1)$, where $d_{obs}$ is the minimum distance of the robot from the obstacles computed at time-step $t$.
2. Eventually, the robot reaches both goal states in any order (soft requirement): $\varphi_2 := F_{[0,T]}(d_{goal_1}[t] < 1) \wedge F_{[0,T]}(d_{goal_2}[t] < 1)$. $\varphi_2$ depends on $\varphi_1$.
3. Reach the goals as fast as possible (soft requirement): $\varphi_3 := F_{[0,T]}(t \leq T_G)$, similar to the single-goal grid-world experiment. $\varphi_3$ depends on both $\varphi_1$ and $\varphi_2$.
Figure 3.10: Results for the 7 × 7 sequential-goal grid-world. (a) Demo 1, (b) Demo 2, (c) learned reward and policy, (d) ground-truth reward function, (e) MCE-IRL reward.
For the 5 × 5 grid, a total of m = 3 demonstrations were provided (2 good and 1 bad), and for the 7 × 7 grid, only m = 2 good but sub-optimal demonstrations were provided, using hyperparameter settings similar to those indicated earlier. The plots in Figure 3.10 show the demonstrations and learned robot policy for the multi-goal 7 × 7 grid-world. The left figure in each sub-figure represents the learned/inferred rewards; the right figure shows the grid-world with start state (light blue), goal (dark blue), obstacles (red) and demonstration/policy (green). There are two goals and the rewards are inferred accordingly. At the next step, the algorithm enumerates all possible policies: (a) start → goal1 → goal2 and (b) start → goal2 → goal1. The final policy is a hybrid of the demonstrations while trying to minimize the time (soft requirement). In this case, it infers the policy start → goal1 (top-right) → goal2 (bottom-right).
3.5.2 Continuous-Space Environments
We used a simple car kinematic model that is governed by the following equations:
Figure 3.11: The car always starts in the top-left corner and the task is to navigate to the goal (in yellow)
while possibly avoiding potholes or obstacles (purple). A sample demonstration is shown by the green
trajectory. A possible ground truth reward function is +10 anywhere in the yellow region and -5 in the
purple region.
$$
\begin{aligned}
\dot{x} &= v \cdot \cos(\theta) + \mathcal{N}(0, \sigma^2); \qquad \dot{y} = v \cdot \sin(\theta) + \mathcal{N}(0, \sigma^2) \\
\dot{v} &= u_1 \cdot u_2; \qquad \dot{\theta} = v \cdot \tan(\psi); \qquad \dot{\psi} = u_3
\end{aligned}
$$
where $x$ and $y$ represent the Euclidean coordinates of the car; $\theta$ is the heading; $v$ is the velocity; $u_1$ is the input acceleration; $u_2$ is the gear indicating forward (+1) or backward (-1); and $u_3$ is the input to the steering angle $\psi$. At any time instant $t$, the state of the car is given by $S_t = [x, y, \theta, v, \dot{x}, \dot{y}, \dot{\theta}, \dot{v}]^T$. Users can control the car using either an analog Logitech G29 steering with pedal controller or via keyboard inputs. Alternatively, one could also use a similar setup for mobile robots using the respective kinematics and a joystick controller for acute turns.
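For reference, a forward-Euler rollout of this model might look as follows; the sketch integrates (x, y, θ, v, ψ), and the time step, noise level and control inputs are illustrative (the full 8-dimensional state used in the experiments additionally stacks the derivatives):

import numpy as np

def step_car(state, u1, u2, u3, dt=0.05, sigma=0.05, rng=np.random.default_rng()):
    # One Euler step of the noisy car model:
    # x' = v cos(theta) + N(0, sigma^2), y' = v sin(theta) + N(0, sigma^2),
    # v' = u1 * u2, theta' = v tan(psi), psi' = u3.
    x, y, theta, v, psi = state
    x += dt * (v * np.cos(theta) + rng.normal(0, sigma))
    y += dt * (v * np.sin(theta) + rng.normal(0, sigma))
    v += dt * (u1 * u2)               # u1: acceleration input, u2: gear (+1 forward, -1 reverse)
    theta += dt * (v * np.tan(psi))
    psi += dt * u3                    # u3: steering-rate input
    return np.array([x, y, theta, v, psi])

# roll out a short open-loop trajectory from the top-left corner (illustrative inputs)
state = np.array([0.0, 25.0, 0.0, 0.0, 0.0])
trajectory = [state]
for _ in range(100):
    state = step_car(state, u1=1.0, u2=1.0, u3=0.0)
    trajectory.append(state)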
The driving layout with goal and obstacle areas, and a sample demonstration is shown in Figure 3.11.
The task is to drive from the top-left corner to the center of the goal while avoiding any hindrances (obstacles, potholes, etc.), denoted by H. As in any driving scenario, the car must maintain a safe distance dSafe
from H and drive on the road/drivable surface. We collected 8 demonstrations (6 good and 2 bad) using a
mixture of analog and keyboard inputs; one of the bad demonstrations passed through the pothole while
another drove off the “road”. The distance metric used in this space is Euclidean. The specifications for
this scenario are as follows:
1. Avoid obstacles at all times (hard requirement): $\varphi_1 := G_{[0,T]}(d_{obs}[t] \geq d_{Safe})$, where $T$ is the length of a demonstration and $d_{obs}$ is the minimum distance of the car from $H$ computed at each step $t$. For our experiments, we used $d_{Safe} = 3$ units.
2. Always stay within the workspace/drivable region (hard requirement): $\varphi_2 := G_{[0,T]}((x, y) \in Box(30, 25))$, where the workspace is defined by a rectangle of dimensions $30 \times 25$ square units.
3. Eventually, the robot reaches the goal state (soft requirement): $\varphi_3 := F_{[0,T]}(d_{goal}[t] < \delta)$, where $d_{goal}$ is the distance between the centers of the car and the goal computed at each step $t$, and $\delta$ is a small tolerance for when the center of the car is "close enough" to the goal's center. $\varphi_3$ depends on $\varphi_1$ and $\varphi_2$ in the DAG.
4. Reach the goal as fast as possible (soft requirement): $\varphi_4 := F_{[0,T]}(t \leq T_G)$, where $T_G$ is an estimate of the timesteps required to reach the goal, based on the average length of the demonstrations.
The collected trajectories, along with their robustness for each STL specification and the time taken to reach the goal, are shown in Figure 3.12. One of the bad demonstrations scraped the avoid-region and is shown in red in Figure 3.12a. All demonstrations reached the goal and are shown in blue in Figure 3.12b. In Figure 3.12c, 4 out of 8 demonstrations were slow to reach the goal compared to the others. One bad demonstration veered "off track" and is shown in red in Figure 3.12d. All these individual robustness values are combined via the specification-based DAG for each trajectory, and rewards are assigned by modeling the states $s$ as samples of a multivariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2 I)$, where $\mu = s$ and $\sigma$ represents the deviations in noise levels. For each $s$, we generated 20 samples to represent the reachable set and assigned stochastic rewards as described earlier in this chapter. The final rewards are shown in Figure 3.12e for low noise with $\sigma = 0.03$. In continuous spaces, we can observe that there is a large area
Figure 3.12: (a)-(d): Robustness of each demonstration w.r.t. the STL specifications, (a) safety, (b) goal-reach, (c) time, (d) staying within the workspace; blue trajectories indicate positive robustness and red indicate negative. (e): Final rewards based on cumulative robustness and demonstration ranking. (f): Reward approximation using neural networks. The yellow-shaded region represents the workspace of the agent, i.e., it is not allowed to leave that region.
of states with reward 0 (in white) which may not be particularly helpful since the agent rarely encounters
the same state as seen in demonstrations due to the noise. To overcome this issue, we approximated the
rewards over the state space using neural network regression. We thus combine the precision of Gaussian
distributions for reward inference and the scalability of neural networks for predictions.
Remark. In all experiments, the reward plots were normalized and the maximum reward was capped to a sufficiently large value $R_{max}$ for practical/numerical implementation and visualization simplicity. For the driving experiment, the state space has higher dimensions, which makes the rewards difficult to visualize. Instead, to show how the neural network regression would perform with smaller-dimensional inputs, we use the XY position of the car along with the type of the XY state as inputs to the network. The type of the state is a one-hot encoding of whether the state represents an obstacle/avoid region, the goal, an outside-workspace region or a traversable region. We assume that a perception algorithm would provide the semantic label of each state.
The neural network contained 2 hidden layers with 100 and 200 nodes, respectively, and used the Adam optimizer [66] with batch training for 20 epochs and an RMSE loss. It was trained using PyTorch on a system with an AMD Ryzen 7 3700X 8-core CPU and an Nvidia RTX 2070-Super GPU. As we see in Figure 3.12f, the predictions are closely correlated with the locations of various map features (boundaries, avoid regions and goal). In all the experimental scenarios, the final rewards can be checked for consistency w.r.t. the specifications to detect hard violations. Additionally, we analyzed the effects of the number of samples $\eta$ and the stochasticity, modeled as the variance $\sigma^2$, on the rewards learned (see Figure 3.13). The hyperparameter ranges are as follows: (i) $\sigma \in [0.03, 0.3]$ and (ii) number of samples $\eta \in [2, 30]$. Further analysis of the effects of hyperparameters on the neural network reward prediction is shown in Figure 3.14.
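A PyTorch sketch matching this setup (two hidden layers of 100 and 200 units, Adam, 20 epochs, an RMSE-style loss); the input dimension of 6 (the XY position plus a 4-way one-hot state type) follows the encoding in the remark above, and the remaining details, such as the batch size and learning rate, are assumptions:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class RewardNet(nn.Module):
    def __init__(self, in_dim=6):          # (x, y) + one-hot state type of size 4
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 100), nn.ReLU(),
            nn.Linear(100, 200), nn.ReLU(),
            nn.Linear(200, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_reward(X, y, epochs=20, batch_size=64, lr=1e-3):
    # Train the reward approximation f_theta on (feature, reward) pairs,
    # where X and y are float tensors of shapes (N, in_dim) and (N,).
    model = RewardNet(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = torch.sqrt(nn.functional.mse_loss(model(xb), yb))  # RMSE-style loss
            loss.backward()
            opt.step()
    return model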
Figure 3.13: Results showing the effects of hyperparameters on cumulative rewards: (a) high σ², high η; (b) high σ², low η; (c) low σ², high η; (d) low σ², low η.
Figure 3.14: Results showing the effects of hyperparameters on predicted rewards: (a) σ² = 0 for any η ∈ N; (b) well-defined rewards; (c) low σ², high η; (d) less accurate rewards.
3.5.3 Discussion and Comparisons
From the discrete-world experiments, we can observe that the reward and policy learned by the robot are consistent with all the STL requirements from the given initial condition, without requiring the user to explicitly specify or design rewards for the robot. Because the algorithm automatically ranks demonstrations, it can be interpreted as preference-based learning, since it prefers to follow a demonstration that has "higher" satisfaction of the specifications. At a general level, the reward inference procedure described in our framework is designed such that it encourages the agent/learner to satisfy the given STL specifications quickly. In the robotics domain, the goal/task happens to be represented by those specifications, and hence the agent learns to achieve them quickly. That is to say, if the STL specifications are not aligned with the goal/task, then the agent will not achieve the goal, but will instead learn to satisfy what the STL formulas represent.
Another observation is that our method uses fewer demonstrations and can learn from sub-optimal
or imperfect demonstrations. One of the major highlights of our work is that we introduce minimal additional hyperparameters, and hence most of the hyperparameter tuning depends on the RL algorithm.
We also compared our method with state-of-the-art Maximum Causal Entropy IRL (MCE-IRL) [122] on
the grid-world and Mountain Car tasks. MCE-IRL and our method have the common objective of inferring rewards from demonstrations and mainly differ in the manner features are utilized: in MCE-IRL,
features are directly used in computing the solution, while ours indirectly accesses the features via specifications. Moreover, in MCE-IRL, the suboptimal demonstrations are assumed to be noisy w.r.t. some
model/distribution and hence cannot account for diversity or preferences of the demonstrators. In the
grid-world environment, the ground truth for a 5 × 5 grid-world is provided in which the goal is at the
top-right corner with reward +2 and the initial state is at the bottom-left. There are 2 states to avoid with
reward 0 and every other state where the agent can traverse has a reward of +1 (Figure 3.15a). The actual
values of the reward are not important since they can be easily interpreted/represented as potential based
reward functions which preserve policy optimality[85]. MCE-IRL requires at least 60 optimal demonstrations to recover an approximate reward, whereas our method can recover a more accurate reward with
just 3 (2 good and 1 bad) demonstrations, as shown in Figure 3.15. Similar results were obtained with other
grid-sizes used in the earlier experiments (see Figure 3.3). For Mountain Car with 50 × 50 discretization,
both MCE-IRL and our method obtained very similar rewards, with the former requiring at least 10 optimal demonstrations, while the latter used just 2 demonstrations. The ground truth for Mountain Car
is provided by the environment itself. Quantitative comparisons are shown in Table 3.1. Note that the
demonstrations provided for MCE-IRL are all nearly-optimal while the demonstrations for our method are
mixed (i.e., some good and some bad/sub-optimal). In addition, MCE-IRL was not able to learn an accurate
reward compared to the ground truth with a limited dataset of demonstrations. We also notice that MCE-IRL does not perform well when there are multiple avoid regions/obstacles scattered over the map (e.g.,
Frozenlake) and in such cases, MCE-IRL requires significantly more demonstrations and suitably-designed
features. On smaller environments, the computation time for inferring rewards is similar for both algorithms. However, as the environment size increases, the computation time and number of demonstrations
increase significantly for MCE-IRL. Lastly, MCE-IRL was not able to recover the reward for the multiple
sequential goals even from 300 demonstrations, whereas our method was able to do so and found a policy
that visited both goals safely and in the shortest time (Figure 3.10). As we see in Figure 3.10, MCE-IRL has
2 problems: (i) it does not learn the reward for obstacles/avoid regions and (ii) it learns only when there are
2 independent terminal states, i.e., it does not consider the history or sequential visitation of goals. Hence
a policy with the MCE-IRL reward and our multi-sequential goal algorithm is forced to visit goal2 and then
goal1, thereby restricting the specification only to this order. MCE-IRL can also learn higher rewards for
states other than the terminal states, which leads to misaligned goals and is undesirable. Hence, for some
cases in experiments, the policy was able to visit only one goal while ignoring the other.
Unlike many existing IRL techniques, our method does not involve solving an MDP during the reward
inference procedure and the rewards inferred using our method are interpretable w.r.t. the specifications.
The complexity of the reward inference procedure is polynomial in the length of the specification and
demonstration [77], and hence it is not affected by the dimensionality of the state space based on empirical
evaluations in the discrete and continuous domains. As shown in experiments, our algorithm can be used
Figure 3.15: Comparing rewards with the ground truth and state-of-the-art MCE-IRL: (a) ground truth, (b) MCE-IRL (50 optimal demos), (c) ours (3 demos).
with multiple demonstrators each of whom may be trying to act according to their preferences for the
same task.
Figure 3.16: (a) Ground-truth (GT) reward function, and rewards extracted by (b) MaxEnt-IRL and (c) MCE-IRL, each using 300 optimal demonstrations, and (d) LfD-STL.
For the stochastic discrete environments, MCE-IRL required around 300 demonstrations in the 5 × 5 grid-world under identical stochastic conditions, and over 1000 demonstrations for the 8 × 8 FrozenLake. Additionally, since MCE-IRL learns a reward for each state, it requires the demonstration set to cover all possible states, while ours does not require this criterion and is hence more sample-efficient. A ground-truth reward function for the 5 × 5 grid is shown in Figure 3.16a. Qualitatively, in Figure 3.16, we can observe that the rewards obtained using our method are more aligned with the ground truth than the others. For instance, the state at the center of the top row is an obstacle and both IRL methods infer positive rewards for that state, while ours correctly computes a negative reward for the same state using fewer samples.
Table 3.1: Quantitative comparisons between MCE-IRL and our method for different environments.

Environment   | #Demos (MCE-IRL) | #Demos (Ours) | Avg. Exec. Time in s (MCE-IRL) | Avg. Exec. Time in s (Ours)
(5 × 5) grid  | 70               | 3             | 3.77                           | 2.62
(7 × 7) grid  | 150              | 5             | 6.81                           | 2.74
FrozenLake-4  | 150              | 4             | 3.96                           | 2.81
FrozenLake-8  | 800              | 5             | 13.18                          | 3.11
Mt. Car       | 10-20            | 2-3           | > 60                           | 2.95
Hence, a key takeaway from this framework is that the inclusion of symbolic AI (formal logic) reasoning
in the reward-inference methods can alleviate several restrictions/assumptions.
An interesting question that arises in our work is: "How should the stochasticity p be chosen?" In the stochastic environment, we make use of the transition probabilities to compute the rewards. But in real-world scenarios, these probabilities are not readily provided and need to be inferred from observations. We suggest a couple of ways to estimate the transition probabilities:
• Interactive/active learning: The simulator or agent could query the user at each state, segment (part of a demonstration) or end of the demonstration to check whether the demonstration was as desired by the user. The transition probability could be approximated by the fraction of states or segments, over the entire demonstration, for which the user provided negative feedback.
• A deterministic agent could try to follow along the demonstration from its initial state to its end and count the number of encountered states that were not seen in the demonstration, i.e., how many times it strayed away. This would model the state-wise transition function.
3.6 Summary
In this chapter, we formulated the problem of inferring reward functions for RL agents from demonstrations and high-level task descriptions expressed via temporal logics. We introduced a framework, LfD-STL, that combines the demonstrations and high-level STL specifications to: (i) quantitatively evaluate and
rank demonstrations, and (ii) infer non-Markovian rewards for a robot such that the computed policy is
able to satisfy all specifications. The proposed reward inference algorithm accounted for the uncertainty
in the environments to extract robust rewards and hence policies. We further extended this procedure to
continuous and high-dimensional spaces by utilizing function approximations such as Gaussian processes
and neural networks, which enabled the framework to even predict rewards for states not observed in the
demonstrations. The proposed framework was evaluated on a wide variety of environments ranging from
discrete to continuous spaces, across deterministic and stochastic dynamics. Through these experiments,
we showed how the LfD-STL framework leverages the neurosymbolic AI mechanisms to overcome the
drawbacks of prior literature by significantly reducing the data and algorithm complexities, and enabling
faster learning. The LfD-STL framework reasons about the quality of demonstrations and is able to efficiently learn even when the demonstrations are imperfect or suboptimal, thereby making it suitable for
real-world applications with non-expert users. The experiments on several stochastic discrete-worlds and
in the driving scenario (continuous domain) illustrate the robustness and scalability of our method. We
believe this work will provide new avenues to combine verification of both rewards and learned policies of agents to develop safe and interpretable control policies for applications in autonomous driving, ground and aerial robot surveillance or patrol, household and medical assistants, etc.
3.7 Bibliographic Notes
In this section, we present several works that tackle the problem of LfD. As described below, LfD has been widely used for 2 broad purposes: (i) to directly extract control policies for a robot and (ii) to learn the specifications for the behaviors represented by demonstrations. The difference between these is that the former uses demonstrations to learn a control policy via imitation learning and/or RL, while the latter deals with learning temporal logic specifications from demonstrations to describe the task/environment, from which control policies can later be synthesized.
Inverse Reinforcement Learning (IRL) Traditional IRL seeks to learn a reward function from optimal
demonstrations and hence, is susceptible to noisy demonstrations. Maximum Entropy IRL (ME-IRL) [123]
and Maximum Causal Entropy IRL (MCE-IRL) [122] are a couple of the state-of-the-art IRL techniques that
aim to find a reward function from nearly-optimal demonstrations, while also disambiguating inference by
maximizing the (causal) entropy of the resulting policy. These approaches are robust to limited and occasional suboptimality in the demonstrations; however, they require large amounts of data. Compared to IRL-based methods, which are analytical solvers, ours is more of a heuristic solution, as discussed in this chapter. Additionally, IRL-based approaches typically require an MDP to be solved (e.g., via value iteration) during each training loop to learn candidate reward functions, while ours does not. Furthermore, using temporal logic specifications, we can express complex tasks involving multiple goals, which cannot be easily encoded or represented in traditional IRL.
LfD from Suboptimal/Imperfect Demonstrations. Methods that learn from suboptimal/imperfect demonstrations are surveyed in [99]. In most cases, these methods filter sub-optimal or imperfect demonstrations, or classify them as outliers when the majority of the other demonstrations are optimal. Recently, the authors of [23] proposed a framework to learn the latent reward function from suboptimal demonstrations by examining the relationship between the performance of a policy and the level of noise injected, in order to synthesize optimally-ranked trajectories. In [108], the authors propose an approach in which nearly-optimal demonstrations are used to learn rewards via IRL that are later utilized for reward shaping in RL tasks. However, at their core, these two methods build on ME- or MCE-IRL and can learn more accurate rewards than vanilla MCE-IRL at the cost of generating additional demonstrations [23] or manually defining an environment reward function in addition to the learned rewards [108]. Another work uses active learning to infer rewards via ME-IRL by combining an automaton with the states of the corresponding MDP [79]; it can learn from suboptimal data, but requires actively querying the user for feedback.
Altogether, the works described in the above sections investigate the performance/efficiency of LfD
algorithms and none of them consider safety aspects (i.e., provide guarantees or reason about safety).
LfD with Temporal Logics There is vast literature on methods that utilize the quantitative semantics
of temporal logics such as STL, Linear Temporal Logic (LTL), etc., to describe or shape reward functions
in the RL domain [7, 73, 12, 61, 5]. In [12], the authors propose to use STL and its quantitative semantics to generate locally shaped reward functions, which consider the robustness of a system trajectory over some finite window of its execution, resulting in a local approximation of the direction of the system trajectory. LfD with formal methods has been explored in [72] to learn complex tasks by designing a special
logic, augmenting a finite-state automaton (FSA) with an MDP formulation and then performing behavioral cloning to initialize policies that are later trained via RL. However, this work relies on optimal/perfect
demonstrations. The authors in [120] propose a counterexample-guided approach using probabilistic computation tree logics for safety-aware apprenticeship learning. They perform logic-checking during training
to achieve verification-in-the-loop and automatically generate a counterexample in case of violation. In
[58], the authors combine LfD with LTL by converting the LTL specifications into a loss function to learn
a dynamic movement primitive that satisfies the specifications and tries to imitate the demonstrations.
This work requires manually defining loss functions based on the quantitative semantics of LTL and it
does not seek to learn a reward function which is the main objective of our work. The authors in [25]
integrate a learning method with model-predictive control to design a controller that behaves similar to
expert demonstrations while trying to decide the trade-offs on how well to follow each STL rule using
slackness in robustness values. They assume that priorities among the STL rules are already given, that the experts are aware of these priorities, and that the experts provide demonstrations accordingly. However, providing such demonstrations requires optimality and skill. LfD with high-level
side information encoded in co-safe LTL has been explored by the authors of [115]. This method learns a
reward function as well as an FSA that is augmented with the MDP states to learn from a handful of optimal
demonstrations. Some issues with this approach are that the rewards, and hence the FSA, rely on the demonstrations being optimal, and that the augmented state space increases exponentially as the number and length of specifications increase. Our framework, on the other hand, achieves the objective without increasing the state space. Another work [71] extracts a stochastic policy that satisfies the temporal logic specifications, which can be used once the rewards are inferred, perhaps using the method developed in this
thesis. Traditional motion-planning with chance constraints has been investigated in [89] to produce plans
with bounded risk. However, this method involves expert-designed cost functions, solving numerous constraints and manually ranking or selecting among various feasible paths. Our work differs in that the
constraints are now replaced by formal specifications and the costs are rewards that are inferred. Based on
these, the demonstrations are automatically ranked and a new robot policy is learned. To summarize, the
majority of the prior research that uses formal logic focuses on verification of the learned policies whereas
we try to infer rewards that are consistent with the high-level task specifications.
The authors of [4] have explored reward hypothesis methods to determine the expressivity of Markov
rewards for RL tasks. Similar to the work presented in this thesis, they have investigated several crucial
concepts in RL such as formalizing task descriptions, systematically creating partial ordering of trajectories/demonstrations and policies, and learning reward functions. In contrast, the LfD-STL framework
offers several critical advantages over the methods of [4] such as: employing non-Markovian rewards
via task-based temporal logics, generalizing to stochastic dynamics and continuous spaces, and having a
significant reduction in sample and computation complexity.
Chapter 4
Learning Performance Graphs from Demonstrations
In human-robot interaction, understanding the behaviors exhibited by humans and robots plays a key role
in robot learning, improving task efficiency, collaboration and mutual trust via explainable AI [84, 67, 91,
103]. As we have seen in the previous chapter, demonstrated behaviors can be evaluated using symbolic
AI such as temporal logics, to infer a cumulative reward function that assigns behaviors a numeric value;
the implicit assumption here is that higher cumulative rewards indicate good behaviors. Such cumulative
rewards are then used with reinforcement learning (RL) to learn an optimal policy. The LfD-STL framework introduced in the previous chapter uses STL specifications to infer rewards from user demonstrations (some of which could be suboptimal or unsafe) and has been shown to outperform inverse reinforcement learning (IRL) methods [122, 108, 41] in terms of the number of demonstrations required and the quality of the rewards learned.
In the LfD-STL framework, STL specifications describe desired objectives, and demonstrations show how to achieve these objectives. Typically, robotic systems are difficult to characterize using a single specification, and users may thus seek policies that satisfy several task specifications. Not all specifications may be equally important; for example, a hard safety constraint is more important than a performance objective, as described in Section 3.2. LfD-STL thus permits users to provide several specifications but, as we saw in the previous chapter, it also requires them to manually specify their preferences or priorities
on specifications. These preferences are then encoded in a directed acyclic graph (DAG), which we will hereafter refer to as the performance graph. That is, a key limitation of the LfD-STL framework is that the onus of providing the performance graph is on the user, which becomes infeasible when there are numerous task specifications. This manual ranking of specifications is dependent on the nature/structure of the specifications, i.e., it is specification-driven rather than data-driven. Moreover, there could exist multiple ways of performing a task, and a challenge in LfD is whom the agent should imitate, i.e., how to disambiguate the demonstrations. We propose a solution to these problems by using the STL specifications and their quantitative semantics not only to evaluate the performance of demonstrations, but also to infer the performance graph from these evaluations. We propose the Performance-Graph Learning (PeGLearn) algorithm, which systematically constructs DAGs from demonstrations and STL evaluations, resulting in a representation that can be used to explain the performance of the provided demonstrations. Based on a given performance graph and the quantitative semantics of each STL specification, the LfD-STL framework discussed in Chapter 3 then defines a state-based reward function that is used with off-the-shelf RL methods to extract a policy.
In complex environments where it is non-trivial to express tasks in STL, we use human annotations (ratings or scores) of the data. Examples of complex tasks in human-robot interaction include descriptions like "tying a knot" or having "fluency in motion" in robotic surgery, where even experts can struggle to express the task in formal logic, or where the task may not be expressible in a convenient logical formalism. In our setting, rating scales can replace temporal logics by (i) choosing queries that assess performance and (ii) treating the ratings/scores as quantitative assessments¹. There is precedent for such quantitative assessments; for example, Likert ratings from humans are used as ground-truth measurements of trust [24].
¹Here we assume that Likert scales are interval scales [88].
Figure 4.1: Overview of the PeGLearn algorithm: each demonstration (Demo 1 through Demo M) yields a local DAG; the local DAGs are merged into an intermediate graph, which is converted into the final DAG and passed to LfD-STL.
4.1 PeGLearn Framework
We propose the Performance-Graph Learning (PeGLearn) algorithm that aims to automatically infer the
DAGs from demonstrations and task specifications, thus eliminating the need for users to define the task
priorities in a DAG. A high-level overview of this data-driven algorithm used in conjunction with the prior LfD-STL framework is shown in Figure 4.1.
We first generalize the task descriptions so that they can also be expressed via other variants of temporal logic, such as LTL, or by natural-language performance queries associated with interval ratings such as Likert scales. We thus define a generic rating function that provides a numeric value for the performance of a demonstration on a specification under consideration.

Definition 4.1.1 (Rating Function). A rating function $R$ is a real-valued function that maps a specification and a time-series trace or trajectory to a real number, i.e., $R : \Phi \times \Xi \to \mathbb{R}$, where $\Phi$ and $\Xi$ are the sets of task specifications and demonstrations, respectively.
Intuitively, the rating function describes how “well” the specifications are met (satisfied) by a trajectory. The rating function can be obtained via the quantitative (robustness) semantics in temporal logics
or human ratings via surveys, annotations, etc. It indicates the score or signed distance of the time-series
data to the set of temporal data satisfying a specification. For a given specification φ and a demonstration
ξ, the rating (also referred to as evaluation or score) of ξ with respect to φ is denoted by ρ = R(φ, ξ). This
ρ is negative if ξ violates φ, and non-negative otherwise.
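In code, a rating function is simply a callable from a (specification, trajectory) pair to a real number, so both sources of ratings fit one interface; the sketch below is illustrative and assumes that the robustness monitor or annotation lookup is supplied elsewhere:

from typing import Callable, Sequence

Trajectory = Sequence[tuple]                              # a demonstration xi
RatingFunction = Callable[[object, Trajectory], float]    # R : Phi x Xi -> IR

def rate_demonstration(R: RatingFunction, specs, xi) -> list:
    # The vector of ratings rho_hat_xi = [R(phi_1, xi), ..., R(phi_n, xi)].
    return [R(phi, xi) for phi in specs]

# R could be an STL robustness monitor (negative iff the specification is violated)
# or a lookup into normalized human annotations such as Likert scores.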
Similar to the problem definition in Chapter 3, to accomplish a set of tasks, we are given: (i) a finite dataset of $m$ demonstrations $\Xi = \{\xi_1, \xi_2, \cdots, \xi_m\}$ in an environment, where each demonstration is defined as in Def. 2.1.2, and (ii) a finite set of $n$ specifications $\Phi = \{\varphi_1, \varphi_2, \cdots, \varphi_n\}$ expressing the high-level tasks, from which a vector of scores $\hat{\rho}_\xi = [R(\varphi_1, \xi), \cdots, R(\varphi_n, \xi)]^T$ is obtained for each demonstration evaluated on each of the $n$ specifications². We can then represent this as an $m \times n$ matrix $B$, where each row $i$ represents a demonstration and each column $j$ represents a specification. An element $\rho_{ij}$ indicates the rating or score of demonstration $i$ for specification $j$, i.e., $\rho_{ij} = R(\varphi_j, \xi_i)$.
$$
B = \begin{bmatrix}
\rho_{11} & \rho_{12} & \cdots & \rho_{1n} \\
\rho_{21} & \rho_{22} & \cdots & \rho_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{m1} & \rho_{m2} & \cdots & \rho_{mn}
\end{bmatrix}
= \begin{bmatrix}
\hat{\rho}_{\xi_1}^T \\
\hat{\rho}_{\xi_2}^T \\
\vdots \\
\hat{\rho}_{\xi_m}^T
\end{bmatrix}
\tag{4.1}
$$
As previously stated in the reward inference mechanism, we need to compute a cumulative score or rating $r_\xi$ for each demonstration to collectively represent its individual specification scores, and so we have an $m \times 1$ vector $r = [r_{\xi_1}, r_{\xi_2}, \cdots, r_{\xi_m}]^T$. To obtain the cumulative scores, we also have a scalar quantity or weight associated with each specification, $w(\varphi)$, resulting in a weight vector $w_\Phi = [w(\varphi_1), w(\varphi_2), \cdots, w(\varphi_n)]^T$, from which we can obtain the cumulative scores as $B \cdot w_\Phi = r$. In other words, for each demonstration $\xi$, $r_\xi = w_\Phi^T \cdot \hat{\rho}_\xi$. The objective is to compute both $w_\Phi$ and $r$, given only
²Note that the evaluations of the task specifications must be normalized, as indicated in Chapter 3.
B, such that the “better” demonstrations have higher cumulative scores than others and are ranked appropriately by the LfD-STL framework, i.e., a partial order is generated; this is an unsupervised learning
problem.
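Given the rating matrix B and a weight vector w_Φ (for instance, weights derived from the inferred performance graph), the cumulative scores and the induced ranking follow directly; the numbers below are a toy illustration:

import numpy as np

def cumulative_scores(B, w_phi):
    # r = B @ w_phi: one cumulative score per demonstration (one per row of B).
    return B @ w_phi

# toy example: 3 demonstrations rated on 2 specifications
B = np.array([[0.9, 0.4],
              [0.7, 0.9],
              [-0.2, 0.6]])
w_phi = np.array([2.0, 1.0])            # e.g., the hard spec weighted above the soft one
r = cumulative_scores(B, w_phi)         # [2.2, 2.3, 0.2]
ranking = np.argsort(-r)                # best demonstration first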
This partial order of specifications is now determined by the structure of the specification-ranked
DAG that contains the elements of Φ as its vertices, and the relative differences in performance between
specifications as edges. The intuition is that an edge is directed from a higher priority node to a lower
one as observed by the demonstrations. We refer to this as the performance-graph since it captures the
performance of the demonstrations w.r.t. the task specifications. This final graph is required to be acyclic so
that topological sorting can be performed on the graph to obtain an ordering of the nodes via Equation 3.1
and hence specifications, i.e., topological ordering does not apply when there are cycles in the graph.
An assumption we make is that at least one demonstration satisfies all the specifications in $\Phi$, but it does not have to be optimal (i.e., have the highest rating) w.r.t. those specifications; we argue that this is a reasonable assumption compared to related LfD works such as IRL, which require a large sample of nearly-optimal (i.e., close to the highest rating) demonstrations, as discussed in the previous chapter. This assumption is also required to show that the task(s) can be realized, even if suboptimally, under the given specifications. The last assumption is that the demonstrators' intentions are accurately reflected (in terms of performance) in the demonstrations provided; therefore, we consider each demonstration equally important when inferring graphs.
4.1.1 PeGLearn Algorithm Overview
In this section, we describe the procedure to create the performance graph from ratings or scores obtained
either automatically by temporal logics or provided by human annotators. This process involves 3 main
steps:
1. Constructing a local weighted-DAG for each demonstration based on its individual specification
scores.
2. Combining the local graphs into a single weighted directed graph, which is not necessarily acyclic
as it can contain bidirectional edges between nodes.
3. Converting the resultant graph into a weighted DAG.
The framework in Figure 4.1 depicts the 3 steps described above and the final stage where the inferred
DAG is fed to the LfD-STL framework to learn rewards and perform RL.
4.1.2 Generating local graphs
Each demonstration $\xi \in \Xi$ is associated with a vector of ratings $\hat{\rho}_\xi = [R(\varphi_1, \xi), \cdots, R(\varphi_n, \xi)]^T$, and the
objective is to construct a weighted DAG for ξ from these evaluations. We propose Algorithm 2, where,
initially, the evaluations are sorted in non-increasing order with ties broken arbitrarily (lines 3–5). This
creates a partial ordering based on the performance of the demonstrations regarding each specification,
and is represented by a DAG to capture the partial ordering. Though DAGs can be represented by either adjacency lists or adjacency matrices, in this work, we represent them using adjacency matrices for
notational convenience.
Consider 4 specifications φi; i ∈ {1, 2, 3, 4}. Let a demonstration ξ ∈ Ξ have evaluations ρ̂ξ = [ρ1, ρ2, ρ3, ρ4], where each ρi = R(φi, ξ) for notational convenience, and, without loss of generality, let them already be sorted in non-increasing order, i.e., ρi ≥ ρj, ∀i < j. This sorting is performed in the first for loop of Algorithm 2. Recall that each node of the DAG represents a specification of Φ, i.e., a node contains the index of the specification it represents. An edge between two nodes φi and φj is created when the difference between their corresponding evaluations is greater than a small threshold value (lines 6–14). This edge represents the relative rating or performance difference between the specifications and
Algorithm 2: Algorithm to compute the local DAG for a single demonstration.
Input: ξ := a demonstration of any length L; Φ := set of n specifications; ϵ := threshold (tunable)
Result: Constructs the local Performance-Graph Gξ
1   begin
2       Gξ ← 0_{n×n}                                  // zero matrix
3       Q ← ∅                                         // create an empty queue
4       for j = 1 to n do
5           Obtain the rating or score sj for specification j; Q.insert(⟨j, sj⟩)
            // The collected entries form an n × 2 table where each row is ⟨index, score⟩
6       S′ ← sort Q in non-increasing order of scores // original indices are recorded
7       for k = 1 to n − 1 do                         // no self-loops
8           φ ← S′[k, 1]                              // get index
9           v ← S′[k, 2]                              // get score
10          for j = k + 1 to n do
11              φ′ ← S′[j, 1]
12              v′ ← S′[j, 2]
13              if (v − v′) ≥ ϵ then
14                  Gξ[φ, φ′] ← Gξ[φ, φ′] + (v − v′)
15      return Gξ
creates a partial order indicating this difference. The threshold ϵ acts as a high-pass filter and can be tuned depending on the normalization of the ratings. The intuition is that demonstrations having similar states or features will have similar evaluations for the specifications and should produce the same partial ordering of specifications. That is, an edge is created only if the evaluations differ substantially; e.g., two specifications producing ratings of, say, 1.0 and 0.99 are numerically different but have similar performance, so they should be ranked equally (i.e., have no edge between them). Without this filter, an edge with a very small weight would be created between them, thereby inadvertently distinguishing similar performances. Formally, the edge e(φi, φj) is added when δij = (ρ(φi) − ρ(φj)) ≥ ϵ (the demonstration argument ξ in ρ is dropped for convenience since we consider only one demonstration at a time). We repeat this process for each node in the DAG (Figure 4.2), and the resultant DAG will have at most n(n − 1)/2 edges, where n = |Φ|. The local graph is acyclic because the nodes are sorted by their respective evaluations in non-increasing order; hence, edges with negative weights are never added, which eliminates any bidirectional edges. The DAG for a demonstration imposes a partial order over all specifications.
Figure 4.2: Example local graph for a demonstration. (a) Step 1; (b) Step 2; (c) Step 3.
For any 2 specifications φi and φj, φi ⪰ φj if ρ(φi) ≥ ρ(φj), and so an edge is created from φi to φj with weight ρ(φi) − ρ(φj), subject to the threshold ϵ.
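To make the construction concrete, the following is a minimal Python/NumPy sketch of Algorithm 2; the function name build_local_dag, the default threshold value, and the representation of ratings as a plain array are illustrative assumptions rather than part of the thesis implementation.

    import numpy as np

    def build_local_dag(ratings, eps=0.05):
        """Sketch of Algorithm 2: local performance DAG for one demonstration.

        ratings : 1-D array where ratings[i] is the (normalized) robustness of
                  specification phi_i for this demonstration.
        eps     : high-pass threshold; differences below eps create no edge.
        Returns an n x n adjacency matrix G, where G[i, j] > 0 means phi_i
        outperformed phi_j by at least eps.
        """
        ratings = np.asarray(ratings, dtype=float)
        n = ratings.size
        G = np.zeros((n, n))
        order = np.argsort(-ratings)            # non-increasing order of ratings
        for a in range(n - 1):                  # no self-loops
            i, vi = order[a], ratings[order[a]]
            for b in range(a + 1, n):
                j, vj = order[b], ratings[order[b]]
                if vi - vj >= eps:              # add an edge only for significant gaps
                    G[i, j] += vi - vj
        return G

    # Example: ratings 1.0 and 0.99 produce no edge, as discussed above.
    print(build_local_dag([1.0, 0.99, 0.5, 0.2]))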
Complexity Analysis: In general, given n specifications and a set of algebraic operators (e.g., op = {>, =}), the number of different orderings is n! · [|op|^(n−1) − 1] + 1. In our case, |op| = 2, since the operator < in an ordering is equivalent to a permutation of the ordering using >, i.e., a < b ≡ b > a. By making use of directed graphs, we can eliminate the factorial component, as discussed below, but this still results in an exponential-time search algorithm. To overcome this, our algorithm eliminates cycles by building a DAG for each of the m demonstrations. Depending on the data structure used, the complexity of building a DAG is linear when using adjacency lists and quadratic when using an adjacency matrix to represent the graph. The total complexity is thus O(mn²) in the worst case (using the matrix representation).
Space of all directed graphs: In regard to the number of different orderings when extracting local graphs, given n specifications, the number of permutations or arrangements is n!. For each permutation, there is one operator from op = {>, =} that can be placed between any two specifications (e.g., a > b). The number of such “places” is n − 1, and hence the number of operator arrangements for each permutation is 2^(n−1), or |op|^(n−1) in general. However, one of the arrangements for each permutation consists of the ‘=’ operator appearing in all the “places”. For example, a = b = c is the same as the permutation b = a = c, and so on. Hence, all n! permutations share this common/redundant ordering, and we need to remove all but one of them. Thus, the total number of unique orderings over all permutations is n! · |op|^(n−1) − n! + 1 = n! · [|op|^(n−1) − 1] + 1.

Following the use of directed graphs to reduce this search space, we first derive the number of possible directed graphs. For a directed graph without self-loops, there are 3 possible edge categories between any two nodes: no edge, incoming (outgoing from the other node), and outgoing (incoming to the other node). In the worst case, the maximum number of edges in a DAG is n(n − 1)/2, and so the total number of possible directed graphs is 3^(n(n−1)/2). This count includes directed graphs containing cycles, so we then compute and subtract the number of cycles to obtain the actual space of DAGs. For a directed graph with n vertices, a cycle comprises at least 3 vertices because we allow only one edge between any two nodes. So, adding up the number of cycles (where C(n, k) denotes the binomial coefficient), we get:

    Σ_{k=3}^{n} C(n, k) = Σ_{k=0}^{n} C(n, k) − Σ_{k=0}^{2} C(n, k) = 2^n − [n!/(0!(n−0)!) + n!/(1!(n−1)!) + n!/(2!(n−2)!)] = 2^n − [1 + n + n(n−1)/2]

We can then reverse the edges and obtain another 2^n − [1 + n + n(n−1)/2] cycles; therefore, the total number of cycles is twice this number, i.e., 2^(n+1) − (n² + n + 2). Finally, the number of valid directed graphs is 3^(n(n−1)/2) − [2^(n+1) − (n² + n + 2)] = 3^(n(n−1)/2) − 2^(n+1) + n² + n + 2, which is still exponential but has eliminated the factorial component of the search space.
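As a quick sanity check of these counts, the short Python snippet below (illustrative only, not part of the thesis artifacts) evaluates both closed-form expressions for small n.

    from math import factorial

    def num_unique_orderings(n, num_ops=2):
        # n! * (|op|^(n-1) - 1) + 1 unique orderings over the operators {>, =}
        return factorial(n) * (num_ops ** (n - 1) - 1) + 1

    def num_valid_digraphs(n):
        # 3^(n(n-1)/2) - 2^(n+1) + n^2 + n + 2 directed graphs after removing cycles
        return 3 ** (n * (n - 1) // 2) - 2 ** (n + 1) + n ** 2 + n + 2

    for n in (3, 4, 5):
        print(n, num_unique_orderings(n), num_valid_digraphs(n))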
4.1.3 Aggregation of local graphs
Once the local graphs for each demonstration have been generated, they need to be combined into a
single DAG to be used directly in the LfD-STL framework from Chapter 3. We now propose Algorithm 3
to aggregate all local graphs into a single DAG. Line 2 generates the local graphs via Algorithm 2 and stores
them in a dataset G. For every directed edge between any pairs of vertices u and v, the mean of the weights
on corresponding edges across all graphs in G is computed (line 3 of Algorithm 3). For example, consider
Figure 4.3: Example global graph from 2 demonstrations. (a) Demo 1; (b) Demo 2; (c) Interim Graph; (d) Final DAG.
For example, consider the local graphs of 2 sample demonstrations shown in Figure 4.3. By averaging the edge weights of the graphs of the 2 demonstrations, we get the intermediate weighted directed graph shown in Figure 4.3c. This graph is not necessarily acyclic, since there is a cycle between the nodes of φ1 and φ2. In this figure, each w′_ij = (w¹_ij + w²_ij)/2. This intermediate graph needs to be further reduced to a weighted DAG, i.e., by eliminating any cycles/loops.
Algorithm 3: PeGLearn: Generating the global DAG from all demonstrations.
Input: D := set of m demonstrations; Φ := set of n specifications; ϵ := threshold (tunable)
Result: Constructs the global Performance-Graph
1   begin
2       G ← ⋃_{i=1}^{m} Gi                    // local graphs via Algorithm 2
3       G′ ← (1/|G|) Σ_{G∈G} G                // edge-wise mean
        // Extract the edge-weighted DAG from the raw Performance-Graph
4       G ← 0_{n×n}                           // zero matrix
5       for i = 1 to n do
6           for j = 1 to n do
7               G[i, j] ← max(0, G′[i, j] − G′[j, i])
8               if G[i, j] < ϵ then G[i, j] ← 0
9       return G
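The aggregation and DAG-extraction steps (lines 2–8 above; see also Section 4.1.4 below) can be sketched in a few lines of NumPy. The function name peg_learn_global and the stacking of local graphs into a single array are illustrative assumptions.

    import numpy as np

    def peg_learn_global(local_graphs, eps=0.05):
        """Sketch of Algorithm 3: aggregate local DAGs into one global DAG.

        local_graphs : iterable of n x n adjacency matrices (e.g., from Algorithm 2).
        Returns the edge-weighted global DAG as an n x n matrix.
        """
        stack = np.stack([np.asarray(g, dtype=float) for g in local_graphs])
        g_mean = stack.mean(axis=0)       # edge-wise mean over demonstrations (line 3)
        net = g_mean - g_mean.T           # G'[i, j] - G'[j, i] for every pair (line 7)
        G = np.maximum(0.0, net)          # keep only the dominant direction
        G[G < eps] = 0.0                  # high-pass filter removes near-ties (line 8)
        return G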
4.1.4 Conversion/Reduction to weighted DAG
Note that there can be at most 2 edges between any pair of vertices, since the outgoing (and, similarly, incoming) edges are averaged into a single edge. In order to reduce this graph to a global DAG, we systematically eliminate edges by first computing the difference between the outgoing and incoming edge weights and then checking whether this difference is above a certain threshold to add an edge in the direction of the positive difference (if the difference is negative, the edge is simply reversed). In other words, for any 2 nodes u and v, if (w(u, v) − w(v, u)) ≥ ϵ, then e(u, v) is retained with the new weight w(u, v) − w(v, u), while e(v, u) is removed or discarded since it gets absorbed by the retained edge. The threshold ϵ again acts as a high-pass filter. As we can observe in the case of bidirectional edges, one of the edges will be “consumed” by the other, or both will be discarded if they are similar. This conversion procedure is shown by lines 5–8 in Algorithm 3. Thus, all cycles/loops are eliminated, resulting in a weighted DAG that can be directly used to rank the demonstrations and compute rewards for RL tasks as performed in the LfD-STL framework. To show that our DAG-learning method indeed preserves the performance ranking over demonstrations, we first define a partial ordering over demonstrations: for any 2 demonstrations ξ1 and ξ2, the partial order ξ1 ⪯ ξ2 holds when ρ1i ≤ ρ2i, ∀i ∈ {1, · · · , n}; thus, we say that ξ2 is better than or at least as good as ξ1. Then, by making use of Lemma 4.1.1, we arrive at Theorem 4.1.1, which addresses the problem definition.
Lemma 4.1.1. For a DAG, the weights associated with the nodes computed via Equation 3.1 are non-negative.

Proof Sketch. From the LfD-STL framework, the weights for the specifications represented by the DAG nodes are given by Equation 3.1. We know that |Φ| = n and that ancestor(φ) is a set whose cardinality is non-negative. In a DAG, there are no cycles and hence |ancestor(φ)| is an integer in [0, n − 1]. By this equation, the minimum (i.e., worst-case) weight for any node representing a specification φ occurs when that node is a leaf and all other n − 1 nodes are its ancestors. Therefore, w(φ) = |Φ| − |ancestor(φ)| = n − (n − 1) = 1 ≥ 0. Similarly, the maximum value of w(φ) is n, i.e., when φ is one of the root nodes of the DAG and has no ancestors. This non-negativity of the weights also holds when the weights are normalized via a softmax function, since the softmax represents a statistical distribution whose values lie in the interval [0, 1].
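For illustration, the following is a small Python sketch of this weight computation; it assumes Equation 3.1 takes the form w(φ) = |Φ| − |ancestor(φ)| with optional softmax normalization (as in the proof sketch above) and that the global DAG is given as an adjacency matrix, and the function name dag_node_weights is hypothetical.

    import numpy as np

    def dag_node_weights(G, normalize=True):
        """w(phi) = |Phi| - |ancestor(phi)| for each node of a DAG G."""
        A = (np.asarray(G) > 0).astype(int)
        n = A.shape[0]
        reach = A.copy()                       # reach[i, j] = 1 if there is a path i -> j
        for _ in range(n):
            reach = ((reach + reach @ A) > 0).astype(int)
        num_ancestors = reach.sum(axis=0)      # column j counts the ancestors of node j
        w = n - num_ancestors.astype(float)
        if normalize:
            w = np.exp(w) / np.exp(w).sum()    # softmax keeps the weights in [0, 1]
        return w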
Theorem 4.1.1. For any two demonstrations ξ1 and ξ2 in an environment, the partial ordering ξ1 ⪯ ξ2 is preserved by PeGLearn.

Proof. Recall that for any two demonstrations ξ1 and ξ2 in an environment, if ξ1 ⪯ ξ2, then the cumulative ratings/scores are such that rξ1 ≤ rξ2. Also recall the notation ρ̂ξi = [ρi1, · · · , ρin]ᵀ. Let there be n specifications for the environment; then, for these two demonstrations, we have

    B = [ρ̂ξ1, ρ̂ξ2]ᵀ, i.e., the 2 × n matrix with rows [ρ11, ρ12, · · · , ρ1n] and [ρ21, ρ22, · · · , ρ2n], and wΦ = [w1, w2, ..., wn]ᵀ.

W.l.o.g., let ξ2 be at least as good as ξ1, i.e., ρ1j ≤ ρ2j, ∀j ∈ {1, · · · , n}. The cumulative scores for the demonstrations are rξi = wΦᵀ · ρ̂ξi, where i ∈ {1, 2}. For any constant wj ≥ 0,

    wj · ρ1j ≤ wj · ρ2j  ⟹  Σ_{j=1}^{n} wj · ρ1j ≤ Σ_{j=1}^{n} wj · ρ2j  ⟹  wΦᵀ · ρ̂ξ1 ≤ wΦᵀ · ρ̂ξ2  ⟹  rξ1 ≤ rξ2

This holds iff wj ≥ 0, ∀wj ∈ wΦ.
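For a concrete (hypothetical) instance with n = 2: if ρ̂ξ1 = [0.2, 0.5]ᵀ, ρ̂ξ2 = [0.4, 0.6]ᵀ and wΦ = [0.7, 0.3]ᵀ, then rξ1 = 0.7 · 0.2 + 0.3 · 0.5 = 0.29 and rξ2 = 0.7 · 0.4 + 0.3 · 0.6 = 0.46, so the ordering ξ1 ⪯ ξ2 is indeed reflected in rξ1 ≤ rξ2.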
Once the global DAG is learned, the weights for the specifications (nodes) are computed via Equation 3.1. From Lemma 4.1.1, we have shown that these weights are all non-negative. Since the LfD-STL framework ranks the demonstrations by their cumulative scores, this guarantees that better demonstrations are always ranked higher than the others, i.e., a partial order is created; this also provides justification for the use of DAGs.
The global DAG imposes a partial order over the specifications. For any 2 specifications φi and φj, the partial order φi ⪰ φj is defined when ρ̄(φi) ≥ ρ̄(φj), where ρ̄(φ) is the mean of ρ(φ) across all demonstrations. The global graph thus takes a holistic approach to explaining the overall performance of the demonstrations, and could provide an intuitive representation that helps non-expert users teach agents tasks and understand the policies the agent is learning.
4.2 Experiments
We now describe the experiments on which PeGLearn was evaluated. Firstly, to compare PeGLearn with
user-defined specification DAG, we use the same discrete-world and 2D-driving simulators from Section 3.5, under the same task STL specifications. The STL formulas in our discrete-world and 2D driving
experiments were specified and evaluated using Breach [32]. We then evaluate PeGLearn on a simulation
of a widely-used mobile industrial robot - MiR100 - for a reach-avoid navigation task. Finally, to show that
the specification rankings from PeGLearn are interpretable and similar to human rankings, we conduct a
user study in the widely popular CARLA driving simulator. The specifications for the MiR100 and CARLA
experiments were evaluated using the RTAMT library [87]. Note that the complexity of evaluating a trajectory w.r.t. a temporal logic specification is polynomial in the lengths of the signal and the specification [77]; however, tools such as Breach and RTAMT are capable of evaluating specifications with linear-time complexity. In the 2D driving simulator experiment, we used the same neural network architecture as in Section 3.5, trained using PyTorch. All experiments were performed on a desktop machine with an AMD Ryzen 7 3700X 8-core CPU and an Nvidia RTX 2070-Super GPU.
Benchmarking PeGLearn and Baselines For comparison with user-defined DAG in the LfD-STL baseline, we evaluated our method on the same discrete-world and 2D autonomous driving domains using the
same demonstrations (m = 8) and specifications.
In the discrete environments, PeGLearn was evaluated against (i) user-defined DAGs, (ii) MaxEntropy
IRL, and (iii) MaxCausalEntropy IRL. Once rewards were extracted from each algorithm for all environment
settings, we used Double Q-Learning [51], as it is suited for stochastic settings, with the modifications
to the algorithm at 2 steps (reward update and termination) as described in Section 3.4. The number of
episodes varied based on environment complexity, such as grid size and the number and locations of obstacles. The discount factor γ was set to 0.8, and an ϵ-greedy strategy with decaying ϵ was used for action selection. A learning rate of α = 0.1 was found to work reasonably well after analyzing the hyperparameters. Our evaluations over 100 trials showed that policies independently learned from PeGLearn and from manually-defined DAGs achieved task success rates of 80% and 81%, respectively, for the environments with 0.2 stochasticity. The execution time of PeGLearn was within 2 seconds of that of LfD-STL with manually-specified DAGs.
For the 2D-driving experiment, we used the same kinematic model for a car as in Section 3.5.2, described by the following equations:

    ẋ = v · cos(θ) + N(0, σ²);   ẏ = v · sin(θ) + N(0, σ²);   v̇ = u1 · u2;   θ̇ = v · tan(ψ);   ψ̇ = u3

where x and y represent the XY position of the car; θ is the heading; v is the velocity; u1 is the input acceleration; u2 is the gear indicating forward (+1) or backward (−1); and u3 is the input to the steering angle ψ. The state of the car at time t is given by St = [x, y, θ, v, ẋ, ẏ, θ̇, v̇]ᵀ. This resembles a real-world scenario: one of the challenging problems in autonomous driving is overtaking moving or stationary/parked vehicles on road-sides (e.g., in urban and residential driving). The scenario presented here is a high-level abstraction in which the purple square is a parked car and the yellow square is the goal state of the ego car after overtaking the parked car. The light-yellow shaded region marks the dimensions of the road/lane, and the task for the ego car is to navigate around the parked car to the goal state without exiting the lane.
Figure 4.4: Demonstrations collected for the 2D car simulator.
Demonstrations are provided by users via an analog Logitech G29 steering wheel and pedal controller or via keyboard inputs. For comparison with prior work, we utilized the same 8 (6 good and 2 bad) demonstrations recorded earlier (Figure 4.4). The distance metric used in this space is Euclidean, and the specifications for this scenario are as follows:
1. Avoid obstacles at all times (hard requirement): φ1 := G[0,T](dobs[t] ≥ dSafe), where T is the length of a demonstration and dobs is the minimum distance of the car from the hindrance region H computed at each step t. For our experiments, we used dSafe = 3 units.

2. Always stay within the workspace/drivable region (hard requirement): φ2 := G[0,T]((x, y) ∈ Box(30, 25)), where the workspace is defined by a rectangle of dimensions 30 × 25 square units. Box is an indicator for the real-valued data in the OpenAI Gym library.

3. Eventually, the robot reaches the goal state (soft requirement): φ3 := F[0,T](dgoal[t] < δ), where dgoal is the distance between the centers of the car and the goal computed at each step t, and δ is a small tolerance for when the center of the car is “close enough” to the goal’s center. φ3 depends on φ1 and φ2 in the DAG.

4. Reach the goal as fast as possible (soft requirement): φ4 := F[0,T](t ≤ TG), where TG is an estimate of the timesteps to reach the goal based on the mean length of the demonstrations. (A sketch of how such robustness values can be computed follows this list.)
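As a concrete illustration of the specifications above, the following Python/NumPy sketch computes quantitative robustness values using the standard STL semantics (G as a minimum over time, F as a maximum over time). It is an assumed stand-in for the Breach/RTAMT monitors actually used in the experiments, and the signal values and the tolerance δ = 0.5 are hypothetical.

    import numpy as np

    def rob_always(margins):
        # G_[0,T](margin >= 0): worst-case margin over the trajectory
        return float(np.min(margins))

    def rob_eventually(margins):
        # F_[0,T](margin >= 0): best-case margin over the trajectory
        return float(np.max(margins))

    # Hypothetical per-timestep signals for one demonstration.
    d_obs = np.array([5.2, 4.1, 3.6, 3.9])     # distance to the hindrance region H
    d_goal = np.array([9.0, 6.5, 2.1, 0.4])    # distance to the goal center

    rho_phi1 = rob_always(d_obs - 3.0)         # phi1: G(d_obs >= d_safe), d_safe = 3
    rho_phi3 = rob_eventually(0.5 - d_goal)    # phi3: F(d_goal < delta), delta = 0.5
    print(rho_phi1, rho_phi3)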
Figure 4.5: Results for the 2-D autonomous driving simulator: (a) learned DAG; (b) driving layout; (c) baseline rewards; (d) PeGLearn rewards. Baseline rewards are from the user-defined DAG.
The rewards are assigned to states by modeling each state s as a sample of a multivariate Gaussian distribution N(µ, σ²I), where µ = s and σ represents the deviations in noise levels, which can be tuned. Here, we use σ = 0.03. For each s, we generated k = 20 samples to represent the reachable set and assigned stochastic rewards as described in Section 3.3. The neural network used for regressing the rewards consisted of 2 layers with 200 neurons in each layer, activated by ReLU. The reward inference for both PeGLearn and the manual-DAG baseline had execution times of less than 30 seconds.
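A minimal sketch of this sampling step, assuming states are NumPy vectors (the state values below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    s = np.array([12.0, 4.5])          # an observed state (e.g., an XY position)
    sigma = 0.03
    # k = 20 samples around s approximate its reachable set under noise.
    reachable = rng.multivariate_normal(mean=s, cov=(sigma ** 2) * np.eye(s.size), size=20)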
We show the results for the 2D driving scenario in Figure 4.5. From these figures, we can observe that the rewards using the inferred DAG are consistent with the specifications, i.e., rewards are aligned with the entity locations. In the discrete-world settings, we were able to learn similar graphs and rewards, and hence the same policies, as the baseline. This shows that our proposed method is a significant improvement over the prior work, since it eliminates the user’s burden of defining graphs while also using at least 4 times fewer demonstrations than IRL-based methods [123, 122].
Figure 4.6: Results for the MiR100 navigation environment: (a) environment setup; (b) learned DAG; (c) expert rewards; (d) PeGLearn rewards. In (a), obstacles and boundary walls are shown as red point clouds; the green sphere is the goal.
Industrial Mobile Robot Navigation In this setup, we consider a high-fidelity simulator [74] based on the MiR100 mobile robot, which is widely used in today’s industries. In this environment, the robot is tasked with navigating to a goal location while avoiding 3 obstacles (Figure 4.6). The locations of the robot, the goal and the 3 obstacles are randomly initialized for every episode. This presents a major challenge for LfD algorithms, since the demonstrations collected in this environment are unique to a particular configuration of the entity locations (i.e., no two demonstrations are the same). The 20-dimensional state-space of the robot, indexed by timestep t, consists of the position of the goal in the robot’s frame in polar coordinates (radial distance pt and orientation θt), linear (vt) and angular (wt) velocities, and 16 readings from the robot’s Lidar for detecting obstacles (x^i_t, i ∈ [1, 16]). We obtained 30 demonstrations by training an RL agent on an expert reward function and recording trajectories at different training intervals; 15 of these were incomplete (i.e., collided with an obstacle or failed to reach the goal in time). The specifications governing this environment are:
1. Eventually reach the goal: φg := F(pt ≤ δ), where δ is a small threshold set by the environment to determine if the robot is sufficiently close to the goal to terminate the episode.

2. Always maintain linear and angular velocity limits: φv := G(vmin ≤ vt ≤ vmax) and φw := G(wmin ≤ wt ≤ wmax). The limits conform to the robot’s capabilities.

3. Always avoid obstacles: φs := G(⋀_{i=1}^{16} (x^i_t > 0)), where x^i_t is the distance from an obstacle as measured by Lidar i.

4. Reach the goal in a reasonable time: φt := F(Tmin ≤ t ≤ Tmax), where the time limits are obtained from the average lengths of the good demonstrations observed in this environment. (A robustness sketch for the conjunction in φs follows this list.)
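Under the usual quantitative semantics, a conjunction is scored by the minimum over its conjuncts, so φs can be monitored as in the brief sketch below (only 3 Lidar beams and 3 timesteps of hypothetical readings are shown).

    import numpy as np

    # lidar[t, i]: reading of Lidar i at timestep t (hypothetical values).
    lidar = np.array([[1.2, 0.9, 1.5],
                      [0.8, 0.7, 1.1],
                      [1.0, 0.6, 1.3]])

    # phi_s = G( AND_i (x_i > 0) ): min over beams (conjunction), then min over time (G).
    rho_phi_s = float(np.min(np.min(lidar, axis=1)))
    print(rho_phi_s)    # 0.6 here; a positive value means phi_s holds with margin 0.6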
Once the graph is extracted via PeGLearn, the rewards are propagated to the observed states and
modeled with a neural network as described in Section 3.3. We used a 2-hidden layer neural network
with 512 nodes in each layer for reward approximation. The DAG and rewards inferred are shown in
Figure 4.6. We can observe that the rewards are semantically consistent with the specifications, in that rewards increase radially as the robot moves towards the goal (center of the figure). To compare the policies learned with our method against expert-designed dense rewards, we independently trained two RL agents via (i) PPO (on-policy, stochastic actions) [105] and (ii) D4PG (off-policy, deterministic actions) [13] on each of the reward functions and evaluated the policies over 100 trials.

Despite being presented with only 50% successful demonstrations, our method achieved a success rate of (i) 79% compared to 81% for expert rewards when using PPO, a 29% improvement over the demonstrations, and (ii) 90% compared to 93% for expert rewards when using D4PG, a 40% improvement over the demonstrations. This indicates that our method is capable of producing expert-like behaviors. Furthermore, this particular environment [74] has been shown to be readily transferable to real robots without any modifications (i.e., the sim2real transfer gap is almost nil), which also indicates the real-world applicability of our framework.
Given the 30 demonstrations, PeGLearn was able to extract the rewards within 30 seconds. The expert rewards used for this task comprised the following components: (i) the Euclidean distance between the robot and the goal, (ii) the power used by the robot motors for linear and angular velocities, (iii) the distance to obstacles to detect collisions, and (iv) an optional penalty if the robot exited the environment boundary walls. The PPO agent that was trained separately on the expert and PeGLearn rewards used the default architecture and hyperparameter settings from [98]. Both training sessions were run for 3 · 10⁶ steps, with each session lasting about 30 hours on our hardware. Likewise, the D4PG agent was trained independently on each reward function under similar training conditions (hyperparameters shown in Table 4.1), with each training session lasting about 12 hours. Each of these RL agents was then evaluated on 100 test runs (trials) to compute the success rates.
CARLA Driving Simulator We evaluated our method on a realistic driving simulator, CARLA [35],
on highway and urban driving scenarios. The demonstrations for this experiment used the same analog
Table 4.1: D4PG hyperparameters for the MiR100 safe mobile robot navigation environment.

Hyperparameter           Value
Actors                   5
Learners                 1
Actor MLP                256 → 256
Critic MLP               256 → 256
N-Step                   5
Atoms                    51
Vmin                     -10
Vmax                     10
Exploration Noise        0.3
Discount Factor          0.99
Mini-batch Size          256
Actor Learning Rate      5 · 10⁻⁴
Critic Learning Rate     5 · 10⁻⁴
Memory Size              10⁶
Learning Batch           64
hardware as for the 2D car simulator (Figure 4.7). The states of the car provided by the environment are: the lateral distance and heading error between the ego vehicle and the target lane center line (in meters and radians), the ego vehicle’s speed (in meters per second), and a (Boolean) indicator of whether there is a front vehicle within a safety margin. Based on this information, we formulated 3 STL specifications, similar to those in the 2D driving and mobile navigation experiments:
1. Keeping close to the center of the lane: φ1 := G[0,T](dlane[t] ≤ δ), where T is the length of a demonstration, dlane[t] is the distance of the car from the center of the lane at each step t, and δ is a small tolerance factor. The width of a typical highway lane in the US is 12 ft (3.66 m) and the average width of a large vehicle (e.g., an SUV or pickup truck) is 7 ft (2.13 m) (both based on USDOT highway and US vehicle specifications, e.g., the Ford F-150), which leaves about 2.5 ft (0.76 m) of room on either side of the vehicle. Hence, we chose 1 ft (0.3 m) as the tolerance factor to accurately track the lane center while still allowing a small margin for error.
Figure 4.7: Teleoperation demonstrations in CARLA.
2. Maintaining speed limits: φ2 := G[0,T](vmin ≤ vt ≤ vmax), where vt is the speed of the ego/host car at each timestep t, and vmin and vmax are the speed limits. Since it is a US highway scenario, vmax = 65 mph and vmin = 0 mph.

3. Maintaining a safe distance from any lead vehicle: φ3 := G[0,T](safety_flag[t] ≤ 0), where safety_flag[t] is a binary signal that outputs 0 if the ego is safe (i.e., there is no vehicle directly in front of the ego in the same lane closer than some threshold dsafe) and 1 otherwise. In OpenAI Gym-CARLA, the safe distance was set to 15 m.
For this scenario, we recorded 15 demonstrations, which were split uniformly across 3 batches, where the 5 videos in each batch showed a common behavior but the behaviors differed across batches. All videos were exclusive to their respective batches, i.e., no video was used in more than 1 batch, and each video was 30 seconds long on average.
User Study The recorded driving videos were used to perform a user study to determine if users
would rate the driving behavior similarly, thereby providing evidence that the graphs generated using
PeGLearn produce accurate ranking of specifications. Using the Amazon Mechanical Turk (AMT) platform,
we created a survey of the 3 batches representing the split of the 15 videos. The web-based questionnaire
showed a batch of 5 videos to a participant, where each video was accompanied by questions regarding the
task specifications: (i) performance of demonstrator on each of the 3 task specifications described in natural
language and (ii) ranking of specifications based on overall driving behaviors. A still of one of the videos
is shown in Figure 4.8. Users were presented with a dropdown menu in which each option was a Likert
rating from 1 (lowest) to 5 (highest). We then presented users with a question to rate the overall behavior
of all 5 videos in the batch w.r.t. the task specifications on a scale of 1 (lowest) to 3 (highest). Finally,
a control question was posed regarding the color of the car shown in the videos to test the attention of the users, since the colors of the cars were the same across all 15 videos. The graphs obtained via PeGLearn for each batch are shown in Figure 4.9.
Figure 4.8: A frame from one of the survey videos.
Figure 4.9: DAGs for the CARLA simulator experiment: (a) Batch 1; (b) Batch 2; (c) Batch 3.
For this online AMT survey, we initially recruited 150 human participants and took numerous measures
to ensure reliability of results. We posed a control question at the end to test their attention to the task, and
eliminated data associated with the wrong answer, including incomplete data, resulting in 146 samples.
All participants had an approval rating over 98% and the demographics are as follows: (i) 73 males, 72
females, 1 other, (ii) participant age ranged from 22 to 79 with an average age of 40.67, and (iii) average
driving experience of 22.4 years. Our survey collected the following information from each participant:
• Participant information: Number of years of driving experience, age, gender and experience with
video games.
• Ratings on a scale of 1 (worst) - 5 (best) for the queries/specifications: (i) driver staying close to the
lane center, (ii) driver maintaining safe distance to lead vehicle(s) and (iii) driver respecting speed
limits of the highway.
• Ratings on a scale of 1 (lowest) - 3 (highest) on the overall driving behavior shown in these 5 videos
and also how the participants would prioritize each of the specifications if they were driving in that
scenario.
This user study was approved by the University of Southern California Institutional Review Board on January 10, 2022. The goal of this study was first to ensure that the PeGLearn orderings were similar to expert orderings and not a random coincidence. The total number of possible orderings for the 3 specifications is 27 (= 3³), so for each video and participant, we also generated an ordering chosen uniformly at random from the space of 27 orderings. Based on this, we formulate the first hypothesis as:
Hypothesis 1. The similarity between human expert and PeGLearn orderings is significantly higher than
that of a random ordering.
Secondly, we wanted to investigate how the similarity between expert rankings and existing clustering techniques such as K-Means [17, 82] compares to that of our algorithm. The problem of finding weights for specifications resembles anomaly detection, where the bad demonstrations are outliers, and methods such as clustering, classification, or a combination of both can be employed [119]. Additionally, the weights for specifications that we seek to learn indicate the importance of the specifications, which is analogous to the importance/rank of features in classification tasks [46]. Hence, we use K-Means combined with SVM [17, 82] for comparison purposes, which we refer to as Km+SVM. Clustering is first employed to extract clusters of demonstrations based on their corresponding ratings. Note that the set of demonstrations can contain all good demonstrations (ones with positive ratings for all specifications), all bad demonstrations (ones with negative ratings for any specification), or a mixture of both types. Based on the inertia of the k-means clustering for this data, we found that k = 2 was optimal for each batch and provided the best fit. Then, an SVM was used to classify the cluster centroids and extract the weights. The magnitude of the weights indicates the relative importance/ranking of each feature (specification) [46], so we ranked the weights to compare with the PeGLearn rankings. Therefore, we formulate our second hypothesis as follows:
Hypothesis 2. The similarity between human expert and PeGLearn orderings is significantly higher than
that of K-means+SVM.
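For reference, a rough scikit-learn sketch of the Km+SVM baseline as described above; the exact configuration used in the study may differ, and the ratings matrix X below is hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    # X[d, i]: rating of demonstration d on specification i (hypothetical data).
    X = np.array([[0.9, 0.8, 0.7],
                  [0.8, 0.9, 0.6],
                  [-0.5, 0.2, -0.1],
                  [-0.4, 0.1, -0.2]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    svc = LinearSVC().fit(km.cluster_centers_, [0, 1])   # separate the two centroids
    weights = np.abs(svc.coef_).ravel()                  # feature-importance magnitudes
    ranking = np.argsort(-weights)                       # implied specification ranking
    print(weights, ranking)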
Analysis We first obtain the ratings, and hence the specification orderings, from all sources: participants, PeGLearn, Km+SVM and the uniform-random algorithm. We then compute the Hamming distance [50] between the human expert orderings and the orderings from (i) PeGLearn, (ii) Km+SVM, and (iii) uniform-random. The Hamming distance between any two sequences of equal length measures the number of element-wise disagreements or mismatches, and hence gives an estimate of how close any two orderings are. The distance is normalized to [0, 1], with 0 representing identical sequences and 1 indicating completely different sequences. To perform the statistical analysis, we introduce a few notations for convenience: (i) PH for the PeGLearn–human Hamming distance or error, (ii) KH for the Km+SVM–human Hamming error, and (iii) RH for the uniform-random–human Hamming error. We concatenate these errors under the name “Score” for analysis purposes. Note that the lower the Hamming distance or “Score”, the more similar the two orderings are. A two-way ANOVA was conducted to examine the effects
Figure 4.10: Comparison of specification orderings between humans and PeGLearn: (a) Batch A; (b) Batch B; (c) Batch C.
of agent type (i.e., {PH, KH, RH}) and batch number (i.e., {1, 2, 3}) on the “Score”. There was no statistically significant interaction between agent type and batch number for “Score”, F(4, 429) = 1.605, p = .172, partial η² = .015. Therefore, an analysis of the main effect of agent type was performed, which indicated a statistically significant main effect of agent type, F(2, 429) = 34.558, p < .001, partial η² = .139. All pairwise comparisons were run, where the reported 95% confidence intervals and p-values are Bonferroni-adjusted. The marginal means for “Score” were .475 (SE = .022) for PH, .724 (SE = .022) for KH and .660 (SE = .022) for RH. The difference in “Score” means between RH and PH was statistically significant, p < .001, showing support for H1. Similarly, the “Score” means for PH and KH differed significantly, p < .001, and the mean for KH was higher than that of PH, supporting H2. Lastly, there was no statistically significant main effect of batch number on “Score”, F(2, 429) = .172, p = .842, partial η² = .001.
The overall orderings from human experts and our PeGLearn algorithm for each AMT survey batch
are shown in Figure 4.10. To compare the ratings, we first normalized all the human and PeGLearn ratings
to be in the range [0, 3]. The user bars correspond to the human expert ratings, while auto represents the PeGLearn rating, which is deterministic; hence, there are no error bars.
Comparison with Km+SVM K-means typically has a complexity of O(mknt), where m is the number of data points (i.e., demonstrations), k is the number of components/clusters, n is the dimension of the data (i.e., the number of specifications) and t is the number of iterations. Linear SVM has linear complexity in m, and so the combination Km+SVM is still O(mknt). Since there are k = 2 components in our formulation, k is treated as a constant and the complexity is just O(mnt). Our algorithm, on the other hand, has a complexity of O(mn²) when using matrices to represent graphs. This shows that our algorithm not only performs better than clustering methods, but is also more efficient because, generally, the number of specifications is much smaller than the number of iterations to converge (n ≪ t). All the experiments and results show that our method can not only learn accurate rewards similar to the way humans perceive them, but it does so with a limited amount of even imperfect data. Similarly, we also performed experiments on a real-world surgical robot dataset, JIGSAWS, to demonstrate how human Likert ratings can be used to learn DAGs (described below). Additionally, together with the LfD-STL framework, we are able to learn temporal-based rewards, even in continuous and high-dimensional spaces, with just a handful of demonstrations.
JIGSAWS (Surgical Robot Dataset) To show the generalizability of our method to performance assessment metrics beyond those induced by temporal logics, we evaluated it on human ratings (e.g., Likert
scale) provided for various robotic surgical tasks [43] performed on a da Vinci Surgical robot system. As
described in the dataset, an expert surgeon provided evaluations or ratings for 8 different surgeons with
various expertise levels on 3 basic surgical tasks - knot-tying, needle-passing and suturing. There are 6
specifications for evaluating the performance of surgeons on the tasks and the ratings are measured on a
scale from 1 (lowest) to 5 (highest) for each specification. The 6 specifications or evaluation criteria are:
1. Respect for tissue (TR) - force exerted on tissue.
2. Suture/needle handling (SNH) - control while tying knots.
3. Time and motion (TM) - fluent motions and efficient.
4. Flow of operation (FO) - planned approach with minimum interruptions between moves.
5. Overall performance (P).
6. Quality of final product (Q).
Using this rating scheme, our method was able to generate a performance-graph for each class of
expertise as shown in Figure 4.11 for the knot-tying task. If the expertise levels were unknown, then the
generated graph would be as indicated in Figure 4.11d. We obtained similar performance graphs for the
other 2 surgical tasks - needle-passing and suturing. This shows how human evaluations can be used in
environments such as these where it is difficult even for experts to express the tasks in a formal language.
From the graphs, we can see that all 3 categories of surgeons showed maximum performance on (Q) and
(TM). However, there were differences in the other specifications. For example, experts had a higher rating
of (P) over (TR) compared to intermediates. One possible explanation for this is that experts typically
perform multiple consecutive surgeries, and so they optimize on the (P) and (FO) aspects compared to
(TR), while the intermediate-level surgeons are trainees who are still learning the nuances of surgeries
and are focusing more on the qualitative aspects such as (TR) over quantity and speed. Similar reasoning
can be applied to each category of surgeons using these DAGs.
Remark. We acknowledge that providing individual ratings for every demonstration–specification pair is tedious, since the complexity of manually specifying the performance graph is exponential, as elaborated in the proofs. This is because one needs to take into account not just the labels or ratings, but also the orderings (permutations) among those labels. In other words, a user needs to assign O(mn) ratings and also compare them with different permutations of those ratings, i.e., create relative priorities to specify the graph, which is exponential in O(n²). Thus, manually defining the graphs incurs a very large complexity, as shown. Our method eliminates much of this manual labor by using only minimal inputs from users, as it is much easier to provide individual labels than to compare all permutations of the labels. To further reduce human input, a potential solution we will consider for future work is to use deep temporal
Figure 4.11: DAGs for the Knot-Tying task. (a)–(c) DAGs for each level of expertise: Experts, Intermediates and Novices, respectively. (d) DAG for all surgeons, without discriminating expertise levels.
learning methods to learn from existing labeled data and predict the labels for newer demonstrations. We argue
that some form of human feedback would be necessary to provide formal guarantees in learning rewards since
it provides a ground truth baseline.
4.3 Summary
In this chapter, we presented the novel PeGLearn algorithm to capture the performance of demonstrations and provide intuitive, holistic representations of them in the form of graphs. The motivation behind PeGLearn was the assumption of a user-defined specification DAG in the prior LfD-STL framework, which presents users with a huge burden when the number of task specifications is large. We showed, through challenging experiments in robotic domains, that the inferred graphs could be directly applied to the existing
LfD-STL framework to extract rewards and robust control policies via RL, with a limited number of even
imperfect demonstrations. The user study conducted showed that our graph-based method produced more
accurate results than (un)supervised algorithms in terms of similarities with human ratings. We believe this
work facilitates the development of interpretable and explainable learning systems with formal guarantees, which is one of the prominent challenges today. Using intuitive structures such as DAGs to represent
rewards and trajectories would provide insights into the learning aspects of RL agents, as to the quality of
behaviors they are learning and can be used alongside/integrated with works in explainable AI.
So far, in Chapter 3 and Chapter 4, we have shown how the reward function and hence the resulting
policy is able to satisfy the task specifications. If safety is encoded in the task specifications, then the
extracted reward function is able to guarantee that the RL agent will not learn bad/undesirable behaviors.
However, this does not guarantee that the reward function is the best possible (or optimal) one. One might
wonder whether there exists a better reward function compared to the currently extracted one, and also
if the task can be solved optimally under the current set of specifications. In Chapter 5, we will explicitly
use the performance graphs to characterize this optimality and help the agent “discover” better reward
functions if feasible.
4.4 Bibliographic Notes
There have been a few neurosymbolic LfD methods that have tackled the problem defined in this chapter. A
counterexample-guided approach using probabilistic computation tree logics for safety-aware apprenticeship learning is proposed in [120], which makes use of DAGs to search for counterexamples. The authors
in [25] utilize model-predictive control to design a controller that imitates demonstrators while deciding
the trade-offs among STL rules. Similar to our prior works, they assume that the priorities or ordering
among the STL specifications are provided beforehand. An active learning approach has been explored
in [79], in which the rewards are learned by augmenting the state space of a Markov Decision Process
(MDP) with an automaton, represented by directed graphs. An alternative approach to characterize the
expressivity of Markovian rewards has been proposed in [4] and discussed in Chapter 3, which could provide interpretations for rewards in terms of task descriptions. Additionally, the methods in [4] have been
mainly developed for deterministic systems, while LfD-STL can also generalize to stochastic dynamics and
continuous spaces as seen in our experiments.
Causal influence diagrams of MDPs via DAGs for explainable online-RL have been recently investigated
[37]. Another related work by the authors of [75] make use of causal models to extract causal explanations
of the behavior of model-free RL agents. They represent action influence models via DAGs in which the
nodes are random variables representing the agent’s world and edges correspond to the actions. These
works are mainly focused on explainability in forward RL (i.e., when rewards are already known), whereas
in this thesis, we seek to generate intuitive representations of behaviors and rewards to be used later in
forward RL.
In the area of reward explanations for RL tasks, the method proposed in [62] decomposes rewards into
several components, based on which, the RL agent’s action preferences can be explained and can help in
finding bugs in rewards, as well as shaping rewards. Another work pertaining to IRL [10] uses expertscored trajectories to learn a reward function. This work, which builds on standard IRL, typically relies
on a large dataset containing several hundreds of nearly-optimal demonstrations and hence generating
scores for each of them. By Theorem 4.1.1, our method can also overcome any rank-conflicts arising out
of myopic trajectory preferences [18]. The authors in [103] have investigated the reward-explanation
problem in the context of human-robot teams wherein the robot, via interactions, learns the reward that is
known to the human. They propose 2 categories of reward explanations: (i) feature-space: where rewards
are explained through individual features comprising the reward function and their relative weights, and
(ii) policy-space: where demonstrations of actions under a reward function are used to explain the rewards.
Our work can be regarded as a combination of these categories since it uses specifications as features along
with inferred weights, and demonstrations.
An orthogonal research area is learning temporal logic specifications from demonstrations for interpretable and explainable AI (xAI). Learning task specifications from expert demonstrations using the principle of causal entropy and model-based MDPs has been explored by the authors of [113]. In [64], the
authors infer linear temporal logic (LTL) specifications from agent behavior in MDPs as a path to interpretable AL. A method for Bayesian inference of LTL specifications from demonstrations is provided by
the authors of [106]. Learning LTL formulas from demonstrations to explain multi-stage tasks has been
recently explored in [26]. Using the inferred specifications, suitable solvers can be used to generate control
policies that are able to satisfy the specifications, but are computationally very expensive.
Chapter 5
Apprenticeship Learning with STL
In teaching a robot various tasks for human assistance and collaboration, safety is a main priority. However,
solely focusing on the safety-enforcing specifications during learning can cause robots to have lower,
although satisfactory, performance on the efficiency-related specifications due to the diligence involved.
That is, the safety specifications may overshadow the robustness of the other soft specifications. Hence,
to enable robots to learn and adapt to various tasks quickly, it is also important to consider efficiency.
However, as discussed in previous chapters, the quality of the RL agent’s performance depends on
the quality of demonstrations. A main drawback of BC and IRL-based algorithms is that they rely on
demonstrations being optimal, which is seldom the case in real-world scenarios. The more recent IRL and
BC-based methods that learn from suboptimal demonstrations [123, 122, 10, 23, 20] measure optimality or
performance based on statistical noise deviation from the true/optimal demonstrations, requiring access
to a large dataset of demonstrations. However, such noise-to-performance measures are extracted empirically and hence lack formal reasoning that can explain the quality of behaviors. Furthermore, in IRL, the
rewards are inherently Markovian, and they do not account for temporal dependencies among subgoals in
demonstrations. Research in reward design [4, 92] discusses the need for non-Markovian reward representations, especially in time-dependent multi-goal RL settings. Such non-Markovian rewards are typically designed using split-MDPs [3] and reward machines [22, 112], which require significantly increasing the state and/or action spaces of the MDPs, thereby increasing the space and computational complexity of the underlying RL algorithms.
To address these limitations, in the previous chapters, we have used temporal logic task specifications
to infer reward functions by evaluating demonstrations. While the enhanced framework can offer assurances in safety of the learned rewards and policy, it does not explicitly reason about the performance of
the learned RL policy. This is because LfD-STL is an open-loop reward-inference framework in which the inferred rewards are fixed and receive no feedback for improvement based on agent exploration. In this chapter, we aim to address this issue by using the performance graph as a metric, which we refer to as the performance-graph advantage (PGA), to guide the RL process. We draw inspiration from apprenticeship learning (AL) [2], wherein the reward function and RL policy are learned iteratively. We propose the AL-STL framework that extends LfD-STL with closed-loop learning to update the reward function and policy
concurrently. PGA can be interpreted as the quantification of the areas for improvement of the policy, and
is optimized alongside appropriate off-the-shelf RL algorithms. This enables extrapolation beyond suboptimal demonstrations (i.e., reasoning about possibly new and better behaviors that were not demonstrated
before) while still satisfying the task specifications. The key insight of this framework is that a cumulative/collective measure of (multiple) task objectives along with exploration in the neighborhood of observed
behaviors guides the refinement of rewards and policies that can extrapolate beyond demonstrated behaviors.
Our contributions are summarized as follows. In Section 5.1, we develop AL-STL, a novel extension to the LfD-STL framework that enables closed-loop RL wherein the reward function and policy are learned concurrently. We quantify the STL-based performance graphs learned via PeGLearn from Chapter 4 in terms of an advantage function to guide the RL training process, and formally reason about policy improvements when demonstrations are suboptimal. Finally, in Section 5.2, we evaluate this framework on a variety of robotic
tasks involving mobile navigation, fixed-base manipulation and mobile manipulation, and discuss how it
outperforms prior literature.
5.1 AL-STL Framework
We start by first describing the problem of learning rewards and policies that maximally satisfy all the
task specifications. As in prior chapters, for an MDP, we are given: (i) a finite dataset of demonstrations
Ξ = {ξ1, ξ2, · · · , ξm} and (ii) a set of specifications Φ = {φ1, φ2, · · · , φn} unambiguously expressing
the tasks to be performed. The objective is to infer rewards and extract a behavior or control policy for
an agent such that its behavior is at least as good as, or better than, the demonstrations, and maximizes the satisfaction of the task specifications. In other words, the satisfaction of the task specifications is conveyed through learned reward functions that the reinforcement learning agent seeks to maximize.
More formally, consider a policy π under the reward function R that captures the degree of satisfaction of Φ. Let τ indicate a trajectory obtained by a rollout of π in an RL episode. Then, our objective is to find

    π∗, R∗ = argmax_{π,R} E_{τ∼π} [ Σ_{i=1}^{n} ρ(φi, τ) ]
5.1.1 Performance Graphs as the Optimization Objective
Since every trajectory τ is characterized by its associated performance DAG Gτ, where the value of a vertex indicates the robustness of the specification it represents (Chapter 3), the summation term is essentially the sum of all vertices. We thus define VSτ := Σ_{i=1}^{n} ν(φi) = Σ_{i=1}^{n} ρ(φi, τ). Then the objective is:

    π∗, R∗ = argmax_{π,R} E_{τ∼π} [VSτ]
An issue with this formulation occurs when there are multiple task specifications, i.e., n > 1. This
results in multi-objective learning, which can introduce conflicting specifications and hence requires optimal trade-offs. For example, in autonomous driving or in robot manipulation, consider the task of reaching
a goal location as quickly as possible while avoiding obstacles. Depending on the obstacle locations, performing highly safe behaviors (i.e., staying as far away from obstacles as possible) might affect the time
to reach the goal. Similarly, a behavior that aims to reach the goal in the least time will likely need to
compromise on its safety robustness. We thus need to find the behaviors that not only maximize the total
robustness, but are also maximally robust to each task specification. We illustrate this with Example 2.
Example 2. Consider a task with three specifications Φ = {φ1, φ2, φ3}, and consider two trajectories τ1 and
τ2 with robustness vectors [3, 0, 1] and [2, 1, 1], respectively. The reward function inferred with τ1 will have
the weight for φ1 dominate φ2 due to the exponential (softmax) component, while the reward function for τ2
will have more uniform weights over all specifications, albeit with a little bias towards φ1 versus others. Thus,
while both have the same VS, τ2 is overall more robust w.r.t. all the task specifications due to better trade-offs.
By this reasoning, it is more desirable to not only maximize the overall sum, but also maximally satisfy the individual specifications with suitable trade-offs. So, how do we ensure that optimal trade-offs are achieved while maximizing the main objective? By observation, it is straightforward to deduce that the sum of absolute pairwise differences in the robustness of specifications must be minimized. This sum is exactly encoded by the edges of our trajectory DAG (performance graph) formulation, which is a unique characteristic. Recall that the edge between two nodes (specifications) indicates the difference in their robustness values (performance). We thus capture the optimal trade-offs for a trajectory τ with the sum of all edges in its corresponding DAG Gτ, given by ESτ = Σ_{e∈Gτ} e; each edge is defined in Chapter 4. Both VS and ES can be computed in linear time using the same DAG, without additional computational overhead. One might wonder if merely minimizing ES is sufficient for finding the optimal trade-offs. We provide a counterargument in Example 3.
Example 3. Consider the same task from Example 2, but with two different trajectories τ3 and τ4 with robustness vectors [1, 1, 1] and [−1, −1, −1], respectively. Since all the specifications are equally weighted, the ES for both trajectories is the same (= 0). But clearly, τ3 is more robust than τ4 due to the higher VS. Furthermore, consider another trajectory τ5 with vector [−1, 2, −1], whose ES is 6 (i.e., an edge weight is the pairwise difference when sorted). Between τ4 and τ5, the RL agent will prefer τ4 due to the lower ES, which is undesirable.

From both examples, we conclude that the objective is to maximize VS while minimizing ES. Our new formulation is

    π∗, R∗ = argmax_{π,R} E_{τ∼π} [VSτ − ESτ]
As both VS and ES are dependent on each other, this optimization trade-off can be written as:

    π∗, R∗ = argmax_{π,R} E_{τ∼π} [VSτ − λ · ESτ]        (5.1)
The constant λ ∈ [0, 1) acts as a regularizer to penalize behaviors with dominant specifications, as in Example 2, and is a tunable hyperparameter. The formulation is intuitive because we want to extract the optimal DAG, which has no edges. Recall that edges are added only if there is a difference between the node values (i.e., robustness). Ideally, if the policy is optimal, then every rollout has the same maximum robustness for all of Φ, and so no edges are created. This representation offers the unique ability of providing an intuitive graphical representation of behaviors for interpretability, as discussed in Chapter 4, while also yielding an optimization problem. As the robustness of each specification is bounded in [−∆, ∆], the VS for any trajectory is bounded in [−n∆, n∆]. At either limit, the ES is 0, indicating that all nodes in the resulting global DAG G have equal weights (= 1/n) at the extrema. We will refer to the term (VSτ − λ · ESτ) as the performance-graph advantage (PGA). Analogous to the advantage function in RL, the PGA provides information about the scope for improvement (extra possible rewards) under the current reward function and policy. During RL, the PGA can either be used as a bonus term alongside the episode returns or augmented into the gradient-ascent formulation.
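A minimal Python sketch of computing VS, ES and the PGA directly from a rollout’s robustness vector, re-using the pairwise-gap construction of the local DAGs from Chapter 4 (the threshold ϵ = 0.05 and λ = 0.5 are illustrative defaults):

    import numpy as np

    def pga(robustness, eps=0.05, lam=0.5):
        """Performance-graph advantage VS - lambda * ES for one rollout."""
        rho = np.asarray(robustness, dtype=float)
        vs = rho.sum()                        # VS: sum of all node values
        gaps = rho[:, None] - rho[None, :]    # gaps[i, j] = rho_i - rho_j
        es = gaps[gaps >= eps].sum()          # ES: sum of all DAG edge weights
        return vs - lam * es

    # Trade-off from Example 2: [2, 1, 1] scores higher than [3, 0, 1].
    print(pga([3, 0, 1]), pga([2, 1, 1]))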
5.1.2 Framework and Algorithm
Figure 5.1: AL-STL framework with the performance-graph advantage (PGA). The task’s STL specifications Φ and demonstrations Ξ initialize the frontier F; PeGLearn extracts the global graph G and the rewards, RL with the PGA collects rollouts into the candidate C, and the frontier is updated until the final reward R∗ and policy π∗ are returned.
We now describe our proposed framework, shown in Figure 5.1, which closes the RL training loop to extract both a reward function and a policy that optimally satisfy Φ, resembling apprenticeship learning. The corresponding pseudocode is given in Algorithm 4. Analogous to the replay buffer in RL, we introduce storage buffers for the reward model: (i) the frontier F, containing the best episode rollouts of the agent so far, and (ii) the candidate C, containing the rollouts under the current reward and policy together with their PGAs. Initially, the frontier is populated with the demonstrations (line 2), from which the global DAG G, and hence the reward function, is extracted via PeGLearn (line 5). RL is performed with the learned rewards, and each rollout is associated with its PGA, which is optimized either in the episode returns or in the loss. Upon updating the policy, multiple rollouts are collected in the candidate buffer (loop on line 8), and the frontier is updated by comparing the overall PGAs of the frontier and the candidate based on a strategy (line 11) that we describe in Section 5.1.2.1. This loop, shown by the yellow background in Figure 5.1, continues for a finite number of cycles or until the frontier can no longer be updated. At this stage, the reward and policy representing the frontier optimally satisfy Φ, as we discuss in Section 5.1.2.2.
Algorithm 4: STL-Guided Apprenticeship Learning
Input: Ξ := set of demonstrations; Φ := set of specifications
Result: R* := reward function; π* := a policy
1  begin
2      F ← Ξ                                  // Initialize frontier
3      converged ← ⊥
4      while ¬converged do
5          R ← PeGLearn(F, Φ)                 // reward function from rollouts in F
6          C ← ∅                              // Initialize candidate
7          π ← perform RL with PGA
           // Rollout k trajectories from π and add them to C
8          for i ← 1 to k do
9              τ_i ← ⟨(s_t, a_t ∼ π(s_t))⟩_{t=0}^{T}
10             C ← C ∪ {τ_i}
11         converged ← Update(C, F)
12     return R* = R, π* = π
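The Python sketch below mirrors the control flow of Algorithm 4. The helpers peglearn, train_rl_with_pga, rollout and update_frontier are placeholders for the components described in this chapter, not the actual implementation, and the defaults are only examples.

    def al_stl(demonstrations, specs, peglearn, train_rl_with_pga, rollout,
               update_frontier, k=5, max_cycles=25):
        """Sketch of the AL-STL loop (Algorithm 4); all callables are placeholders."""
        frontier = list(demonstrations)       # line 2: initialize frontier with demos
        policy, reward = None, None
        for _ in range(max_cycles):           # run for a finite number of cycles
            reward = peglearn(frontier, specs)         # line 5: reward from frontier
            candidate = []                             # line 6: reset candidate buffer
            policy = train_rl_with_pga(reward, specs)  # line 7: RL with PGA bonus
            for _ in range(k):                         # lines 8-10: collect rollouts
                candidate.append(rollout(policy))
            # line 11: Update returns convergence and the (possibly merged) frontier.
            converged, frontier = update_frontier(candidate, frontier)
            if converged:
                break
        return reward, policy                  # line 12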
5.1.2.1 Frontier Update Strategies
F and C contain rollouts that are associated with their PGAs. We define an operator ⊙ ∈ {min, max, mean}, and therefore the metrics $\hat{F} \doteq \odot\{PGA(\tau) \mid \tau \in F\}$ and $\hat{C} \doteq \odot\{PGA(\tau) \mid \tau \in C\}$. To update the frontier, we propose the strategic merge operation as follows:
(a) We first compare whether $\hat{C} > \hat{F}$, i.e., whether the trajectories with the newly explored PGAs are better than the current best trajectories in F. The operator ⊙ acts as the criterion for filtering poorly performing trajectories.
(b) If so, we retain the trajectories in F ∪ C whose PGAs are greater than $\hat{F}$ and discard the others; the resulting trajectories form the new F. Formally, $F \leftarrow \{\tau \mid PGA(\tau) > \hat{F},\ \tau \in F \cup C\}$. That is, the quality of the worst rollouts in F, under the ⊙ criterion, is improved.
(c) Otherwise, F already has the best trajectories so far and is left unaltered. If the statistics ⊙ for F and C are similar (i.e., their difference is below some threshold) upon sufficient exploration, then convergence is achieved.
In theory, with unbounded memory, the frontier would be able to keep all the best-performing trajectories. For practical implementations, both buffers are bounded (say, to size p), so in our experiments we keep the top-p trajectories in the frontier. The strategic merge is not the only way to maintain the buffer; however, it offers some performance guarantees, as we show in Section 5.1.2.2. One could consider a naïve approach of simply merging all the trajectories in both buffers without any filtering criteria. Alternatively, one could replace all the trajectories in F with those in C, which also exhibits monotonic improvement in the RL policy.
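Below is a minimal Python sketch of the strategic merge with ⊙ = mean and a bounded, top-p frontier; the function signature, tolerance and buffer size are illustrative rather than the exact implementation.

    import statistics

    def strategic_merge(candidate, frontier, pga, p=5, tol=1e-3):
        """Illustrative Update(C, F): returns (converged, new_frontier).

        candidate, frontier: lists of rollouts; pga(rollout) -> float.
        """
        f_hat = statistics.mean(pga(t) for t in frontier)   # statistic of F
        c_hat = statistics.mean(pga(t) for t in candidate)  # statistic of C

        if c_hat > f_hat + tol:
            # Step (b): keep only rollouts from F ∪ C that beat the old statistic.
            merged = [t for t in frontier + candidate if pga(t) > f_hat]
            # Bounded buffer: retain the top-p rollouts by PGA.
            merged.sort(key=pga, reverse=True)
            return False, merged[:p]

        # Step (c): F is unchanged; converge once the statistics are close.
        converged = abs(c_hat - f_hat) <= tol
        return converged, frontier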
5.1.2.2 Policy Improvement Analysis
In order to analyze Algorithm 4 and show policy improvement, we make certain assumptions about the
task and RL models:
(a) The specifications accurately represent the task.
(b) The task can be completed, regardless of optimal behavior, with the given MDP configurations and task
specifications. Our algorithm requires at least one demonstration that satisfies all specifications, but this
demonstration is not required to be optimal.
(c) The RL agent always has an active exploration component (stochastic policy or an exploration rate)
to cover the MDP spaces. This not only helps in discovering new policies, but also helps learn more
accurate reward models. Theoretically, with infinite timesteps, the RL agent will have fully explored the
environment spaces to find the optimal policy [109]. In practice, the timesteps are set to a large finite
value for majority coverage of the spaces.
Here, we describe how the strategic merge functionality exhibits policy improvement. From Section 5.1.2.1, the new F contains the set of trajectories given by $F = \{\tau \mid PGA(\tau) > \hat{F},\ \tau \in F \cup C\}$. For the purpose of this proof, we will consider ⊙ to be the mean. Then, $\hat{F} = \frac{\sum_{\tau \in F} PGA(\tau)}{|F|}$ and $\hat{C} = \frac{\sum_{\tau \in C} PGA(\tau)}{|C|}$. We know that F is updated in the Update function when $\hat{C} > \hat{F}$; since $\hat{C} > \hat{F}$, we can write $\hat{C} = \hat{F} + k$, where $k > 0$. Let $\hat{F}'$ be the mean of the intermediate set $F' = F \cup C$. Then,
$$\hat{F}' = \frac{\sum_{\tau \in F'} PGA(\tau)}{|F'|} = \frac{\sum_{\tau \in F} PGA(\tau) + \sum_{\tau \in C} PGA(\tau)}{|F| + |C|} = \frac{|F|\,\hat{F} + |C|\,\hat{C}}{|F| + |C|} = \hat{F} + \frac{|C|\,k}{|F| + |C|} \qquad (5.2)$$
Now, let $\hat{F}''$ be the new mean after filtering $l < (|F| + |C|)$ rollouts whose PGA $\leq \hat{F}$ from the merged set $F'$:
$$\hat{F}'' = \frac{|F'|\,\hat{F}' - \sum\{PGA(\tau) \mid \tau \in F',\ PGA(\tau) \leq \hat{F}\}}{|F'| - l}$$
In the worst case, all $l$ filtered trajectories have PGAs of exactly $\hat{F}$, so
$$\hat{F}'' \geq \frac{|F'|\,\hat{F}' - l\,\hat{F}}{|F'| - l} = \frac{(|F| + |C|)\,\hat{F}' - l\,\hat{F}}{|F| + |C| - l} = \hat{F} + \frac{|C|\,k}{|F| + |C| - l} \qquad \text{(substituting from Equation 5.2)} \qquad (5.3)$$
As $l < |F| + |C|$, the denominator $(|F| + |C| - l) > 0$. Thus, in Equation 5.3, the second term is always positive, which proves that our algorithm improves the policy and reward in each cycle under the exploration assumption. A special case of Equation 5.3 is when F is completely replaced by C, i.e., when all l filtered trajectories belong to F; then l = |F| and F inherits the higher mean from C. The frontier remains unchanged when either the demonstrations or the rollouts in F at the end of each training cycle are already optimal. We can apply similar reasoning to the other operators for ⊙. In the case of max, the frontier's maximum value always inherits the maximum (i.e., the best rollouts) from the candidate. For min, only the least-performing trajectories are discarded and the second-to-least ones become the new minimum in F. Since the PGA of rollouts in F is upper-bounded by n∆, our method keeps improving the policy towards this maximum. However, this does not guarantee that the maximum value can always be achieved, due to several factors: conflicting specifications causing trade-offs, the environment configuration, the solvability of the MDP under the given specifications, etc.
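As a quick numerical illustration of the bound in Equation 5.3, the sketch below builds random frontier and candidate PGA values, applies the mean-based strategic merge, and checks that the filtered mean exceeds $\hat{F}$ by at least $|C|k/(|F| + |C| - l)$. The numbers are synthetic placeholders.

    import random
    import statistics

    random.seed(0)
    frontier = [random.uniform(-1.0, 0.5) for _ in range(5)]   # PGAs of rollouts in F
    candidate = [random.uniform(0.0, 1.0) for _ in range(5)]   # PGAs of rollouts in C

    f_hat = statistics.mean(frontier)
    c_hat = statistics.mean(candidate)
    assert c_hat > f_hat, "merge only triggers when the candidate statistic is better"

    k = c_hat - f_hat
    merged = frontier + candidate
    kept = [v for v in merged if v > f_hat]            # strategic merge filter
    l = len(merged) - len(kept)                        # number of filtered rollouts

    lower_bound = f_hat + len(candidate) * k / (len(frontier) + len(candidate) - l)
    print(statistics.mean(kept), ">=", lower_bound)
    assert statistics.mean(kept) >= lower_bound - 1e-9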
5.1.2.3 Effect of Affine Transformations to Rewards
In practice, RL is sensitive to hyperparameter settings, environment stochasticity, the scales of rewards and observations, and other algorithmic variances [36]. Hence, in our experiments, we normalize observations and rewards using affine transforms. Note that applying affine transformations to the reward function does not alter the optimal policy [85], as we show below.
Lemma 5.1.1. The optimal policy is invariant to affine transformations in the reward function.
Proof Sketch. From [109], we have the definition of the Q function as follows, for the untransformed reward
function R:
$$Q(s, a) \doteq \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] \qquad (5.4)$$
$$Q(s, a) \doteq R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a') \qquad (5.5)$$
We consider two cases of reward function affine transformations in our work: (a) scaling by a positive
constant and (b) shifting by a constant. In both these cases, our objective is to express the new Q function
in terms of the original. Note that we abbreviate R(s, a) to just R for simplicity.
Case (a): Scaling R by a positive constant. Let the scaled reward function be defined as $R' = c \cdot R$, with $c > 0$. The new Q function is then
$$Q'(s, a) \doteq \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R'_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} c\, R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
$$= c \cdot \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] = c \cdot Q(s, a)$$
Thus we see that the new Q function scales with the scaling constant.
From Equation 5.5 and by substituting for $Q'$ from the above result, we have
$$Q'(s, a) \doteq R'(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q'(s', a')$$
$$c \cdot Q(s, a) = c \cdot R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} \big( c \cdot Q(s', a') \big)$$
$$c \cdot Q(s, a) = c \cdot R(s, a) + c\,\gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')$$
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')$$
Thus the Bellman equation holds indicating that the policy is invariant to scaling by a positive constant.
Case (b): Shifting R by a constant. Let the shifted reward function be defined as $R' = R + c$. The new Q function is then
$$Q'(s, a) \doteq \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R'_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} \left( R_{t+k+1} + c \right) \,\middle|\, S_t = s, A_t = a\right]$$
$$Q'(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] + \sum_{k=0}^{\infty} \gamma^{k} c$$
$$Q'(s, a) = Q(s, a) + \frac{c}{1 - \gamma}$$
Thus we see that the new Q values get shifted by the constant.
From Equation 5.5 and by substituting for $Q'$ from the above result, we have
$$Q'(s, a) \doteq R'(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q'(s', a')$$
$$Q(s, a) + \frac{c}{1 - \gamma} = R(s, a) + c + \gamma \sum_{s'} P(s, a, s') \max_{a'} \left( Q(s', a') + \frac{c}{1 - \gamma} \right)$$
$$Q(s, a) + \frac{c}{1 - \gamma} = R(s, a) + c + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a') + \gamma \sum_{s'} P(s, a, s')\, \frac{c}{1 - \gamma}$$
$$Q(s, a) + \frac{c}{1 - \gamma} = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a') + c + \frac{c\,\gamma}{1 - \gamma}$$
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')$$
Thus the Bellman equation holds, indicating that the policy is invariant to shifting by a constant.
Therefore, any combination of scaling and shifting does not affect the optimal policy in our work. Similarly, the optimal policy is known to be invariant to reward shaping with potential functions [85].
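The following sketch checks this invariance numerically on a small random MDP: it runs value iteration under the original, scaled, and shifted rewards and compares the resulting greedy policies. The MDP here is synthetic and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 4, 3, 0.9
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
    R = rng.normal(size=(nS, nA))                   # original reward R(s, a)

    def greedy_policy(R, iters=500):
        """Greedy policy from value iteration on Q = R + gamma * sum_s' P * max_a' Q."""
        Q = np.zeros((nS, nA))
        for _ in range(iters):
            Q = R + gamma * P @ Q.max(axis=1)
        return Q.argmax(axis=1)

    base = greedy_policy(R)
    assert np.array_equal(base, greedy_policy(3.0 * R))    # scaling by c > 0
    assert np.array_equal(base, greedy_policy(R + 7.0))    # shifting by a constant
    print("greedy policy unchanged under affine reward transforms:", base)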
5.2 Experiments
Our proposed framework is evaluated on grid-worlds and a diverse set of robotic simulation tasks (Figure 5.2): (i) reaching a desired pose with the end-effector, (ii) placing an object at a desired location, (iii) opening doors, (iv) safety-aware mobile navigation and (v) closing cabinets with a mobile manipulator. In all experiments, the task specifications only monitor the observed states, so the rewards are a function of just the states. The STL specifications are evaluated using RTAMT [87]. The reward function is modeled by regression with either fully connected neural networks or Gaussian processes. All experiments are performed on an Ubuntu desktop with an Intel® Xeon 8-core CPU and an Nvidia Quadro RTX 5000 GPU. For each environment, m = 5 demonstrations are generated by training an appropriate RL agent under an expert dense reward function. In these domains, every RL episode features a unique/randomized target, and hence the collected demonstrations are unique (i.e., their states do not overlap); additionally, these simulations implicitly model noise in the environment, which makes it challenging to provide optimal trajectories. In all tasks, unless explicitly stated, the frontier is updated by completely replacing its contents with the candidate (i.e., the special case of Equation 5.3), and we set |F| = |C| = 5. Furthermore, in all tasks, the trained policy is evaluated on 5 random seeds drawn from the baselines for comparison. For each seed, 20 trials are performed, totaling 100 test scenarios; the mean success rates are then reported.
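As an illustration of the reward-model regression step, the sketch below fits a Gaussian-process regressor with a Scale (constant) × RBF kernel and a small fully connected network to map states to scalar reward labels; the data are random placeholders and the model sizes are only examples of the configurations listed later in Table 5.1, not the exact training code.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ConstantKernel, RBF
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    states = rng.normal(size=(200, 6))          # placeholder state features
    rewards = rng.normal(size=200)              # placeholder DAG-derived reward labels

    # Gaussian-process reward model with a Scale * RBF kernel.
    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    gp.fit(states, rewards)

    # Fully connected reward model (e.g., two hidden layers of 200 units).
    fcn = MLPRegressor(hidden_layer_sizes=(200, 200), max_iter=2000)
    fcn.fit(states, rewards)

    query = rng.normal(size=(1, 6))
    print("GP reward:", gp.predict(query)[0], "FCN reward:", fcn.predict(query)[0])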
Task - Discrete-Space Frozenlake We make use of the Frozenlake (FL) deterministic environments from OpenAI Gym [19], which consist of 4x4 or 8x8 grid-worlds with a reach-avoid task.
[Figure 5.2: Overview of the robot simulation environments: (a) Panda Pose Reaching, (b) Needle Pose Reaching, (c) Pick-and-Place, (d) Door Opening, (e) Reach-Avoid, (f) Cabinet Closing. The task in (d) uses the Nvidia Isaac simulator.]
Informally, the task specifications are to (i) eventually reach the goal, (ii) always avoid unsafe regions, and (iii) take as few steps as possible, similar to the ones used for the experimental evaluations in Chapter 3. In these small environments, m = 5 demonstrations of varying optimality are manually generated. We use A2C as the RL agent and show the training results in Figure 5.3. The left figures show the statistics of the rollout PGAs and the evolution of weights over time. The right figures show the accumulated rewards and episode lengths.
We see from the left figures that, initially, the non-uniform weights of the specifications correspond to the suboptimal demonstrations. Over time, the weights all converge to 1/3, indicating that there are no edges in the final DAG, while the PGAs of rollouts from the final policy are at their maximum, as hypothesized. Since the environments are deterministic, the final policy achieves a 100% success rate. As this task can be achieved even with IRL-based methods, we compare the number of demonstrations required. Under identical conditions, the minimum number of demonstrations used by MCE-IRL is 50 for the 4x4 grid and 300 for the 8x8 grid. The algorithms in [120, 5] use over 1000 demonstrations in the 8x8 grid, even though they use temporal logic specifications similar to ours; this is due to the unsafe regions being scattered over the map, requiring the desirable dense features to appear very frequently. This clearly shows that the choice of the reward inference algorithm plays a significant role in sample complexity.
Task - Reaching Pose The end-effector of a Franka Panda robot [42] is required to reach the target pose as quickly as possible, the specifications for which are given as: φ1 := F(d < δ) and φ2 := G(t < T), where d is the ℓ2-norm of the difference between the end-effector and target poses, δ is a small threshold to determine success, and T is the desired time in which the target must be achieved. For evaluation on a more precise environment, we use a surgical robot environment, SURROL [117], that is built on the da Vinci Surgical Robot Kit [65]. In this common surgical task, a needle is placed on a surface and the goal is to move the end-effector towards the needle center. The specifications for this task follow the same template above; however, the threshold is very small, i.e., δ = 0.025, requiring highly precise movements.
[Figure 5.3: AL-STL results for the 4x4 and 8x8 Frozenlake environments: (a) FL4x4 weights, (b) FL4x4 training summary, (c) FL8x8 weights, (d) FL8x8 training summary.]
The reward function was modeled by a neural network and the RL agent used SAC [47] with hindsight experience replay (HER) [9]. To validate reproducibility, the training and evaluation were performed over 5 random seeds using the same 5 demonstrations. The results for both environments are shown in Figure 5.4. The first column shows the PGA over time or cycles (note the scale of the y-axis). The learned policies in both environments achieve PGA ≈ 2, since there are 2 specifications. The second column represents the specification weights. In the surgical task, the final weights are uniform, as desired, due to the small room for error, while the Panda task has a larger threshold for completion, which affects the resolution of the smooth STL semantics, though all tasks are completed successfully. The hyperparameters for both tasks, Panda-Reach and Needle-Reach, were nearly identical. The specifications for both tasks are:
1. Reaching the target pose: φ1 := F(∥ee_pose − target_pose∥ ≤ δ), where ee indicates the end-effector and δ is the threshold used to determine success. For Panda-Reach, δ = 0.2 and for Needle-Reach, δ = 0.025.
2. Reaching the target as quickly as possible: φ2 := G(t <= 50), where t is the time at which the end-effector reaches the target.
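To show what the quantitative semantics of these specifications look like on a rollout, the sketch below computes discrete-time robustness values directly from a synthetic trajectory; RTAMT produces the same kind of values. The trajectory is a random placeholder, and the handling of φ2 (the time at which the target is reached) is a simplified reading of the specification.

    import numpy as np

    def rob_eventually(margins):
        """Robustness of F(pred): best (maximum) predicate margin over the horizon."""
        return float(np.max(margins))

    # Synthetic rollout: end-effector positions over T timesteps (placeholder data).
    T, delta, T_max = 60, 0.2, 50
    rng = np.random.default_rng(0)
    ee = np.cumsum(rng.normal(scale=0.05, size=(T, 3)), axis=0)
    target = np.array([0.3, -0.2, 0.5])
    dist = np.linalg.norm(ee - target, axis=1)

    # phi1 := F(||ee - target|| <= delta): per-step margin is (delta - distance).
    rho_phi1 = rob_eventually(delta - dist)

    # phi2 := G(t <= 50), simplified: margin based on the first time the target is reached.
    t_reach = int(np.argmax(dist <= delta)) if np.any(dist <= delta) else T
    rho_phi2 = float(T_max - t_reach)

    print("rho(phi1) =", rho_phi1, " rho(phi2) =", rho_phi2)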
[Figure 5.4: Summary of training and evaluations for the pose-reaching tasks: (a) Panda-Reach, (b) Needle-Reach.]
In both tasks, using just 5 demonstrations, AL-STL achieved over a 99% mean success rate in both training (right figures) and evaluation. For Needle-Reach, the baselines [117, 56] that used BC and IRL required 100 expert demonstrations. It is shown in [56] that, when the number of demonstrations is reduced to just 10, which is still 2x larger than ours, the success rate drops drastically. For Panda-Reach, the authors of [49] show that imitation learning outperforms adversarial IRL techniques when each method uses 50 demonstrations, though both eventually learn to succeed in the task. This is still 10x more than the number of samples required by our work.
Task - Placing Cube A Franka Panda robot is required to pick up a cube on a table (Figure 5.2c) and place it at the desired location [42]. Only 4 of the 5 demonstrations were successful. The specifications are:
1. Placing the cube at the target pose: φ1 := F(∥cube_pose − target_pose∥ ≤ 0.05).
2. Reaching the target as quickly as possible: φ2 := G(t <= 50), where t is the time at which the end-effector reaches the target.
The specifications indicate that the distance between the cube and the desired pose must fall below a threshold, and the robot must achieve this as quickly as possible. The RL agent used TQC [68] with HER [9], achieved a training success rate of 98% (Figure 5.5), and converges to a high success rate after just 3 cycles. The resulting policy achieved a success rate of 96% in the test trials. The statistics of the PGA show that its maximum value is ≈ 6, since there are 2 specifications, each scaled by a factor of 3. The task specification is significantly challenging because it only describes that the cube be placed at the desired pose. In other words, the RL agent must learn the sequence of elementary behaviors: reach, grasp and move to the desired location while holding the cube, just from the 5 demonstrations. Another remarkable finding in our work (shown in the supplemental video of [96]) is that the policy learns to (i) correctly pick the cube and place it at the target whenever the target height is above the table and (ii) push/drag the cube when the target is on the same table surface. This shows that our algorithm combines RL exploration and the graph advantage to possibly learn specification-satisfying behaviors that were not observed before. Under identical training conditions, excluding reward model-specific hyperparameters, the number of demonstrations used for this task in the baselines that achieved comparable success rates are: 100 for MCAC [116], between 4 and 16 for OPIRL [55], 20 for goalGAIL [31] and 50 for ROT [49].
[Figure 5.5: Summary of training and evaluations for the Cube-Placing task: (a) RL training summary, (b) PGA and weights, (c) success rates on test trials.]
Task - Opening Door A Panda robot, mounted on a pedestal (Figure 5.2d), is required to open a door [121]. Only 3 of the 5 demonstrations were successful. The task is successful if the door hinge is rotated beyond θ = 0.3 rad. The Panda robot uses operational space control to control the pose of the end-effector. The horizon for this task is 500 and the control frequency is 20 Hz. The specifications for this task are:
1. Opening the door: φ1 := F(∠door_hinge ≥ 0.3). The angle is measured in radians.
2. Reaching the door handle: φ2 := F(∥ee − door_handle∥ < 0.02); the end-effector should be within 2 cm of the door handle.
3. Reaching the target as quickly as possible: φ3 := G(t <= 500), where t is the time at which the door is opened beyond 0.3 rad.
The elementary behaviors to be learned are: reaching the door handle, turning the handle to unlock the door, and pulling to open the door. This is a non-trivial task for expert reward design, as the reward must capture all these elementary behaviors and compose them sequentially. Since this is a more challenging task, the frontier was updated with the strategic merge, and the size of the reward buffers was set to 20 to collect more rollouts. The RL agent used TQC and was trained for 25 cycles to achieve a success rate of 98% (Figure 5.6). In the evaluations, the resulting policy achieved a success rate of 100%.
[Figure 5.6: Summary of training and evaluations for the Door-Opening task: (a) RL training summary, (b) PGA and weights, (c) success rates on test trials.]
We compare our work with two state-of-the-art baselines, MCAC [116] and OPIRL [55], which have been shown to outperform maximum entropy and adversarial IRL-based methods. Under identical training conditions, while both these methods successfully complete this task, MCAC used 100 demonstrations, while OPIRL used between 4 and 16. OPIRL had significantly more variance (i.e., unstable learning) with 4 demonstrations compared to using 16. Furthermore, OPIRL uses a substantially larger reward buffer size of 2·10^6 to compensate for the limited demonstrations, while ours uses 2·10^4 (i.e., |F| = |C| = 20, each trajectory of length 500). This shows that our method is more sample-efficient compared to IL and IRL.
Task - Safe Mobile Navigation In this task (Figure 5.2e) [60], a mobile robot navigates to the goal while avoiding hazards (red markers) as much as possible. A cost is incurred for traversing a hazard, and the objective is to minimize this cost. The distances to the goal and hazards are provided by Lidar measurements, and the observation/state space has 56 dimensions. Specifically, the mobile robot consists of two independently driven parallel wheels and one free-rolling rear wheel, having similar dynamics to a TurtleBot. The environment contains 8 hazard markers scattered around the map and a single goal location. The locations of the hazards, goal and robot are randomized for each episode. Traversing any of the hazards incurs a cost of 1. Due to the map randomization, the optimal policy may not always be able to ensure complete hazard avoidance, and must instead minimize this cost. The robot is equipped with a Lidar that provides 16 measurements each for the distances between: (i) the robot and the goal, and (ii) the robot and the nearest hazard. The task specifications are:
1. Reaching the goal: φg := F(⋁_{i=1}^{16} (d_g^i < 0.1)), where d_g^i is the Lidar's i-th distance measurement to the goal.
2. Maintaining safety: φs := G(cost < 1), where cost is the value incurred when the risk-area Lidar detects that the robot is too close to a hazard. The cost is given by the formula ⋀_{i=1}^{16} (d_l^i > 0), where d_l^i is the risk-Lidar's i-th distance measurement to the nearest hazard.
3. Completing the task within a specific time: φt := G(t < T), where T = 1000 is the maximum episode time.
The RL agent was trained using PPO [105] for 5·10^6 steps over 25 cycles, and the training time was about 20 hours. The evaluations (Figure 5.7) showed a 98% task success rate with a 28% mean cost. Compared to expert reward functions [60] and the state-of-the-art IL method SIM [54], our method was able to achieve identical task success and cost rates with 5x fewer demonstrations and 50% fewer training steps. Furthermore, both specifications φg and φs have a length of 16, indicating that our method is able to effectively accommodate lengthy specifications.
[Figure 5.7: Summary of training and evaluations for the Safe Mobile Navigation task: (a) PGA and weights, (b) evaluation summary.]
Task - Cabinet Closing with Mobile Manipulator This setup consists of a mobile manipulator (Figure 5.2f): a Franka Emika Panda manipulator arm mounted on a Fetch Robotics Freight mobile robot platform. The environment consists of a cabinet with an open drawer and a rectangular risk zone. The task for the mobile manipulator is to close the drawer while minimizing traversal/entry into the risk zone. The robot is controlled via its joint space. The specifications for this task are:
1. Closing the drawer: φg := F(drawer_y < 0.2). The drawer must be closed (as measured along its y-axis) within a 0.2-unit tolerance.
2. Maintaining safety: φs := G(cost < 1), where cost is the value incurred when the risk-area Lidar detects that the robot is too close to a hazard. The cost is given by the formula ⋀_{i=1}^{16} (d_l^i > 0), where d_l^i is the risk-Lidar's i-th distance measurement to the nearest hazard.
3. Completing the task within a specific time: φt := G(t < T), where T = 192 is the episode horizon.
The observation space consists of 57 dimensions, posing a challenge for reward models. This simulation is built on the RL adaptation of Nvidia Isaac Sim [76], which enables parallel (vectorized) training environments. Only 4 of the 5 demonstrations succeeded at the task. The RL agent was trained using PPO on 400 parallel instances for 10^7 steps over 10 cycles. Due to the highly vectorized implementation, the training was completed within 1.5 hours. The resulting policy had a 100% success rate and a 19% mean cost on the test trials (Figure 5.8), similar to the policies from expert-designed complex dense reward functions.
[Figure 5.8: Summary of training and evaluations for the Safe FreightFranka Cabinet Drawer task: (a) evaluation summary, (b) PGA and weights, (c) RL training summary.]
Hyperparameter Settings The hyperparameters for all tasks are shown in Table 5.1.
5.3 Summary
In this chapter, we developed AL-STL, a novel LfD framework that utilizes apprenticeship learning and STL task objectives to infer rewards and policies simultaneously. AL-STL is a significant advancement over the prior LfD-STL framework, introducing closed-loop learning that iteratively improves the quality of rewards and policies. We proposed a graph-based optimization formalism, the performance graph advantage, which
Table 5.1: Hyperparameters for all tasks used to evaluate AL-STL.
Tasks (columns, left to right): Panda-Reach | Needle-Reach | Panda Pick-Place | Panda Door | Safe Navigation | Safe FreightFranka Drawer. Where a row lists fewer values than tasks, the value is shared across adjacent tasks.
# Demos: 5
Reward Model: FCN [200 → 200] | GP (Scale+RBF) | FCN [16 → 16 → 16] | FCN [64 → 64]
RL Model: SAC+HER | TQC+HER | TQC | PPO
Training Timesteps: 2·10^5 | 2.5·10^5 | 10^7 | 5·10^6 | 10^7
# AL-STL Cycles: 5 | 25 | 10
Policy Network: Shared [64 → 64] | Shared [512 → 512 → 512] | Shared [256 → 256] | Exclusive [256 → 256] | Exclusive [512 → 512]
Learning Rate: 3·10^-4 | 10^-3 | 3·10^-4 | 5·10^-4
Discount Factor γ: 0.95 | 0.97 | 0.95 | 0.99
Learning Starts: 100 | 1000 | 100 | - | -
Batch Size: 256 | 2048 | 256 | 2500 | 1024
Polyak Update τ: 0.005 | 0.05 | 0.5 | - | -
# Epochs: - | 15 | 10
# Rollout Buffer: - | 5000 | 76,800
# Envs: - | 5 | 400
PGA λ: 0.9 | 0.3 | 0.5
Training Success Rate: 100% | 98% | 98% | - | -
Test Success Rate: 100% | 96% | 100% | (98%, 29%) | (100%, 19%)
Training Time (hours): - | 10.75 (2.15/cycle) | 6.5 (0.26/cycle) | 20 (0.8/cycle) | 1.5 (0.15/cycle)
(i) provides a succinct representation of multiple non-Markovian (temporal) task specifications for quantitative and interpretable assessments of agent behaviors, and (ii) guides the agent’s learning process to
maximally satisfy the task specifications and perform optimal trade-offs. Through realistic simulation experiments on mobile and manipulation robotic tasks, we have discussed how our approach outperforms
several state-of-the-art methods in terms of sample and space efficiency.
5.4 Bibliographic Notes
Entropy-based IRL approaches that learn rewards from suboptimal demonstrations [123, 122, 41, 111] regard suboptimal demonstrations as noisy deviations from an optimal statistical model. Such methods treat these imperfect demonstrations either as outliers or simply discard them, hence requiring access to many demonstrations. Learning better policies from suboptimal demonstrations has been explored in [20]. This method injects noise into trajectories to infer a ranking; however, it synthetically generates trajectories via BC, which suffers from covariate shift and induces undesirable bias. [23] addresses this by defining a relation between injected noise and performance. However, this noise-performance relationship is empirically derived and lacks formal reasoning. Score-based IRL [10] uses expert-scored trajectories to learn a reward function, relying on a large set of nearly optimal demonstrations and hence generating scores for each of them. Additionally, rewards learned via IRL-based methods are Markovian by nature and typically suited to single-goal tasks, as we have discussed in Chapter 3.
In the area of LfD with temporal logics, the closest to our work is a counterexample-guided approach using probabilistic computation tree logic (PCTL) for safety-aware AL [120]. Our work differs from it in two significant ways: (i) we use STL, which is applicable to continuous spaces and offers timed-interval semantics that are lacking in PCTL, and (ii) the reward inference algorithm in [120] relies on IRL, while ours is based on LfD-STL (Chapter 4), which greatly improves sample complexity, accuracy and inference speed. Trade-offs for multi-objective RL have been explored in [25] by explicitly defining specification priorities beforehand. Alternative approaches convert specifications to their equivalent automata and augment them onto the MDP states [72, 79, 115]. In our work, we do not alter the MDP structure, thereby avoiding the drawbacks of the increased space and computational complexity of augmented MDPs.
Chapter 6
Conclusions and Future Work
This thesis is motivated by the problem of developing data-efficient and robust algorithms for robot learning from demonstrations. It is inspired by the research in neurosymbolic AI that integrates the mechanisms
of neural and symbolic AI/learning methods. In the context of reward inference from demonstrations, we
introduced a novel framework in Chapter 3 that combined human demonstrations and high-level STL specifications to: (i) qualitatively evaluate and rank demonstrations and (ii) infer rewards for an RL agent such
that the learned policy is able to satisfy all specifications. Our framework accounted for the uncertainties
in the environment to define non-Markovian (temporal-based) rewards from suboptimal demonstrations.
Additionally, the framework could also learn and predict rewards in continuous and high-dimensional
spaces.
In Chapter 4, this thesis then proposed mechanisms for inferring demonstrator preferences based on their performance with respect to the STL task specifications. The algorithms were able to capture these performance measures over multiple specifications in terms of an intuitive graphical structure - performance graphs. These graphs also convey the behaviors being learned by the RL agent in an interpretable manner.
Finally, in Chapter 5, we proposed a framework to reason about the optimality of the learned reward functions by using the graphical performance metrics introduced in Chapter 4. Inspired by the literature on apprenticeship learning, we developed an algorithm that utilized the graphical metric to enable the RL agent to learn behaviors that could extrapolate beyond demonstrator performance. Since our framework (i) introduces only a few additional hyperparameters, (ii) can learn from a handful of even imperfect demonstrations, and (iii) facilitates safer and faster learning guided by specifications, it is appropriate for non-expert users and real-world applications such as assistive robots in households, healthcare, warehouses, space exploration rovers, etc.
This dissertation is based on our published manuscripts [93, 94, 95] and the preprint [96].
6.1 Future Work
While this thesis addresses a few critical issues with LfD, there are still many challenges to overcome to
ensure robust and safe deployment of AI-enabled robots. We list some of these areas that the dissertation
has not explored.
Multi-Modal and Diverse Data In our work, the temporal logic specifications are only capable of evaluating time-series signals such as robot proprioception data, and hence do not possess the mechanisms
(i.e., syntax and semantics) to handle visual information. The abundance of visual data, in the form of
videos that can be easily recorded or accessed on the World Wide Web, presents an opportunity to develop
metrics and LfD algorithms that can be applied to such visual inputs. The rapid development in natural
language processing [21] also provides the opportunity to infer high-level task descriptions (temporal logics or symbolic automata) from natural language instructions and demonstrations via interactive learning
[81]. This can be further extended to visual-language models (VLMs) [97, 80] to learn the task from either
visual or language inputs.
Diversity in data [40] is another promising direction that has recently gained popularity in the area of
scenario/data generation. Investigating the generation of diverse demonstrations would facilitate robust
LfD techniques by creating “informative” trajectories, as well as help in determining the solvability of
MDPs under the given task specifications. This would also facilitate sim2real transfer learning via domain
randomization, for evaluating LfD performance on robots interacting in the real world.
Multi-Task Generalization The LfD-STL agent learns behaviors that conform to a specific task, including tasks that consist of multiple relevant objectives expressed in STL. Obviously, a robot trained to
open/close a door might fail at a pick-and-place task and vice-versa. To achieve general intelligence, a
robot must be able to learn and adapt to multiple tasks by utilizing the underlying task structures/skills;
this would alleviate the need to re-train robots from scratch for each new task. Developments in offline
reinforcement learning [70] and meta-learning [39] provide an opportunity to learn temporal logic or
automata-based task structures (e.g., hierarchies for hierarchical RL [14]) that can enable multi-task LfD.
Multi-Agent LfD Teaching multiple robots to perform similar tasks is often time-consuming; instead, robots could learn from a fully trained robot introduced into the multi-robot environment. Such scenarios typically arise in warehouses for package retrieval, in hospitals, and in urban autonomous driving, where robots need to coordinate to find the optimal plan of actions. By observing human interactions, multi-agent LfD would be able to effectively learn policies wherein robots coordinate to achieve a common high-level goal, while also ensuring safe and robust individual performance.
Bibliography
[1] Houssam Abbas, Yash Vardhan Pant, and Rahul Mangharam. “Temporal logic robustness for
general signal classes”. In: Proceedings of the 22nd ACM International Conference on Hybrid
Systems: Computation and Control, HSCC 2019, Montreal, QC, Canada, April 16-18, 2019. Ed. by
Necmiye Ozay and Pavithra Prabhakar. ACM, 2019, pp. 45–56. doi: 10.1145/3302504.3311817.
[2] Pieter Abbeel and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning”.
In: Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff,
Alberta, Canada, July 4-8, 2004. Ed. by Carla E. Brodley. Vol. 69. ACM International Conference
Proceeding Series. ACM, 2004. doi: 10.1145/1015330.1015430.
[3] David Abel, André Barreto, Michael Bowling, Will Dabney, Steven Hansen, Anna Harutyunyan,
Mark K. Ho, Ramana Kumar, Michael L. Littman, Doina Precup, and Satinder Singh. “Expressing
Non-Markov Reward to a Markov Agent”. In: Proceedings of the Conference on Reinforcement
Learning and Decision Making. 2022.
[4] David Abel, Will Dabney, Anna Harutyunyan, Mark K Ho, Michael Littman, Doina Precup, and
Satinder Singh. “On the Expressivity of Markov Reward”. In: Advances in Neural Information
Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and
J. Wortman Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 7799–7812. url:
https://proceedings.neurips.cc/paper/2021/file/4079016d940210b4ae9ae7d41c4a2065-Paper.pdf.
[5] Mohammad Afzal, Sankalp Gambhir, Ashutosh Gupta, Krishna S, Ashutosh Trivedi, and
Alvaro Velasquez. “LTL-Based Non-Markovian Inverse Reinforcement Learning”. In: Proceedings
of the 2023 International Conference on Autonomous Agents and Multiagent Systems. AAMAS ’23.
London, United Kingdom: International Foundation for Autonomous Agents and Multiagent
Systems, 2023, pp. 2857–2859. isbn: 9781450394321.
[6] Takumi Akazaki and Ichiro Hasuo. “Time Robustness in MTL and Expressivity in Hybrid System
Falsification”. In: Computer Aided Verification - 27th International Conference, CAV 2015, San
Francisco, CA, USA, July 18-24, 2015, Proceedings, Part II. Ed. by Daniel Kroening and
Corina S. Pasareanu. Vol. 9207. Lecture Notes in Computer Science. Springer International
Publishing, 2015, pp. 356–374. doi: 10.1007/978-3-319-21668-3_21.
[7] Derya Aksaray, Austin Jones, Zhaodan Kong, Mac Schwager, and Calin Belta. “Q-Learning for
robust satisfaction of signal temporal logic specifications”. In: 55th IEEE Conference on Decision
and Control, CDC 2016, Las Vegas, NV, USA, December 12-14, 2016. IEEE, 2016, pp. 6565–6570. doi:
10.1109/CDC.2016.7799279.
[8] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané.
“Concrete Problems in AI Safety”. In: CoRR abs/1606.06565 (2016). arXiv: 1606.06565. url:
http://arxiv.org/abs/1606.06565.
[9] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder,
Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. “Hindsight Experience
Replay”. In: Advances in Neural Information Processing Systems. Ed. by I. Guyon, U. Von Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates,
Inc., 2017. url:
https://proceedings.neurips.cc/paper_files/paper/2017/file/453fadbd8a1a3af50a9df4df899537b5-
Paper.pdf.
[10] Layla El Asri, Bilal Piot, Matthieu Geist, Romain Laroche, and Olivier Pietquin. “Score-based
Inverse Reinforcement Learning”. In: Proceedings of the 2016 International Conference on
Autonomous Agents & Multiagent Systems, Singapore, May 9-13, 2016. Ed. by Catholijn M. Jonker,
Stacy Marsella, John Thangarajah, and Karl Tuyls. ACM, 2016, pp. 457–465. url:
http://dl.acm.org/citation.cfm?id=2936991.
[11] Christopher G. Atkeson and Stefan Schaal. “Robot Learning From Demonstration”. In: Proceedings
of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee,
USA, July 8-12, 1997. Ed. by Douglas H. Fisher. Morgan Kaufmann, 1997, pp. 12–20.
[12] Anand Balakrishnan and Jyotirmoy V. Deshmukh. “Structured Reward Shaping using Signal
Temporal Logic specifications”. In: 2019 IEEE/RSJ International Conference on Intelligent Robots
and Systems, IROS 2019, Macau, SAR, China, November 3-8, 2019. IEEE, 2019, pp. 3481–3486. doi:
10.1109/IROS40897.2019.8968254.
[13] Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan,
Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy P. Lillicrap. “Distributed Distributional
Deterministic Policy Gradients”. In: 6th International Conference on Learning Representations, ICLR
2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. 2018. url:
https://openreview.net/forum?id=SyZipzbCb.
[14] Andrew G. Barto and Sridhar Mahadevan. “Recent Advances in Hierarchical Reinforcement
Learning”. In: Discrete Event Dynamic Systems 13 (2003), pp. 41–77.
[15] Ezio Bartocci, Jyotirmoy Deshmukh, Alexandre Donzé, Georgios Fainekos, Oded Maler,
Dejan Ničković, and Sriram Sankaranarayanan. “Specification-Based Monitoring of
Cyber-Physical Systems: A Survey on Theory, Tools and Applications”. In: Lectures on Runtime
Verification: Introductory and Advanced Topics. Ed. by Ezio Bartocci and Yliès Falcone. Cham:
Springer International Publishing, 2018, pp. 135–175.
[16] Tarek R. Besold, Artur S. d’Avila Garcez, Sebastian Bader, Howard Bowman, Pedro M. Domingos,
Pascal Hitzler, Kai-Uwe Kühnberger, Luís C. Lamb, Priscila Machado Vieira Lima,
Leo de Penning, Gadi Pinkas, Hoifung Poon, and Gerson Zaverucha. “Neural-Symbolic Learning
and Reasoning: A Survey and Interpretation”. In: Neuro-Symbolic Artificial Intelligence: The State
of the Art. Ed. by Pascal Hitzler and Md. Kamruzzaman Sarker. Vol. 342. Frontiers in Artificial
Intelligence and Applications. IOS Press, 2021, pp. 1–51. doi: 10.3233/FAIA210348.
[17] Christopher M. Bishop. Pattern Recognition and Machine Learning, 5th Edition. Information science
and statistics. Springer, 2007. isbn: 9780387310732. url: https://www.worldcat.org/oclc/71008143.
[18] Erdem Biyik, Dylan P. Losey, Malayandi Palan, Nicholas C. Landolfi, Gleb Shevchuk, and
Dorsa Sadigh. “Learning reward functions from diverse sources of human feedback: Optimally
integrating demonstrations and preferences”. In: Int. J. Robotics Res. 41.1 (2022), pp. 45–67. doi:
10.1177/02783649211041652.
[19] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. “OpenAI Gym”. In: CoRR abs/1606.01540 (2016). arXiv: 1606.01540. url:
http://arxiv.org/abs/1606.01540.
[20] Daniel S. Brown, Wonjoon Goo, and Scott Niekum. “Better-than-Demonstrator Imitation
Learning via Automatically-Ranked Demonstrations”. In: Proceedings of the Conference on Robot
Learning. Ed. by Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura. Vol. 100. Proceedings
of Machine Learning Research. PMLR, Nov. 2020, pp. 330–359. url:
https://proceedings.mlr.press/v100/brown20a.html.
[21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. “Language Models are Few-Shot Learners”. In: Advances in
Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan,
and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64aPaper.pdf.
[22] Alberto Camacho, Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and
Sheila A. McIlraith. “LTL and Beyond: Formal Languages for Reward Function Specification in
Reinforcement Learning”. In: Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence
Organization, July 2019, pp. 6065–6073. doi: 10.24963/ijcai.2019/840.
[23] Letian Chen, Rohan R. Paleja, and Matthew C. Gombolay. “Learning from Suboptimal
Demonstration via Self-Supervised Reward Regression”. In: 4th Conference on Robot Learning,
CoRL 2020, 16-18 November 2020, Virtual Event / Cambridge, MA, USA. Ed. by Jens Kober,
Fabio Ramos, and Claire J. Tomlin. Vol. 155. Proceedings of Machine Learning Research. PMLR,
2020, pp. 1262–1277. url: https://proceedings.mlr.press/v155/chen21b.html.
[24] Min Chen, Stefanos Nikolaidis, Harold Soh, David Hsu, and Siddhartha S. Srinivasa. “Planning
with Trust for Human-Robot Collaboration”. In: Proceedings of the 2018 ACM/IEEE International
Conference on Human-Robot Interaction, HRI 2018, Chicago, IL, USA, March 05-08, 2018. Ed. by
Takayuki Kanda, Selma Sabanovic, Guy Hoffman, and Adriana Tapus. ACM, 2018, pp. 307–315.
doi: 10.1145/3171221.3171264.
[25] Kyunghoon Cho and Songhwai Oh. “Learning-Based Model Predictive Control Under Signal
Temporal Logic Specifications”. In: 2018 IEEE International Conference on Robotics and
Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018. IEEE, 2018, pp. 7322–7329. doi:
10.1109/ICRA.2018.8460811.
[26] Glen Chou, Necmiye Ozay, and Dmitry Berenson. “Explaining Multi-stage Tasks by Learning
Temporal Logic Formulas from Suboptimal Demonstrations”. In: Robotics: Science and Systems
XVI, Virtual Event / Corvalis, Oregon, USA, July 12-16, 2020. Ed. by Marc Toussaint,
Antonio Bicchi, and Tucker Hermans. 2020. doi: 10.15607/RSS.2020.XVI.097.
[27] Alessandro Cimatti, Luca Geatti, Nicola Gigante, Angelo Montanari, and Stefano Tonetta.
“Reactive Synthesis from Extended Bounded Response LTL Specifications”. In: 2020 Formal
Methods in Computer Aided Design, FMCAD 2020, Haifa, Israel, September 21-24, 2020. IEEE, 2020,
pp. 83–92. doi: 10.34727/2020/isbn.978-3-85448-042-6_15.
[28] R. De Maesschalck, D. Jouan-Rimbaud, and D.L. Massart. “The Mahalanobis distance”. In:
Chemometrics and Intelligent Laboratory Systems 50.1 (2000), pp. 1–18. issn: 0169-7439. doi:
https://doi.org/10.1016/S0169-7439(99)00047-7.
[29] Jyotirmoy V. Deshmukh, Alexandre Donzé, Shromona Ghosh, Xiaoqing Jin, Garvit Juniwal, and
Sanjit A. Seshia. “Robust Online Monitoring of Signal Temporal Logic”. In: Formal Methods in
System Design 51.1 (2017), pp. 5–30. issn: 1572-8102. doi: 10.1007/s10703-017-0286-7.
[30] Akshay Dhonthi, Philipp Schillinger, Leonel Dario Rozo, and Daniele Nardi. “Study of Signal
Temporal Logic Robustness Metrics for Robotic Tasks Optimization”. In: CoRR abs/2110.00339
(2021). arXiv: 2110.00339. url: https://arxiv.org/abs/2110.00339.
[31] Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mariano Phielipp. “Goal-conditioned Imitation
Learning”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates,
Inc., 2019. url:
https://proceedings.neurips.cc/paper_files/paper/2019/file/c8d3a760ebab631565f8509d84b3b3f1-
Paper.pdf.
[32] Alexandre Donzé. “Breach, A Toolbox for Verification and Parameter Synthesis of Hybrid
Systems”. In: Computer Aided Verification, 22nd International Conference, CAV 2010, Edinburgh,
UK, July 15-19, 2010. Proceedings. Ed. by Tayssir Touili, Byron Cook, and Paul B. Jackson.
Vol. 6174. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2010, pp. 167–170.
doi: 10.1007/978-3-642-14295-6_17.
[33] Alexandre Donzé, Xiaoqing Jin, Jyotirmoy V. Deshmukh, and Sanjit A. Seshia. “Automotive
Systems Requirement Mining Using Breach”. In: American Control Conference, ACC 2015, Chicago,
IL, USA, July 1-3, 2015. IEEE, July 2015, p. 4097. doi: 10.1109/ACC.2015.7171970.
[34] Alexandre Donzé and Oded Maler. “Robust Satisfaction of Temporal Logic over Real-Valued
Signals”. In: Formal Modeling and Analysis of Timed Systems - 8th International Conference,
FORMATS 2010, Klosterneuburg, Austria, September 8-10, 2010. Proceedings. Ed. by
Krishnendu Chatterjee and Thomas A. Henzinger. Vol. 6246. Lecture Notes in Computer Science.
Springer, 2010, pp. 92–106. doi: 10.1007/978-3-642-15297-9_9.
[35] Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun.
“CARLA: An Open Urban Driving Simulator”. In: 1st Annual Conference on Robot Learning, CoRL
2017, Mountain View, California, USA, November 13-15, 2017, Proceedings. Vol. 78. Proceedings of
Machine Learning Research. PMLR, 2017, pp. 1–16. url:
http://proceedings.mlr.press/v78/dosovitskiy17a.html.
[36] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. “Benchmarking Deep
Reinforcement Learning for Continuous Control”. In: Proceedings of The 33rd International
Conference on Machine Learning. Ed. by Maria Florina Balcan and Kilian Q. Weinberger. Vol. 48.
Proceedings of Machine Learning Research. New York, New York, USA: PMLR, June 2016,
pp. 1329–1338. url: https://proceedings.mlr.press/v48/duan16.html.
[37] Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. “Reward tampering
problems and solutions in reinforcement learning: a causal influence diagram perspective”. In:
Synthese 198.27 (May 2021), pp. 6435–6467. doi: 10.1007/s11229-021-03141-4.
[38] Georgios E. Fainekos and George J. Pappas. “Robustness of temporal logic specifications for
continuous-time signals”. In: Theoretical Computer Science 410.42 (2009), pp. 4262–4291. doi:
10.1016/j.tcs.2009.06.021.
[39] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine
Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning
Research. PMLR, Aug. 2017, pp. 1126–1135. url: https://proceedings.mlr.press/v70/finn17a.html.
[40] Matthew Fontaine and Stefanos Nikolaidis. “Differentiable Quality Diversity”. In: Advances in
Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang,
and J. Wortman Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 10040–10052. url:
https://proceedings.neurips.cc/paper_files/paper/2021/file/532923f11ac97d3e7cb0130315b067dcPaper.pdf.
[41] Justin Fu, Katie Luo, and Sergey Levine. “Learning Robust Rewards with Adversarial Inverse
Reinforcement Learning”. In: 6th International Conference on Learning Representations, ICLR 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. 2018. url:
https://openreview.net/forum?id=rkHywl-A-.
[42] Quentin Gallouédec, Nicolas Cazin, Emmanuel Dellandréa, and Liming Chen. “panda-gym:
Open-Source Goal-Conditioned Environments for Robotic Learning”. In: 4th Robot Learning
Workshop: Self-Supervised and Lifelong Learning at NeurIPS (2021).
[43] Yixin Gao, S. Vedula, C. Reiley, N. Ahmidi, Balakrishnan Varadarajan, Henry C. Lin, L. Tao,
L. Zappella, B. Béjar, D. Yuh, C. C. Chen, R. Vidal, S. Khudanpur, and Gregory Hager. “JHU-ISI
Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human
Motion Modeling”. In: MICCAI workshop: M2CAI. Vol. 3. 2014, p. 3.
[44] Artur d’Avila Garcez and Luís C. Lamb. “Neurosymbolic AI: the 3rd wave”. In: Artificial
Intelligence Review 56.11 (Nov. 2023), pp. 12387–12406. issn: 1573-7462. doi:
10.1007/s10462-023-10448-w.
[45] David Gundana and Hadas Kress-Gazit. “Event-Based Signal Temporal Logic Synthesis for Single
and Multi-Robot Tasks”. In: IEEE Robotics and Automation Letters 6.2 (2021), pp. 3687–3694. doi:
10.1109/LRA.2021.3064220.
[46] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. “Gene Selection for Cancer
Classification using Support Vector Machines”. In: Mach. Learn. 46.1-3 (2002), pp. 389–422. doi:
10.1023/A:1012487302797.
[47] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor”. In: International
Conference on Machine Learning (ICML) (2018).
[48] Iman Haghighi, Noushin Mehdipour, Ezio Bartocci, and Calin Belta. “Control from Signal
Temporal Logic Specifications with Smooth Cumulative Quantitative Semantics”. In: 2019 IEEE
58th Conference on Decision and Control (CDC). 2019, pp. 4361–4366. doi:
10.1109/CDC40024.2019.9029429.
[49] Siddhant Haldar, Vaibhav Mathur, Denis Yarats, and Lerrel Pinto. “Watch and Match:
Supercharging Imitation with Regularized Optimal Transport”. In: CoRL (2022).
[50] R. W. Hamming. “Error detecting and error correcting codes”. In: The Bell System Technical
Journal 29.2 (1950), pp. 147–160. doi: 10.1002/j.1538-7305.1950.tb00463.x.
[51] Hado van Hasselt. “Double Q-learning”. In: Advances in Neural Information Processing Systems 23:
24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting
held 6-9 December 2010, Vancouver, British Columbia, Canada. Ed. by John D. Lafferty,
Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta. Vol. 23.
Curran Associates, Inc., 2010, pp. 2613–2621. url: https:
//proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html.
[52] Jonathan Ho and Stefano Ermon. “Generative Adversarial Imitation Learning”. In: Advances in
Neural Information Processing Systems. Ed. by D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and
R. Garnett. Vol. 29. Curran Associates, Inc., 2016. url:
https://proceedings.neurips.cc/paper_files/paper/2016/file/cc7e2b878868cbae992d1fb743995d8fPaper.pdf.
[53] Mark K. Ho, Michael L. Littman, James MacGlashan, Fiery Cushman, and Joseph L. Austerweil.
“Showing versus doing: Teaching by demonstration”. In: Advances in Neural Information
Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December
5-10, 2016, Barcelona, Spain. Ed. by Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg,
Isabelle Guyon, and Roman Garnett. 2016, pp. 3027–3035. url: https:
//proceedings.neurips.cc/paper/2016/hash/b5488aeff42889188d03c9895255cecc-Abstract.html.
[54] Huy Hoang, Tien Mai, and Pradeep Varakantham. “Imitate the Good and Avoid the Bad: An
Incremental Approach to Safe Reinforcement Learning”. In: AAAI. 2024.
[55] Hana Hoshino, Kei Ota, Asako Kanezaki, and Rio Yokota. “OPIRL: Sample efficient off-policy
inverse reinforcement learning via distribution matching”. In: 2022 International Conference on
Robotics and Automation (ICRA). IEEE. 2022, pp. 448–454.
[56] Tao Huang, Kai Chen, Bin Li, Yun-Hui Liu, and Qi Dou. “Demonstration-guided reinforcement
learning with efficient exploration for task automation of surgical robot”. In: Proc. IEEE Int. Conf.
Robot. Automat. 2023.
[57] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. “Imitation Learning:
A Survey of Learning Methods”. In: ACM Comput. Surv. 50.2 (Apr. 2017).
[58] Craig Innes and Subramanian Ramamoorthy. “Elaborating on Learned Demonstrations with
Temporal Logic Specifications”. In: Robotics: Science and Systems XVI, Virtual Event / Corvalis,
Oregon, USA, July 12-16, 2020. Ed. by Marc Toussaint, Antonio Bicchi, and Tucker Hermans. 2020.
doi: 10.15607/RSS.2020.XVI.004.
[59] Stefan Jaksic, Ezio Bartocci, Radu Grosu, Thang Nguyen, and Dejan Nickovic. “Quantitative
monitoring of STL with edit distance”. In: Formal Methods in System Design 53.1 (Aug. 2018),
pp. 83–112. issn: 1572-8102. doi: 10.1007/s10703-018-0319-x.
[60] Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng,
Yifan Zhong, Josef Dai, and Yaodong Yang. “Safety Gymnasium: A Unified Safe Reinforcement
Learning Benchmark”. In: NeurIPS Datasets and Benchmarks Track. 2023.
[61] Yuqian Jiang, Suda Bharadwaj, Bo Wu, Rishi Shah, Ufuk Topcu, and Peter Stone.
“Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks”. In:
Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on
Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational
Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021. AAAI Press, 2021,
pp. 7995–8003. url: https://ojs.aaai.org/index.php/AAAI/article/view/16975.
[62] Z. Juozapaitis, A. Koul, A. Fern, M. Erwig, and F. Doshi-Velez. “Explainable Reinforcement
Learning via Reward Decomposition”. In: In proceedings at the International Joint Conference on
Artificial Intelligence. A Workshop on Explainable Artificial Intelligence. 2019.
[63] D. Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. isbn: 9780374275631. url:
https://books.google.com/books?id=SHvzzuCnuv8C.
[64] Daniel Kasenberg and Matthias Scheutz. “Interpretable apprenticeship learning with temporal
logic specifications”. In: 56th IEEE Annual Conference on Decision and Control, CDC 2017,
Melbourne, Australia, December 12-15, 2017. IEEE, 2017, pp. 4914–4921. doi:
10.1109/CDC.2017.8264386.
[65] Peter Kazanzides, Zihan Chen, Anton Deguet, Gregory S. Fischer, Russell H. Taylor, and
Simon P. DiMaio. “An Open-Source Research Kit for the da Vinci Surgical System”. In: IEEE Intl.
Conf. on Robotics and Auto. (ICRA). Hong Kong, China, June 1, 2014, pp. 6434–6439.
[66] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. url:
http://arxiv.org/abs/1412.6980.
[67] Hadas Kress-Gazit, Kerstin Eder, Guy Hoffman, Henny Admoni, Brenna Argall, Rüdiger Ehlers,
Christoffer Heckman, Nils Jansen, Ross A. Knepper, Jan Křetínský, Shelly Levy-Tzedek, Jamy Li,
Todd D. Murphey, Laurel D. Riek, and Dorsa Sadigh. “Formalizing and Guaranteeing
Human-Robot Interaction”. In: Commun. ACM 64.9 (2021), pp. 78–84. issn: 0001-0782. doi:
10.1145/3433637.
[68] Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. “Controlling
Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics”. In:
Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and
Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, July 2020,
pp. 5556–5566. url: https://proceedings.mlr.press/v119/kuznetsov20a.html.
[69] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq,
Laurent Orseau, and Shane Legg. “AI Safety Gridworlds”. In: CoRR abs/1711.09883 (2017). arXiv:
1711.09883. url: http://arxiv.org/abs/1711.09883.
[70] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning:
Tutorial, Review, and Perspectives on Open Problems. 2020. arXiv: 2005.01643 [cs.LG].
[71] Xiao Li, Yao Ma, and Calin Belta. “A Policy Search Method For Temporal Logic Specified
Reinforcement Learning Tasks”. In: 2018 Annual American Control Conference, ACC 2018,
Milwaukee, WI, USA, June 27-29, 2018. IEEE, 2018, pp. 240–245. doi: 10.23919/ACC.2018.8431181.
[72] Xiao Li, Yao Ma, and Calin Belta. “Automata Guided Reinforcement Learning With
Demonstrations”. In: CoRR abs/1809.06305 (2018). arXiv: 1809.06305. url:
http://arxiv.org/abs/1809.06305.
[73] Xiao Li, Cristian Ioan Vasile, and Calin Belta. “Reinforcement learning with temporal logic
rewards”. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2017,
Vancouver, BC, Canada, September 24-28, 2017. IEEE, 2017, pp. 3834–3839. doi:
10.1109/IROS.2017.8206234.
[74] Matteo Lucchi, Friedemann Zindler, Stephan Mühlbacher-Karrer, and Horst Pichler. “robo-gym -
An Open Source Toolkit for Distributed Deep Reinforcement Learning on Real and Simulated
Robots”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las
Vegas, NV, USA, October 24, 2020 - January 24, 2021. IEEE, 2020, pp. 5364–5371. doi:
10.1109/IROS45743.2020.9340956.
[75] Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. “Explainable Reinforcement
Learning through a Causal Lens”. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence,
AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI
2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020,
New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 2493–2500. url:
https://ojs.aaai.org/index.php/AAAI/article/view/5631.
[76] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey,
Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State.
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning. 2021.
[77] Oded Maler and Dejan Nickovic. “Monitoring Temporal Properties of Continuous Signals”. In:
Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, Joint International
Conferences on Formal Modelling and Analysis of Timed Systems, FORMATS 2004 and Formal
Techniques in Real-Time and Fault-Tolerant Systems, FTRTFT 2004, Grenoble, France, September
22-24, 2004, Proceedings. Ed. by Yassine Lakhnech and Sergio Yovine. Vol. 3253. Lecture Notes in
Computer Science. Berlin, Heidelberg: Springer, 2004, pp. 152–166. isbn: 978-3-540-30206-3. doi:
10.1007/978-3-540-30206-3\_12.
[78] Gary Marcus. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. 2020. arXiv:
2002.06177 [cs.AI].
[79] Farzan Memarian, Zhe Xu, Bo Wu, Min Wen, and Ufuk Topcu. “Active Task-Inference-Guided
Deep Inverse Reinforcement Learning”. In: 59th IEEE Conference on Decision and Control, CDC
2020, Jeju Island, South Korea, December 14-18, 2020. IEEE, 2020, pp. 1932–1938. doi:
10.1109/CDC42340.2020.9304190.
[80] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and
Josef Sivic. “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million
Narrated Video Clips”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV). Oct. 2019.
[81] Sara Mohammadinejad, Jesse Thomason, and Jyotirmoy V. Deshmukh. Interactive Learning from
Natural Language and Demonstrations using Signal Temporal Logic. 2022. arXiv: 2207.00627
[cs.FL].
[82] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive computation and
machine learning series. MIT Press, 2012. isbn: 0262018020.
114
[83] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel.
“Overcoming Exploration in Reinforcement Learning with Demonstrations”. In: 2018 IEEE
International Conference on Robotics and Automation (ICRA). Brisbane, Australia: IEEE Press, 2018,
pp. 6292–6299. doi: 10.1109/ICRA.2018.8463162.
[84] Engineering National Academies of Sciences and Medicine. Human-AI Teaming: State-of-the-Art
and Research Needs. Washington, DC: The National Academies Press, 2022. isbn:
978-0-309-27017-5. doi: 10.17226/26355.
[85] Andrew Y. Ng, Daishi Harada, and Stuart Russell. “Policy Invariance Under Reward
Transformations: Theory and Application to Reward Shaping”. In: Proceedings of the Sixteenth
International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999. Ed. by
Ivan Bratko and Saso Dzeroski. Morgan Kaufmann, 1999, pp. 278–287.
[86] Andrew Y. Ng and Stuart Russell. “Algorithms for Inverse Reinforcement Learning”. In:
Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford
University, Stanford, CA, USA, June 29 - July 2, 2000. Ed. by Pat Langley. Morgan Kaufmann, 2000,
pp. 663–670.
[87] Dejan Nickovic and Tomoya Yamaguchi. “RTAMT: Online Robustness Monitors from STL”. In:
Automated Technology for Verification and Analysis - 18th International Symposium, ATVA 2020,
Hanoi, Vietnam, October 19-23, 2020, Proceedings. Ed. by Dang Van Hung and Oleg Sokolsky.
Vol. 12302. Lecture Notes in Computer Science. Springer, 2020, pp. 564–571. doi:
10.1007/978-3-030-59152-6\_34.
[88] Geoff Norman. “Likert scales, levels of measurement and the “laws” of statistics”. In: Advances in
Health Sciences Education 15 (2010), pp. 625–632.
[89] Masahiro Ono, Brian C. Williams, and Lars Blackmore. “Probabilistic Planning for Continuous
Dynamic Systems under Bounded Risk”. In: J. Artif. Intell. Res. 46 (2013), pp. 511–577. doi:
10.1613/jair.3893.
[90] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and
Jan Peters. “An Algorithmic Perspective on Imitation Learning”. In: Foundations and Trends in
Robotics 7.1-2 (2018), pp. 1–179. doi: 10.1561/2300000053.
[91] Rohan R. Paleja, Muyleng Ghuy, Nadun Ranawaka Arachchige, Reed Jensen, and
Matthew C. Gombolay. “The Utility of Explainable AI in Ad Hoc Human-Machine Teaming”. In:
Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information
Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual. Ed. by Marc’Aurelio Ranzato,
Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan. 2021,
pp. 610–623. url: https:
//proceedings.neurips.cc/paper/2021/hash/05d74c48b5b30514d8e9bd60320fc8f6-Abstract.html.
[92] Silviu Pitis, Duncan Bailey, and Jimmy Ba. “Rational Multi-Objective Agents Must Admit
Non-Markov Reward Representations”. In: NeurIPS ML Safety Workshop. 2022. url:
https://openreview.net/forum?id=MNwA4sgzR4W.
115
[93] Aniruddh Puranic, Jyotirmoy Deshmukh, and Stefanos Nikolaidis. “Learning from
Demonstrations using Signal Temporal Logic”. In: Proceedings of the 2020 Conference on Robot
Learning. Vol. 155. Proceedings of Machine Learning Research. PMLR, 2021, pp. 2228–2242.
[94] Aniruddh G. Puranic, Jyotirmoy V. Deshmukh, and Stefanos Nikolaidis. “Learning From
Demonstrations Using Signal Temporal Logic in Stochastic and Continuous Domains”. In: IEEE
Robotics and Automation Letters 6.4 (2021), pp. 6250–6257. doi: 10.1109/LRA.2021.3092676.
[95] Aniruddh G. Puranic, Jyotirmoy V. Deshmukh, and Stefanos Nikolaidis. “Learning Performance
Graphs From Demonstrations via Task-Based Evaluations”. In: IEEE Robotics and Automation
Letters 8.1 (2023), pp. 336–343. doi: 10.1109/LRA.2022.3226072.
[96] Aniruddh G. Puranic, Jyotirmoy V. Deshmukh, and Stefanos Nikolaidis. Signal Temporal
Logic-Guided Apprenticeship Learning. 2023. arXiv: 2311.05084 [cs.RO].
[97] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.
“Learning Transferable Visual Models From Natural Language Supervision”. In: Proceedings of the
38th International Conference on Machine Learning. Ed. by Marina Meila and Tong Zhang. Vol. 139.
Proceedings of Machine Learning Research. PMLR, July 2021, pp. 8748–8763. url:
https://proceedings.mlr.press/v139/radford21a.html.
[98] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and
Noah Dormann. “Stable-Baselines3: Reliable Reinforcement Learning Implementations”. In:
Journal of Machine Learning Research 22.268 (2021), pp. 1–8. url:
http://jmlr.org/papers/v22/20-1364.html.
[99] Harish Ravichandar, Athanasios S. Polydoros, Sonia Chernova, and Aude Billard. “Recent
Advances in Robot Learning from Demonstration”. In: Annual Review of Control, Robotics, and
Autonomous Systems 3.1 (2020), pp. 297–330.
[100] Alëna Rodionova, Ezio Bartocci, Dejan Nickovic, and Radu Grosu. “Temporal Logic as Filtering”.
In: Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control,
HSCC 2016, Vienna, Austria, April 12-14, 2016. Ed. by Alessandro Abate and Georgios Fainekos.
ACM, 2016, pp. 11–20. doi: 10.1145/2883817.2883839.
[101] Stephane Ross, Geoffrey Gordon, and Drew Bagnell. “A Reduction of Imitation Learning and
Structured Prediction to No-Regret Online Learning”. In: Proceedings of the Fourteenth
International Conference on Artificial Intelligence and Statistics. Ed. by Geoffrey Gordon,
David Dunson, and Miroslav Dudík. Vol. 15. Proceedings of Machine Learning Research. Fort
Lauderdale, FL, USA: PMLR, Apr. 2011, pp. 627–635. url:
https://proceedings.mlr.press/v15/ross11a.html.
[102] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (4th Edition). Pearson,
2020. isbn: 9780134610993. url: http://aima.cs.berkeley.edu/.
[103] Lindsay Sanneman and Julie A. Shah. “An Empirical Study of Reward Explanations With
Human-Robot Interaction Applications”. In: IEEE Robotics and Automation Letters 7.4 (2022),
pp. 8956–8963. doi: 10.1109/LRA.2022.3189441.
116
[104] Stefan Schaal. “Learning from Demonstration”. In: Advances in Neural Information Processing
Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996. Ed. by Michael Mozer, Michael I. Jordan,
and Thomas Petsche. MIT Press, 1996, pp. 1040–1046. url:
http://papers.nips.cc/paper/1224-learning-from-demonstration.
[105] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy
Optimization Algorithms”. In: CoRR abs/1707.06347 (2017). arXiv: 1707.06347. url:
http://arxiv.org/abs/1707.06347.
[106] Ankit Shah, Pritish Kamath, Julie A. Shah, and Shen Li. “Bayesian Inference of Temporal Task
Specifications from Demonstrations”. In: Advances in Neural Information Processing Systems 31:
Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8,
2018, Montréal, Canada. Ed. by Samy Bengio, Hanna M. Wallach, Hugo Larochelle,
Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett. NIPS’18. Montréal, Canada: Curran
Associates Inc., 2018, pp. 3808–3817. url: https:
//proceedings.neurips.cc/paper/2018/hash/13168e6a2e6c84b4b7de9390c0ef5ec5-Abstract.html.
[107] Simone Silvetti, Laura Nenzi, Ezio Bartocci, and Luca Bortolussi. “Signal Convolution Logic”. In:
Automated Technology for Verification and Analysis - 16th International Symposium, ATVA 2018,
Los Angeles, CA, USA, October 7-10, 2018, Proceedings. Ed. by Shuvendu K. Lahiri and Chao Wang.
Vol. 11138. Lecture Notes in Computer Science. Springer International Publishing, 2018,
pp. 267–283. doi: 10.1007/978-3-030-01090-4\_16.
[108] Halit Bener Suay, Tim Brys, Matthew E. Taylor, and Sonia Chernova. “Learning from
Demonstration for Shaping through Inverse Reinforcement Learning”. In: Proceedings of the 2016
International Conference on Autonomous Agents & Multiagent Systems, Singapore, May 9-13, 2016.
Ed. by Catholijn M. Jonker, Stacy Marsella, John Thangarajah, and Karl Tuyls. ACM, 2016,
pp. 429–437. url: http://dl.acm.org/citation.cfm?id=2936988.
[109] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second.
Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2018.
[110] Faraz Torabi, Garrett Warnell, and Peter Stone. “Behavioral Cloning from Observation”. In:
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI
2018, July 13-19, 2018, Stockholm, Sweden. Ed. by Jérôme Lang. ijcai.org, 2018, pp. 4950–4957. doi:
10.24963/ijcai.2018/687.
[111] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative Adversarial Imitation from Observation.
2019. arXiv: 1807.06158 [cs.LG].
[112] Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. “Reward
Machines: Exploiting Reward Function Structure in Reinforcement Learning”. In: J. Artif. Int. Res.
73 (May 2022). issn: 1076-9757. doi: 10.1613/jair.1.12440.
117
[113] Marcell Vazquez-Chanlatte and Sanjit A. Seshia. “Maximum Causal Entropy Specification
Inference from Demonstrations”. In: Computer Aided Verification - 32nd International Conference,
CAV 2020, Los Angeles, CA, USA, July 21-24, 2020, Proceedings, Part II. Ed. by Shuvendu K. Lahiri
and Chao Wang. Vol. 12225. Lecture Notes in Computer Science. Springer, 2020, pp. 255–278. doi:
10.1007/978-3-030-53291-8\_15.
[114] Min Wen, Rüdiger Ehlers, and Ufuk Topcu. “Correct-by-synthesis reinforcement learning with
temporal logic constraints”. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and
Systems, IROS 2015, Hamburg, Germany, September 28 - October 2, 2015. IEEE, 2015, pp. 4983–4990.
doi: 10.1109/IROS.2015.7354078.
[115] Min Wen, Ivan Papusha, and Ufuk Topcu. “Learning from Demonstrations with High-Level Side
Information”. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial
Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. Ed. by Carles Sierra. ijcai.org,
2017, pp. 3055–3061. doi: 10.24963/ijcai.2017/426.
[116] Albert Wilcox, Ashwin Balakrishna, Jules Dedieu, Wyame Benslimane, Daniel Brown, and
Ken Goldberg. “Monte carlo augmented actor-critic for sparse reward deep reinforcement
learning from suboptimal demonstrations”. In: Advances in Neural Information Processing Systems
35 (2022), pp. 2254–2267.
[117] Jiaqi Xu, Bin Li, Bo Lu, Yun-Hui Liu, Qi Dou, and Pheng-Ann Heng. “SurRoL: An Open-source
Reinforcement Learning Centered and dVRK Compatible Platform for Surgical Robot Learning”.
In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE. 2021.
[118] Tomoya Yamaguchi, Tomoyuki Kaga, Shunsuke Kobuna, James Kapinski, Xiaoqing Jin,
Jyotirmoy Deshmukh, Hisahiro Ito, Alexandre Donze, and Sanjit Seshia. “ST-Lib: A Library for
Specifying and Classifying Model Behaviors”. In: SAE 2016 World Congress and Exhibition.
2016-01-0621. Warrendale, PA: SAE International, Apr. 2016. doi:
https://doi.org/10.4271/2016-01-0621.
[119] Bichen Zheng, Sang Won Yoon, and Sarah S. Lam. “Breast cancer diagnosis based on feature
extraction using a hybrid of K-means and support vector machine algorithms”. In: Expert Systems
with Applications 41.4 (2014), pp. 1476–1482. doi: 10.1016/j.eswa.2013.08.044.
[120] Weichao Zhou and Wenchao Li. “Safety-Aware Apprenticeship Learning”. In: Computer Aided
Verification - 30th International Conference, CAV 2018, Held as Part of the Federated Logic
Conference, FloC 2018, Oxford, UK, July 14-17, 2018, Proceedings, Part I. Ed. by Hana Chockler and
Georg Weissenbacher. Vol. 10981. Lecture Notes in Computer Science. Springer, 2018,
pp. 662–680. doi: 10.1007/978-3-319-96145-3\_38.
[121] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi,
Soroush Nasiriany, and Yifeng Zhu. “robosuite: A Modular Simulation Framework and
Benchmark for Robot Learning”. In: arXiv preprint arXiv:2009.12293. 2020.
[122] Brian D. Ziebart. “Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal
Entropy”. PhD thesis. USA: Carnegie Mellon University, USA, 2010. isbn: 9781124414218. doi:
10.1184/r1/6720692.v1.
118
[123] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. “Maximum Entropy
Inverse Reinforcement Learning”. In: Proceedings of the Twenty-Third AAAI Conference on
Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008. Ed. by Dieter Fox and
Carla P. Gomes. AAAI Press, 2008, pp. 1433–1438. url:
http://www.aaai.org/Library/AAAI/2008/aaai08-227.php.
119
Abstract
Learning-from-demonstrations (LfD) is a popular paradigm for obtaining effective robot control policies for complex tasks via reinforcement learning (RL) without the need to explicitly design reward functions. However, it is susceptible to imperfections in the demonstrations and raises concerns about the safety and interpretability of the learned control policies. To address these issues, this thesis develops a neurosymbolic learning framework: a hybrid method that integrates neural network-based learning with symbolic (e.g., rule, logic, graph) reasoning to leverage the strengths of both approaches. Specifically, the framework uses Signal Temporal Logic (STL) to express high-level robotic tasks and uses its quantitative semantics to evaluate and rank the quality of demonstrations. Temporal logic-based specifications enable the creation of non-Markovian rewards and can define causal dependencies between tasks, such as sequential task specifications. The dissertation first presents the LfD-STL framework, which learns from even suboptimal or imperfect demonstrations together with STL specifications to infer reward functions; these reward functions can then be used by reinforcement learning algorithms to obtain control policies. Experimental evaluations on several diverse environments show that the additional information provided by formally specified task objectives allows the framework to outperform prior state-of-the-art LfD methods.
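To make the role of STL's quantitative semantics concrete, the following is a minimal, self-contained sketch of how demonstrations could be ranked by robustness for a simple reach-avoid specification. It is an illustrative example only, not the LfD-STL implementation described in the thesis; the goal location, obstacle, thresholds, trajectories, and function names are all hypothetical.

```python
# Minimal sketch (assumes 2-D trajectories and a reach-avoid task); illustrates
# ranking demonstrations by STL robustness, not the thesis' actual implementation.
import numpy as np

GOAL = np.array([5.0, 5.0])        # hypothetical goal location
OBSTACLE = np.array([2.5, 2.5])    # hypothetical obstacle location
GOAL_TOL, OBS_RADIUS = 0.5, 1.0    # hypothetical thresholds

def rho_eventually_reach(traj):
    """Robustness of F (||x - goal|| <= GOAL_TOL): max over time of the margin."""
    d = np.linalg.norm(traj - GOAL, axis=1)
    return np.max(GOAL_TOL - d)

def rho_always_avoid(traj):
    """Robustness of G (||x - obstacle|| >= OBS_RADIUS): min over time of the margin."""
    d = np.linalg.norm(traj - OBSTACLE, axis=1)
    return np.min(d - OBS_RADIUS)

def rho_spec(traj):
    """Conjunction of the two requirements: the minimum of the individual robustness values."""
    return min(rho_eventually_reach(traj), rho_always_avoid(traj))

# Two hypothetical demonstrations: one cuts close to the obstacle, one detours safely.
demos = {
    "demo_risky": np.array([[0.0, 0.0], [2.0, 2.0], [3.0, 3.0], [5.0, 5.0]]),
    "demo_safe":  np.array([[0.0, 0.0], [0.5, 3.0], [2.0, 5.0], [5.0, 5.0]]),
}

# Rank demonstrations from best to worst; higher robustness means stronger
# satisfaction of the specification, negative values indicate a violation.
ranking = sorted(demos, key=lambda name: rho_spec(demos[name]), reverse=True)
for name in ranking:
    print(f"{name}: robustness = {rho_spec(demos[name]):.3f}")
```

In this toy setting the safe detour receives a positive robustness score while the risky shortcut is penalized for entering the obstacle region, which is the kind of signal a reward-inference step could build on.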
Many real-world robotic tasks consist of multiple objectives (specifications), some of which may be inherently competitive, thus requiring deliberate trade-offs. The dissertation then extends the LfD-STL framework by developing a metric, the performance graph: a directed graph that uses the quality of demonstrations to provide intuitive explanations of the performance and trade-offs of the demonstrated behaviors. The performance graph also offers concise insights into the learning process of the RL agent, thereby enhancing interpretability, as corroborated by a user study. Finally, the thesis discusses how performance graphs can be used as optimization objectives to guide RL agents, via apprenticeship learning (AL), toward policies that potentially perform better than the (imperfect) demonstrators. The theoretical machinery developed for the AL-STL framework examines guarantees on the safety and performance of RL agents.
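As a rough illustration of what a performance-graph-like structure looks like, the sketch below builds a directed graph over task specifications in which an edge records that the demonstrations satisfy one specification more robustly than another. The construction rule, specification names, and robustness scores are simplified assumptions for exposition and do not reproduce the algorithm developed in the thesis.

```python
# Minimal sketch of a performance-graph-style structure: a directed graph over
# specifications where an edge (a -> b) means the demonstrations, on average,
# satisfy specification a more robustly than specification b. All names and
# scores below are hypothetical.
from collections import defaultdict

# Hypothetical mean robustness of the demonstration set on each specification.
mean_robustness = {
    "reach_goal":     0.42,
    "avoid_obstacle": 0.10,
    "stay_in_lane":  -0.05,   # violated on average, i.e. likely traded off
}

def build_performance_graph(scores):
    """Add an edge a -> b whenever spec a is satisfied more robustly than spec b."""
    graph = defaultdict(list)
    for a, rho_a in scores.items():
        for b, rho_b in scores.items():
            if a != b and rho_a > rho_b:
                graph[a].append(b)
    return dict(graph)

graph = build_performance_graph(mean_robustness)
for spec, dominated in graph.items():
    print(f"{spec} -> {dominated}")
```

Such a graph makes the demonstrators' implicit prioritization of objectives explicit, which is the kind of structure the thesis uses both for explanation and as a guide for apprenticeship learning.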
Asset Metadata
Creator: Puranic, Aniruddh Gopinath (author)
Core Title: Sample-efficient and robust neurosymbolic learning from demonstrations
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2024-05
Publication Date: 04/10/2024
Defense Date: 03/26/2024
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tags: artificial intelligence, formal methods, learning from demonstrations, neurosymbolic AI, reinforcement learning, robotics, temporal logic
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Deshmukh, Jyotirmoy (committee chair), Nikolaidis, Stefanos (committee member), Sukhatme, Gaurav (committee member), Tu, Stephen (committee member)
Creator Email: andyruddh@gmail.com, puranic@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113871308
Unique Identifier: UC113871308
Identifier: etd-PuranicAni-12790.pdf (filename); Legacy Identifier: etd-PuranicAni-12790
Document Type: Dissertation
Rights: Puranic, Aniruddh Gopinath