VERIFICATION, LEARNING AND CONTROL IN CYBER-PHYSICAL SYSTEMS

by

Panagiotis Kyriakis

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2022

Copyright 2022 Panagiotis Kyriakis

Acknowledgements

This thesis culminates an effort of over a decade to conquer knowledge, continuously learn and become an independent researcher. This would not have been possible without the continual support of all of the following. First and foremost, I would like to thank my advisor Prof. Paul Bogdan who, through his intellectual curiosity, passion and radical research ideas, made it possible for me to pursue and materialize all of the projects presented in this thesis. Also, I would like to thank my colleagues and fellow Ph.D. students from the Cyber-Physical Systems group (especially Ridha) for being there to provide support and guidance for any technical obstacle arising in our projects. Additionally, I would like to thank my friends from the Hellenic Student Association of USC, who made it possible for me to retain my much-needed cultural identity. Also, I would like to thank my parents and my brother for always being there for me, for patiently waiting and eagerly looking forward to the next time I visit them, and for making each of my returns back home an unforgettable experience. Finally, I would like to thank Eleni for her continuous and unconditional support, especially at times when I showed an unnatural fixation on work, and for many other reasons; too many to list.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Challenges and Motivation
  1.2 Thesis Contributions
  1.3 Thesis Outline
2 Verification under Uncertainty
  2.1 Introduction
  2.2 Signals, Systems and Logic
  2.3 Controller Design via Robustness
  2.4 Specification Mining for StTL
  2.5 Experimental Results
    2.5.1 Evaluation of the Specification Mining Algorithm
    2.5.2 MPC-based Adaptive Cruise Controller
  2.6 Related Work and Conclusions
3 Learning Topological Features
  3.1 Introduction
  3.2 Persistent Homology
  3.3 Persistent Poincare Representations
  3.4 Case Studies and Results
    3.4.1 Graph Classification
    3.4.2 Image Classification
  3.5 Summary
4 Learning Hierarchical Features
  4.1 Introduction
  4.2 Visual Information Processing
  4.3 Method
    4.3.1 Hierarchical Variational Autoencoders
    4.3.2 Neural Decoder
    4.3.3 Model Training
  4.4 Experimental Results
  4.5 Conclusion
5 Safe Control under Uncertainty
  5.1 Introduction
  5.2 Preliminaries
  5.3 StTL-Constrained Maximum Causal Entropy
  5.4 Cut and Generate Method
  5.5 Experimental Evaluation
  5.6 Summary
6 Control of Complex Networks
  6.1 Fundamentals of Complex Network Control
  6.2 Network Model and Problem Statement
  6.3 Minimum Energy State Transfer
  6.4 Efficient Actuator Selection
  6.5 Case Studies
  6.6 Summary
7 Control under Multiple Objectives
  7.1 Introduction
  7.2 Fundamentals of Reinforcement Learning
    7.2.1 Markov Decision Processes
    7.2.2 Multi-Objective Markov Decision Processes
  7.3 Learning with Unknown Preferences
  7.4 Pareto Policy Adaptation
  7.5 Experimental Evaluation
  7.6 Summary and Outlook
8 Summary and Future Research
Bibliography
Appendices
  A Verification and Design under Uncertainty
    A.1 Timed Traces and Master Equation
  B Hyperbolic Representation Learning
    B.1 Homology
    B.2 Riemannian Manifolds
    B.3 Poincare Ball
    B.4 Theoretical Result
    B.5 Implementation Details
  C Multi-Objective Reinforcement Learning
    C.1 Theoretical Result
    C.2 Pareto Stationarity: 2-dimensional case
    C.3 Experimental Details

List of Tables

3.1 Classification accuracy (mean±std or min-max range, if available).
3.2 Classification accuracy.
4.1 The n-way correct identification rate (n = 2, 5, 10) for all ablated variants using the Pearson Correlation Coefficient (PCC) and the Structural Similarity Index Measure (SSIM) as a selection criterion. We report the mean across subjects. The inter-subject standard deviation was in the range of 0.02-0.05. The chance levels are 0.5, 0.25, 0.10, respectively.
5.1 Single-Goal GridWorld.
5.2 Multi-Goal GridWorld.
7.1 Comparison of MORL agents using the Pareto Dominance (PD) and Average Utility (UT). Possible values for each metric lie in the range [0, 1] and [-1, 1], respectively. In both cases, higher values indicate better performance. We evaluate the agents using 10 random seeds and report the first and second order statistics.

List of Figures

2.1 Overview of Research Problems and Solutions.
2.2 Distance to Boundary and Predicate Robustness.
2.3 Comparison of Algorithm 2 against the grid method. Red denotes the satisfaction boundary.
2.4 Block Diagram of the MPC-based ACC System.
2.5 (a) Mined parameters for a conservative (K = 0.1), balanced (K = 0.5) and aggressive (K = 0.9) controller. (b) Satisfaction and robustness signals for Eq. 2.23 (c1 = 60 m, c2 = 30 m/s). The satisfaction signals overlap, and hence only one of them is visible.
2.6 Controller design using PSO to maximize the robustness of Eq. (2.23) for different values of c1, c2.
3.1 Illustration of our method: initially, the points are transferred via the auxiliary transformation ρ and the parameterization φ to the Poincare ball B, where learnable parameters θ are added. Then, the logarithmic map is used for transforming the points to the tangent space T_x0 B. Finally, the resulting vectors are added and transformed back to the manifold via the exponential map. Note that the persistence diagram is mapped to a single point on the Poincare ball (i.e., Φ(D, θ) ∈ B).
3.2 Left: example graph from the IMDB-BINARY dataset. Middle: persistence diagrams extracted using the Vietoris-Rips filtration. The dashed line denotes features of infinite persistence, which are represented by points of maximal death value equal to 90 (i.e., by points of finite persistence). Right: equivalent representation on the 2-dimensional Poincare ball. Features of infinite persistence are mapped infinitesimally close to the boundary. Therefore, their distance to finite persistence features approaches infinity.
3.3 Train loss (left) and validation accuracy (middle) over 10 training epochs for the MNIST dataset, using two different dimensions for the Poincare ball (m = 3 and m = 9).
4.1 Brain Encoding and Decoding.
4.2 Two-stream hypothesis of visual information processing in the human brain.
4.3 Outline of our method: a) We pretrain a Hierarchical Variational Autoencoder on a large set of natural images. Two layers of latent variables z1, z2 are inserted after each encoder (EB) and decoder (DB) block. b) We train the Neural Decoder by discarding the encoder from the previous step and learning a map from the fMRI voxels to the hierarchical latent space. The lower visual cortex (V1, V2, V3) is mapped to z1 and the higher visual cortex (FFA, PPA, LOC) to z2.
4.4 Qualitative comparison with other methods.
4.5 Correct identification ratio for different methods.
4.6 Qualitative comparison for different pathways.
4.7 Learning curves for CIR_n^M, n = 2, 5, 10 and M ∈ {PCC, SSIM} across all subjects. The horizontal axis is the number of epochs. Subject 3 marginally outperforms the other subjects and Subject 1 gives the worst performance (figure best viewed in color).
5.1 Hyperplane approximation for the formula ϕ = μ1 ∨ μ2. The feasible region of ϕ is D1 ∪ D2 and is approximated by the union of the two hyperplanes. The union is non-convex and a binary variable is introduced that controls which constraint is active.
5.2 Recovered optimal stochastic policies for L1-L4. The agent starts from (0, 0) and receives a reward equal to 1 when it reaches (4, 4). The color-map of each triangle in the corresponding rectangle denotes the probability of taking the underlying action.
5.3 The multi-goal gridworld environment along with the policy recovered under specification ϕ3 ∧ ϕ4.
5.4 Empirical complexity estimation.
6.1 (a) Erdős–Rényi (p = 0.4) network of n = 30 nodes. Smaller nodes have smaller AP cost. Algorithm 5 selected 12 nodes to be actuated (shown in red). (b) Barabási–Albert (m0 = 5, m = 4) network of n = 30 nodes. Smaller nodes have smaller AP cost. Algorithm 5 selected 14 nodes to be actuated (shown in red).
7.1 True and identified Pareto front for the Deep Sea Treasure environment.
7.2 (a) The Convex Coverage Set (CCS) for a 2-dimensional value function. Algorithm 6 identifies arbitrary solutions on the Pareto front. (b) Illustration of the PPA loss function for a single preference vector ω0: by minimizing the distance ∥ω*(θ) − ω0∥, the solution converges to the optimal solution corresponding to preference ω0. (c) PPA loss function for two preference vectors ω0, ω1: the minimizer of ∥ω*(θ) − ω0∥ + ∥ω*(θ) − ω1∥ (also called the geometric median) gives the optimal solution because it maximizes the projections onto the preference vectors.
7.3 Results on the MOGW and DST environments.
7.4 PPA on MuJoCo environments. We show the Pareto Dominance (PD) and Average Utility (UT) metrics. Experiments are run for 10 random seeds. We compare against the non-adaptive PPA and the fixed-preferences agents.
1 OpenAI Gym environments used in our experimental results.

Abstract

From autonomous vehicles to edge computing and from robots to implantable medical devices, Cyber-Physical Systems (CPS) have penetrated every aspect of human life. With pervasiveness come challenges for engineers to identify and tackle novel problems. The interaction between humans and autonomous agents operating in the same environment has created the need to provide formal safety specifications. The plethora of multi-modal systemic data collected from the sensors necessitates the choice of an appropriate feature space for representation, such that the data is useful for modeling, analysis, and control. The controller design for such systems requires taking into account the non-Markovian nature of the constructed systemic models and optimizing under conflicting objectives. Motivated by these challenges, this thesis lays the theoretical and algorithmic foundations for modern Cyber-Physical Systems (CPS) by making multiple, distinct, and independent contributions in the fields of verification, learning, and control. In the field of verification, we introduce Stochastic Temporal Logic (StTL) to handle systemic and environmental uncertainty arising in modern CPS. We leverage our StTL formalism for specification mining and robust design under uncertainty, as well as for designing safe control behaviors from imperfect demonstrations. In the field of learning, we develop methods to learn structured feature representations from CPS sensory and vision data. We develop a persistent homology framework for extracting topological features and we learn to represent such features in hyperbolic spaces.
Additionally, we use variational models to capture the hierarchical structure of the latent feature space. Finally, in the field of control, we address the challenges of non-Markovian dynamics and multi-criteria optimization for control. For the former case, we model non-Markovianity via fractional order models and study how the topological structure affects the controllability properties. For the latter case, we explore gradient descent methods for the data-driven optimization of multi-objective control criteria. All of our contributions are backed by theoretical and algorithmic results and validated in multiple case studies inspired by CPS applications.

Chapter 1
Introduction

1.1 Challenges and Motivation

As the self-driving car marches down the road, the cameras and LIDAR sensors work in harmony with the accelerator and steering angle actuators to ensure that the lane is tracked, a safety distance is kept and the posted signs are followed. The sensory input is mapped to control actions by a cutting-edge deep reinforcement learning controller which has been trained on previously gathered behaviors from expert drivers. The controller optimizes a collection of objectives, including the energy cost, ride comfort and time-to-destination, and finds the perfect trade-off among them. The traffic lights have embedded vision systems which perceive the traffic and relay that information to a centralized location to construct a traffic model and optimize the switching such that congestion is avoided. The "driver" of the car sits comfortably in the cabin with a portable brain scanner connected to her head. The car dashboard has a computer screen and the driver is responding to her email and browsing the web, all without moving a finger; the brain scanner decodes her thoughts and intentions and sends the appropriate commands to the computer.
Suddenly, a pedestrian jumps in front of the car and the safety assurance controller takes over: the brake is applied at full force and the driver is abruptly woken up. She realizes what happened and exhales in relief. Her increased heartbeat is recorded by her watch and the data is relayed to her medical care provider, where it is integrated with the rest of her medical record. Later in the day she is due to receive her personalized recommendations for life-style and medication alterations to reduce stress and increase longevity. She looks out of the window of her car and sees a swarm of drones making package and food deliveries to nearby apartments. She thinks that she needs to order food for when she gets back home, and within seconds the computer places the order. She drives away, not taking over the steering wheel; she fully trusts the system.

The aforementioned futuristic description is a Smart City, one of the most complicated and inclusive Cyber-Physical Systems. By a Cyber-Physical System (CPS), researchers understand engineering systems that are deeply intertwined with software components, are able to operate at different spatiotemporal scales and are characterized by multiple modalities. CPS integrate sensing, actuation, computation, control and networking into inter-connected physical objects, humans, and infrastructure. Even though a Smart City such as the one described above is a mere conceptualization, we can readily recognize that some of the sub-systems (e.g., the self-driving car) have already reached commercialization and others are in research development. Therefore, we can safely assume that it is a matter of time before the Smart City becomes part of our lives. With pervasiveness come great challenges for engineers to properly analyze, design and maintain the sub-systems. The Smart City design as described above raises several novel questions:

1. How do we incorporate systemic and environmental uncertainty in the formalism used for expressing safety specifications?

2. How do we design the self-driving car safety assurance controller such that it satisfies safety specifications?

3. How do we leverage the brain recordings from a human subject to decode her visual imagery and thoughts?

4. How do we design the deep reinforcement learning controller such that the optimal trade-off between objectives is achieved?

5. Which traffic lights should we control, and what is the control signal, such that a global objective is attained?

All the aforementioned challenges arise due to the unique and complex nature of CPS, and this is by no means an exhaustive list of the challenges in designing the Smart City. In answering these questions, we need to devise novel tools and frameworks and to re-think how existing methods can be tailored to CPS.

1.2 Thesis Contributions

This thesis lays the theoretical and algorithmic foundations for answering the questions raised in Sec. 1.1. We make multiple, distinct and independent contributions in the fields of verification, learning and control, all of which are summarized below.

Our first contribution lies in the field of formal verification and aims to handle environmental and systemic uncertainty in CPS. In more detail, we propose Stochastic Temporal Logic (StTL) [1, 2] as a formal language for expressing rich, expressive safety and performance specifications in the context of CPS. We formulate an algorithm to mine StTL specifications from CPS models and data. Given a parametric StTL formula (i.e., an StTL formula where certain numerical constants are replaced by formula parameters), our mining algorithm infers values for the formula parameters that lead to the marginal satisfaction of the StTL formula.
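To make the marginal-satisfaction idea concrete, consider a toy one-parameter instance (our own illustration, not the thesis' algorithm): for a template such as "always, distance > c", satisfaction is monotone in c, so a bisection recovers the tightest value of c on the satisfaction boundary. All names and the toy system below are assumptions for illustration.

```python
def mine_parameter(sat, lo, hi, tol=1e-3):
    """Bisection search for the tightest parameter c in [lo, hi] with sat(c) True.

    Assumes monotonicity: if sat(c) holds, then sat(c') holds for all c' <= c,
    as for the lower bound c in a template like 'always, distance > c'.
    The returned value sits on the marginal-satisfaction boundary (within tol).
    """
    if not sat(lo):
        raise ValueError("specification violated even at the loosest parameter")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sat(mid):
            lo = mid   # still satisfied: tighten the parameter further
        else:
            hi = mid   # violated: back off
    return lo

# Toy "system": the minimum distance observed across its traces is 42.0 m,
# so 'always, distance > c' is satisfied exactly for c < 42.0.
tightest = mine_parameter(lambda c: 42.0 > c, lo=0.0, hi=100.0)
assert abs(tightest - 42.0) < 1e-2
```

With several formula parameters, the boundary is a surface rather than a single point, which is why the actual mining algorithm is considerably more involved than this one-dimensional sketch.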
Following that, we introduce quantitative semantics for StTL that provide us with a mathematical approach to quantify the robustness of a system evolving in an uncertain dynamic environment. Using the quantitative semantics of StTL, we provide a technique to estimate the robustness of the given system. We define the robustness estimate as the maximal perturbation to a given StTL specification such that the specification continues to be satisfied (or violated) by the given stochastic system. Finally, we develop a new controller design technique for stochastic dynamical systems, where we express the controller design specifications as StTL formulas, compute their robustness and use this quantity to guide a black-box (stochastic) optimization algorithm which explores the controller parameter space, attempting to find a global maximum of the robustness estimate.

Following that, we leverage our StTL formalism to learn safe control policies from imperfect demonstrations. In the field of learning from demonstrations (i.e., inverse reinforcement learning or inverse optimal control), we essentially aim to find the reward that the agent is optimizing, having access only to a set of trajectories. Given that the problem is ill-posed, as many rewards can explain a given set of trajectories, the Maximum Causal Entropy (MCE) framework resolves the ambiguity by maximizing the entropy of the trajectory distribution. Our contribution extends the MCE framework to account for safety and/or performance requirements expressed as Stochastic Temporal Logic (StTL) specifications [3]. We show how to encode the StTL formula via a minimal set of mixed-integer linear constraints. We generate these constraints using cutting hyperplanes as an approximation of the feasible regions of atomic predicates, and then we recursively propagate these hyperplanes through the structure of the formula.
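The role of the binary variables can be illustrated on a single disjunction. Assuming each atomic predicate's feasible region is approximated by a half-space a_i·x ≥ b_i, the standard big-M trick (a sketch of the general idea, not the thesis' exact encoding) relaxes each constraint with a binary selector z_i and requires at least one selector to be active. Brute-forcing the binary assignments below stands in for what a MILP solver would do:

```python
from itertools import product

M = 1e3  # big-M constant: any bound larger than the predicate ranges

def encoding_feasible(halfspaces, x):
    """True iff some binary assignment z (with sum z >= 1) satisfies every
    relaxed constraint  a_i.x >= b_i - M*(1 - z_i)  -- i.e. the mixed-integer
    encoding of the disjunction OR_i (a_i.x >= b_i) is feasible at x."""
    for z in product((0, 1), repeat=len(halfspaces)):
        if sum(z) < 1:
            continue  # at least one disjunct must be selected
        relaxed = all(
            sum(ai * xi for ai, xi in zip(a, x)) >= b - M * (1 - zi)
            for (a, b), zi in zip(halfspaces, z)
        )
        if relaxed:
            return True
    return False

# OR of two half-spaces in R^2:  x1 >= 1  or  x2 >= 1
hs = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0)]
assert encoding_feasible(hs, (2.0, 0.0))      # lies in the first half-space
assert not encoding_feasible(hs, (0.0, 0.0))  # lies in neither
```

The encoding is feasible at a point exactly when the point lies in at least one of the half-spaces, i.e., in the non-convex union that a purely linear constraint set cannot express.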
The benefits of our approach are two-fold: first, our encoding produces a minimal set of integer constraints, and second, by using cutting hyperplanes, we avoid having to propagate non-linear constraints. Our algorithmic contributions are validated in different environments and for different specifications.

Our second contribution is in the field of learning and aims to deal with the topological and hierarchical features arising from CPS sensory and vision input. For the former case, we focus on a specific class of features known as persistence diagrams, which record the persistence of topological features across varying scales [4]. We identify that current state-of-the-art methods for representing such features are inherently flawed, because they fail to address the existence of features of infinite persistence (essential features), which appear particularly often in CPS sensory recordings. Motivated by that, we introduce a method to represent persistence diagrams on a hyperbolic space, more specifically on the Poincare ball. We define a learnable parameterization of the Poincare ball and leverage the vectorial structure of the tangent space to combine the representations of individual points of the persistence diagram. Our method learns better task-specific representations than the state of the art because it does not shrink the relative importance of essential features. In fact, by allowing the representations of essential features to get infinitesimally close to the boundary of the Poincare ball, their distance to the representations of non-essential features approaches infinity, therefore preserving their relative importance.

In the latter case, we introduce variational models to capture the hierarchical structure of the latent space. Classical variational models learn a generative distribution over the data using a single latent space.
The latent vector is essentially a low-dimensional encoding of the data, and by Gaussian regularization we guarantee that the latent space can be easily sampled. The single-latent-space assumption may bottleneck learning, because there are physical processes which are characterized by hierarchical information processing. This is particularly evident in the visual information processing occurring in the human brain, where the lower visual cortex encodes low-level image features and the higher cortex encodes high-level features. Motivated by this fact, we introduce Hierarchical Variational Autoencoders (HVAEs) as a method to learn meaningful representations, and we leverage our architecture for the problem of decoding visual imagery from brain recordings [5]. By mapping the early stages of the visual pathway to the first set of latent variables and the higher visual cortex areas to the deeper layers in the latent hierarchy, we are able to construct a latent variable neural decoding model that replicates the hierarchical visual information processing. Our model achieves better reconstructions compared to the state of the art, and our ablation study indicates that the hierarchical structure of the latent space is responsible for that performance.

Our third contribution lies in the field of control and aims to address the non-Markovian and multi-agent spatio-temporal dynamics arising in modern CPS. To address the non-Markovian dynamics in CPS, we focus on the model-based case and provide a unified framework for controlling complex networks that takes into account a wide spectrum of memory dynamics [6, 7]. In more detail, we propose a linear, discrete-time, fractional-order dynamical system as a model for complex, multi-agent dynamical networks with long-term memory. Furthermore, we consider the cost-efficient control of this system, where we assume heterogeneous costs and upper-bound the total cost via a knapsack-like constraint. Building on recent advances in theoretical computer science for submodular optimization under knapsack constraints, we present a greedy algorithm for the placement problem with approximation guarantees. From extensive experimental results for Erdős–Rényi and Barabási–Albert complex networks, we observe that the proposed algorithm achieves on average 95% of the globally optimal objective value.

Following that, we steer our attention to the model-free case and address the problem of controlling under the competing criteria arising in CPS [8]. More specifically, we assume unknown linear preferences over the different criteria and explore Pareto optimal solutions as a way to trade off among them. We develop Pareto stationarity as a necessary, first-order condition for Pareto optimality and design a gradient algorithm which uses a projected gradient descent solver to search for, and take steps in, a common ascent direction for all objectives. Following that, we tackle the problem of adapting the gradient method to a given preference distribution. We utilize our method from the first part and introduce Pareto Policy Adaptation (PPA), a loss function that penalizes deviations between the given and recovered preferences. Using recent advances in implicit differentiation, we are able to back-propagate the gradient of PPA bypassing the operations of the projected gradient descent solver, which makes our method applicable to real-world problems. We evaluate our method in a series of reinforcement learning tasks, show its effectiveness in practice and compare against other state-of-the-art approaches.

1.3 Thesis Outline

The thesis is organized as follows. In Chapter 2 we focus on CPS verification and introduce Stochastic Temporal Logic (StTL) as a formalism to handle systemic and environmental uncertainty.
We leverage StTL for specification mining and robust design, and we illustrate the utility of our method in experimental results inspired by the autonomous vehicle domain. Chapters 3 and 4 are devoted to representation learning for the topological and hierarchical features, respectively, arising in CPS. We use an ensemble of tools from topological data analysis, persistent homology and variational inference, and we illustrate our methods in several CPS-inspired learning tasks. In Chapters 5, 6 and 7 we focus our attention on CPS control. In Chapter 5 we show how our contributions from Chapter 2 can be utilized in the context of learning safe control policies from imperfect demonstrations, whereas in Chapter 6 we introduce a method to control complex networks of varying topologies with long-term memory. Finally, in Chapter 7, we address the problem of learning optimal control policies under multiple criteria and introduce a gradient method. All of our contributions are backed by theoretical and algorithmic results and illustrated in CPS-inspired applications.

Chapter 2
Verification under Uncertainty

2.1 Introduction

Formal verification lies at the heart of safety and security, which are of paramount importance in CPSs. The motivating example of the Smart City mentioned in Sec. 1.1 highlights the integral part of safety: in a CPS setting, each agent regularly interacts with other agents and humans, and certain critical safety specifications must always be satisfied (e.g., never change lanes if the blind spot is not clear). Therefore, to ensure safety in such systems, it is imperative that we construct controllers that are robust to systemic and environmental variations. Another noteworthy property of modern CPSs is that they are characterized by unstructured environmental uncertainty, as is evident in the aforementioned Smart City example.
To construct robust cyber-physical system (CPS) applications, we have to carefully reason about uncertainties occurring in the environment as well as in system components (e.g., unforeseen interdependencies, parameter variations, hardware aging, software updates that do not fully account for hardware variations). There are typically two divergent views on tackling uncertainty. The first assumes that uncertainty is nondeterministic, i.e., all feasible behaviors of the environment or the system can occur. However, for systems that interact with a physical environment, such an assumption can lead to over-constrained designs or overly pessimistic assessments of system correctness. The second approach relies on a probabilistic framework for defining system correctness. This approach is suitable for systems that are stochastic in nature, or that have highly uncertain environments which cannot be precisely modeled.

The simplest approach to modeling stochasticity is to model variables in the system (or environment) as random variables with a parametric probability distribution. For instance, it is common to use a normal distribution with unknown mean and variance to model a random variable. The use of such a distribution is typically justified from empirical observations of the system, and the parameters of the distribution (mean and variance in this case) are then estimated from data. Such assumptions are routinely used to model time-invariant stochasticity, such as noise in sensors or manufacturing variations in physical models [9]. However, modern CPS applications often operate in uncertain environments. For example, consider applications such as air traffic management, where the underlying system is modeled as a stochastic hybrid system [10], probabilistic collision detection models for autonomous cars [11], or closed-loop control of blood glucose in an artificial pancreas using stochastic, fractional dynamical system models [12].
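For the time-invariant case, fitting such a parametric model is a one-liner per parameter. A minimal sketch with maximum-likelihood moment estimates and toy data of our own choosing:

```python
def fit_normal(samples):
    """Fit an assumed time-invariant Gaussian model to empirical observations:
    maximum-likelihood estimates of the mean and (population) variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

# e.g. repeated measurements of one sensor under fixed conditions
mean, var = fit_normal([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
assert mean == 5.0 and var == 4.0
```

It is precisely the time-varying setting discussed next where this simple recipe stops being adequate.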
For systems with such time-varying disturbances and uncertain dynamics, a better approach to modeling uncertainty is to treat the system as a stochastic dynamical system, where the probability distribution of the system variables is a function of time, described by an appropriate stochastic differential equation. Constructing such models enables reasoning about several interesting problems such as the design of controllers, probabilistic reasoning about system correctness, and quantifying the robustness of a system design. It is common for stochastic models of such systems to be data-driven, i.e., empirically constructed from observations of systemic data [13]. There are two main challenging problems in CPS models operating in uncertain environments. The first is to precisely articulate their behavioral specifications. In certain settings, such as for legacy applications, or for emergent applications requiring rapid prototyping, specification-driven design is often infeasible. In such cases, post hoc mining of formal design specifications has proven to be an interesting approach that adds significant value to the design process [14, 15, 16]. The second challenge is to obtain controllers that ensure that the system satisfies its formal specifications under appropriate uncertainty models.

Recently, there has been considerable research on using real-time temporal logics such as Signal Temporal Logic (STL) [17] for expressing properties and abstractions of closed-loop cyber-physical system models [18, 19, 20, 21], and on the subsequent use of such specifications for control design. However, for effective probabilistic reasoning about stochastic system models, we need mathematical frameworks that include probabilities as part of the system specification. Recently, STL was extended to three probabilistic variants, Probabilistic STL (PrSTL) [22], Chance Constrained Temporal Logic (C2TL) [9], and Stochastic STL (StSTL) [23].
The logics differ in technicalities of how atomic predicates are defined, but they essentially provide a logical characterization of uncertainty. A recent paper proposes Stochastic Temporal Logic (StTL) as a formalism that unifies certain features from these logics [24]. The research in [23, 9, 22] crucially provides techniques to design controllers under specific uncertainty models.

Contributions: In this chapter, we seek to use these logics to achieve the two research challenges we outlined before: designing controllers from uncertainty-aware specifications, and mining such specifications from CPS models and data. Specifically, the contributions are as follows:

1. We formulate a new controller design technique for stochastic dynamical systems with the goal of satisfying a given set of uncertainty-aware temporal specifications. For this problem, we seek to find the controller parameters such that the system satisfies the specification. To this end, we initially introduce Stochastic Temporal Logic (StTL) as a formal language to encode the design specifications.

2. We introduce quantitative semantics for StTL that provide us with a mathematical approach to quantify the robustness of a system evolving in an uncertain dynamic environment. Using the quantitative semantics of StTL, we provide a technique to compute the robustness estimate of the given system. We define the robustness estimate as the maximal perturbation to a given StTL specification such that the specification continues to be satisfied (or violated) by the given stochastic system.
Figure 2.1: Overview of Research Problems and Solutions

3. We develop a new controller design technique for stochastic dynamical systems, where we express the controller design specifications as StTL formulas, compute their robustness and use this quantity to guide a black-box (stochastic) optimization algorithm which explores the controller parameter space attempting to find a global maximum of the robustness estimate.

4. We formulate an algorithm to mine StTL specifications from CPS models and data. Given a parametric StTL formula (i.e., an StTL formula where certain numerical constants are replaced by formula parameters), our mining algorithm infers values for the formula parameters that lead to the marginal satisfaction of the StTL formula. Finally, we establish a connection between specification mining and robustness estimation, which allows us to speed up the robustness computation in certain cases.

2.2 Signals, Systems and Logic

Stochastic Dynamical Systems. We now introduce fundamental concepts and terminology used throughout the rest of the chapter. We consider a controlled stochastic dynamical system M(C) as an abstract model of a physical system whose behavior is regulated by a software controller. The controller C(q) depends on a set of parameters q known as the controller parameters, assumed to belong to some controller parameter space P_q. We abbreviate M(C) to M when C is obvious from the context. We assume that, except for the controller parameters q, M is a black-box model with a numerical simulator. This implies that, upon setting values to the controller parameters, we can obtain example behaviors of the system M using the numerical simulator, without having any explicit representation of the dynamics.
For example, M can be considered to be a model that uses proprietary code or model components [1], which may not provide an analytic description of the dynamics of each component. Alternatively, M could be a physical system whose analytic dynamics are too cumbersome to construct. In both cases the black-box assumption allows us to consider a broad class of systems and study them using example behaviors obtained by numerical simulation or physical testing. The term stochastic implies that the dynamics of M depend on an input that can be perturbed by the environment, and/or noisy sensors or actuators, and/or internal randomness in the physics that is being modeled. The stochasticity assumption implies that simulation or testing of the system will give different outputs for the same controller parameters. This output is associated with a transition function Φ_M : P_q × R_≥0 → X, where X ⊆ R^n is the output space. Let ν be a valuation function mapping the controller parameters q to some value in P_q. The output of the model is defined by the transition function x_ν(q)(t) = Φ_M(ν(q), t) at time t ∈ R_≥0. Since M is dependent on uncertainties in the environment or the system itself, the output x_ν(q)(t) is essentially a stochastic process. The formal definition of a stochastic process is as follows.

Definition 2.1 (Stochastic Process). A probability space is a triple (Ω, F, P), where Ω is the set of all possible outcomes, F is a σ-algebra [2] of subsets of Ω, and P is a probability measure that associates each such subset with its probability (a real number in [0, 1]). Given a measurable output space X ⊆ R^n, a random variable x is a measurable function x : Ω → X. A stochastic process x(t) is a collection of random variables indexed by (discrete or continuous) time.

[1] For example, as a precompiled block within a Simulink® model.
[2] A collection of subsets of Ω that is closed under complement and under countable unions and intersections.
The evolution of x(t) is given by the time-varying probability density function (PDF) p(x, t) and its corresponding cumulative distribution function (CDF) F_x(x, t). The stochastic process x(t) [3] gives a complete description of the output of the system. However, an exact model of x(t) is not only hard to obtain, as argued previously, but could also be impractical, as the analytic description could be a highly complicated mathematical expression. Therefore, we rely on approximating x(t) using the output traces given by black-box simulation or testing of the system. Formally, the output trace is defined as follows:

Definition 2.2 (Output Trace). Let M = M(C) be a controlled stochastic dynamical system and x_ν(q)(t) = Φ_M(ν(q), t) the output stochastic process (x(t) for brevity). An output trace or realization of this process is the sequence (t_0, x_0), …, (t_T, x_T) obtained by sampling a realization of the stochastic process x(t) at discrete time instances t_0 < t_1 < … < t_T.

An output trace or signal (the terms are often used interchangeably) is essentially a time-indexed collection of multidimensional measurements of the system variables of interest. We now describe a technique that allows us to learn a stochastic model of the system under consideration from individual realizations of its stochastic output process. Assume that we have tuned the controller parameters to the desired values and use the black-box model of the system to obtain M output traces of length T, namely {(t_0, x_0^m), …, (t_T, x_T^m)}_{m=1}^M. Subsequently, for each time instance t_n, using the ensemble statistics method, we can estimate the probability distribution of the system output x at time t_n (denoted p(x, t_n)) as follows:

Definition 2.3 (Model based on Ensemble Statistics).

p(x, t_n) = (1 / (M h)) Σ_{m=1}^{M} λ( (x − x_n^m) / h )    (2.1)

[3] When there is no confusion the subscript ν(q) may be omitted.
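To make Eq. (2.1) concrete, the following minimal Python sketch estimates p(x, t_n) from the M sampled trace values at a single time instant. The Gaussian kernel and all function names are illustrative assumptions, not notation from the thesis.

```python
import math

def gaussian_kernel(u):
    # Standard normal PDF, a common symmetric choice for the kernel λ.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def ensemble_density(x, samples, h=1.0):
    # Eq. (2.1): p(x, t_n) ≈ (1/(M h)) Σ_m λ((x − x_n^m)/h), with the
    # m-th entry of `samples` playing the role of x_n^m at a fixed t_n.
    M = len(samples)
    return sum(gaussian_kernel((x - xm) / h) for xm in samples) / (M * h)

# Ensemble of M = 5 trace values observed at one time instant.
samples = [0.1, -0.2, 0.05, 0.3, -0.1]
p = ensemble_density(0.0, samples)
```

With h = 1 (the smoothing parameter used later in the chapter), sweeping x over a grid of query points yields the estimated PDF p(x, t_n) at that time instant.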
where λ is a kernel density function and h > 0 is a smoothing parameter that is chosen empirically. Typically, the kernel function chosen is some symmetric probability density function, such as a Gaussian kernel. This approach makes few assumptions about the dynamics of the stochastic system and relies mainly on our ability to simulate it and obtain output traces. Methods for choosing the kernel as well as the bandwidth are discussed extensively in [25]; we primarily use the Gaussian kernel and the smoothing parameter h = 1. We note that the ensemble statistics method is far from the only technique to obtain a stochastic model from data. A more general method is based on generalized master equations (GMEs) [26], which are popular modeling formalisms in statistical physics, chemistry and biology. We give a brief background on GMEs and their relation to the ensemble statistics method in the appendix.

Stochastic Temporal Logic. In the context of stochastic dynamical systems, formulating rich and expressive specifications requires that we use a formal language which takes into account the non-deterministic nature of the systemic variables. Stochastic Temporal Logic (StTL), first introduced in [24], is an amalgam of logics such as StSTL [23] and C2TL [9]. For an event involving a random variable x, we let P(·) denote its probability. StTL formulas are defined over atomic predicates represented by chance constraints of the form:

μ := P(f(x(t)) ≤ c) ≥ ε    (2.2)

where c ∈ R, f : X → R is an invertible real-valued function, x(t) is the random variable associated with the stochastic process x, and ε ∈ [0, 1]. We assume that the time index t is defined over T, which is some subset of the positive real numbers. The syntax of StTL is identical to that of STL, where atomic predicates [4] are replaced by chance constraints of the aforementioned form:

ϕ := μ | ¬ϕ | ϕ ∧ ϕ | ϕ U_I ϕ    (2.3)

where I is a time interval, i.e., any convex subset of T.
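As a minimal sketch of how such a chance constraint can be checked against simulation data, the snippet below replaces the true probability in Eq. (2.2) with an empirical frequency over M trace values at one time instant; this Monte-Carlo simplification and the function names are our assumptions, not the thesis's method.

```python
def predicate_holds(f, samples, c, eps):
    # Empirical check of the atomic predicate P(f(x(t)) <= c) >= eps,
    # with `samples` holding the ensemble values of x at a fixed time t.
    prob = sum(1 for x in samples if f(x) <= c) / len(samples)
    return prob >= eps

# Five sampled values of x(t); with f the identity, 3 of 5 are <= 1.0,
# so the predicate P(x(t) <= 1.0) >= 0.5 holds empirically.
samples = [0.5, 1.2, 0.8, 2.0, 0.3]
holds = predicate_holds(lambda x: x, samples, c=1.0, eps=0.5)
```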
The truth value of the atomic predicate is interpreted based on its satisfaction properties: it is true (denoted ⊤) if and only if f(x(t)) ≤ c holds with probability at least ε, and false (denoted ⊥) otherwise.

Definition 2.4 (Recursive Semantics of StTL). The semantics of StTL define what it means for an StTL formula to be satisfied by a given stochastic process x:

(x, t) ⊨ μ ≡ P(f(x(t)) ≤ c) ≥ ε
(x, t) ⊨ ¬ϕ ≡ ¬((x, t) ⊨ ϕ)
(x, t) ⊨ ϕ_1 ∧ ϕ_2 ≡ (x, t) ⊨ ϕ_1 ∧ (x, t) ⊨ ϕ_2
(x, t) ⊨ ϕ U_I ψ ≡ ∃ t′ ∈ t ⊕ I : ((x, t′) ⊨ ψ ∧ ∀ t″ ∈ [t, t′) : (x, t″) ⊨ ϕ)    (2.4)

In the above definition, the ⊕ operator denotes the Minkowski sum of the interval with t, i.e., if the interval I = [a, b], then t ⊕ I denotes the interval [t + a, t + b]. The formulas F_I ϕ and G_I ϕ are introduced as shorthand for true U_I ϕ and ¬F_I ¬ϕ, respectively. Following the usual convention for logics such as STL, we write x ⊨ ϕ as shorthand for (x, 0) ⊨ ϕ. We now define the Boolean satisfaction signal, which we extend to the quantitative semantics in the next section.

[4] StTL can be generalized to include chance constraints over arbitrary Boolean conjunctions of predicates over the given stochastic process; see [24] for the precise syntax. All the algorithms can be extended to the full StTL syntax; we omit this extension for ease of exposition.

Definition 2.5 (Satisfaction Signal). The satisfaction signal for an StTL formula ϕ is defined as a function from T to {⊤, ⊥} such that:

χ(ϕ, x, t) = ⊤ if (x, t) ⊨ ϕ, and ⊥ otherwise    (2.5)

Parametric Stochastic STL (PStTL) is an extension of StTL introduced to define template formulas. Essentially, a template formula is an StTL formula in which some numeric constants are replaced by formula parameters. Let π be a set of formula parameters. The set π can be partitioned into three disjoint sets of variables π_c, π_ε, and π_τ, of which at least one is nonempty.
The variables in π_c, π_ε and π_τ are called value parameters, probability threshold parameters and time domain parameters, respectively; they take values in some compact space P_{π_c}, in [0, 1], and in [0, t_T], respectively. As for controller parameters, we use ν to denote a valuation function that maps any of the above formula parameters to a value in its respective domain. An StTL formula ϕ is obtained by pairing a PStTL formula ϕ(π) with a valuation function ν.

Problem Definitions. We outline two main problems that motivate the work in this part of the thesis.

1. Control Design Problem: Given an StTL formula ϕ, find a valuation ν⋆ for the controller parameters q such that the controlled stochastic output process corresponding to M(C(ν⋆(q))) satisfies the specification ϕ.

2. Specification Mining Problem: Given a PStTL formula ϕ(π) and a fixed valuation ν_fixed for the controller parameters q of the controlled model M, identify valuations ν⋆ for the formula parameters π such that M(C(ν_fixed)) satisfies ϕ(ν⋆(π)).

2.3 Controller Design via Robustness

The main objective of this section is to propose a method to address the control design problem. At the outset, we state (without proof) that the control design problem is undecidable in general. The proof follows from the undecidability of the verification problem for arbitrary dynamical systems with specifications drawn from a simple subset of STL [27]. Hence, we develop a best-effort technique that attempts to design a robust controller that maximally (but not provably) tries to satisfy a given StTL specification. A key element of this technique is a quantitative measure of the satisfaction of a given StTL specification. We call this measure the robustness estimate. The precise definition of the robustness estimate is provided by quantitative semantics for StTL. To define the quantitative semantics, we define a parameterization operation that converts a given StTL formula to a PStTL formula.
We assume an unbounded supply of formula parameter variables, and replace each numeric constant appearing in the StTL formula with a formula parameter. For brevity, we skip the formal definition. Furthermore, for ease of exposition, we assume an empty π_τ, i.e., we do not replace any intervals appearing in the temporal operators.

Robustness Definition. We assume that the co-domain X of the random variables constituting the stochastic output process of the model M is a metric space. This allows us to define a metric d_X over X. We can use the standard Euclidean distance as a metric for both X and [0, 1], i.e., the domain of the probability threshold parameters. Suppose the StTL formula has m value parameters and m probability threshold parameters, i.e., P_π ⊆ X^m × [0, 1]^m. We can extend the metric d_X over X to a metric d_{P_π} over the set P_π in the standard fashion. Given a valuation ν mapping π to P_π, we define an open ε-ball (denoted B(ν, ε)) around the valuation ν as the set of all valuations ν′ such that d_{P_π}(ν(π), ν′(π)) < ε. The quantitative semantics are then defined in terms of the distance of the given StTL formula to the satisfaction boundary.

Definition 2.6 (Satisfaction Boundary). Let ϕ(π) be a PStTL formula. For a given stochastic process, the formula ϕ(π) partitions P_π into disjoint subsets P_π^⊤, P_π^⊥ and P_π^B such that:

∀ ν(π) ∈ P_π^⊤, ∃ ε > 0 : x ⊨ ϕ(ν(π)) ∧ ∀ ν′ ∈ B(ν, ε) : x ⊨ ϕ(ν′(π))
∀ ν(π) ∈ P_π^⊥, ∃ ε > 0 : x ⊭ ϕ(ν(π)) ∧ ∀ ν′ ∈ B(ν, ε) : x ⊭ ϕ(ν′(π))
P_π^B = P_π \ (P_π^⊤ ∪ P_π^⊥)

The set P_π^B is called the satisfaction boundary. By convention, we assume that for all ν(π) ∈ P_π^B, x ⊨ ϕ(ν(π)); in other words, points on the boundary are assumed to satisfy the formula. Essentially, the sets P_π^⊤ and P_π^⊥ are defined as open sets containing satisfying and violating valuations, respectively, and P_π^B is the intersection of the closures of these two sets. We now define the notion of distance to the boundary:

Definition 2.7 (Distance to Boundary).
For a given valuation ν, the distance to the boundary d_B is defined as follows:

d_B(ν) = min { d_{P_π}(ν, ν′) | ν′ ∈ P_π^B }    (2.6)

Observe that the above minimization is a convex optimization problem subject to equality constraints, which can be solved using the Karush-Kuhn-Tucker (KKT) conditions [28]. However, solving this optimization problem requires knowledge of the satisfaction boundary, which, even in the case of STL, is a computationally challenging problem [29]. Thus, we propose a technique to approximate the quantitative semantics that allows us to estimate this distance. We first define the notion of the robustness signal for a single atomic predicate in Definition 2.8. In the definition below, we recall that F_x(x, t) denotes the cumulative distribution function of x at time t; also, f^{−1} denotes the inverse of the function f that appears in the signal predicate.

Definition 2.8 (Predicate Robustness). Consider an atomic predicate of PStTL, μ := P(f(x(t)) ≤ c) ≥ ε. Let ν be some fixed valuation. We define a valuation ν⋆ as follows:

ν⋆(ε_B) = F_x(f^{−1}(ν(c)), t)    (2.7)
ν⋆(c_B) = f(F_x^{−1}(ν(ε), t))    (2.8)

The first expression above gives the valuation of the parameter ε_B when the value parameter is exactly ν(c). The second one gives the valuation of the parameter c_B for which the probability of f(x(t)) ≤ ν⋆(c_B) is exactly ν(ε). Then, the predicate robustness is defined as follows:

ρ_μ(x, t) = (Δc, Δε) = (ν⋆(c_B) − ν(c), ν⋆(ε_B) − ν(ε))    (2.9)

Lemma 2.1. The parameter valuations ν′ = (ν⋆(c_B), ν(ε)) and ν″ = (ν(c), ν⋆(ε_B)) both belong to the satisfaction boundary.

We omit a formal proof, but essentially observe that ν⋆, as defined, maps both c and ε to values that are marginally satisfied by the atomic predicate. Any negative perturbation to ν′ or positive perturbation to ν″ must lead to the corresponding atomic predicate no longer being satisfied by the stochastic process x. Thus, by definition, as ν′ and ν″ do not have an equi-satisfiable open neighborhood, they must lie on P_π^B.
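For the special case where f is the identity map, the two components of Eq. (2.9) can be read off an empirical CDF built from ensemble samples at a fixed time. The sketch below uses sorted samples and a crude quantile rule, both of which are our simplifications rather than the thesis's implementation.

```python
import bisect

def predicate_robustness(samples, c, eps):
    # Predicate: P(x(t) <= c) >= eps, with f the identity map.
    xs = sorted(samples)
    M = len(xs)
    F_at_c = bisect.bisect_right(xs, c) / M   # empirical F_x(c, t), cf. Eq. (2.7)
    k = min(int(eps * M), M - 1)              # crude empirical F_x^{-1}(eps, t)
    c_boundary = xs[k]                        # cf. Eq. (2.8) with f = identity
    # Eq. (2.9): (Delta c, Delta eps) as defined in the text.
    return (c_boundary - c, F_at_c - eps)

samples = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
dc, deps = predicate_robustness(samples, c=0.5, eps=0.4)   # ≈ (-0.1, 0.1)
```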
We illustrate the idea of the predicate robustness signal for a fixed time t in Fig. 2.2. We are now ready to define the robustness signal for an arbitrary StTL formula.

Definition 2.9 (Robustness Signal). Let ρ_μ(x, t) denote the predicate robustness signal for the predicate μ, and let ν be a valuation function that assigns a value to each parameter in the PStTL formula ϕ(π). The robustness signal ρ(ϕ, x, t) is defined recursively as follows:

ρ(true, x, t) = (∞, 1 − ε)
ρ(μ, x, t) = ρ_μ(x, t)
ρ(¬ϕ, x, t) = −ρ(ϕ, x, t)
ρ(ϕ_1 ∧ ϕ_2, x, t) = min(ρ(ϕ_1, x, t), ρ(ϕ_2, x, t))
ρ(ϕ U_I ψ, x, t) = sup_{t′ ∈ t⊕I} min( ρ(ψ, x, t′), inf_{t″ ∈ [t, t′)} ρ(ϕ, x, t″) )    (2.10)

where ρ_μ(x, t) is the predicate robustness as given by Def. 2.8.

We remark that the robustness computation looks very similar to that of STL, but there are some nuances to keep in mind. Consider the formula ¬(P(x(t) > c) > ε). Note that this formula is equivalent to P(x(t) > c) ≤ ε. In other words, the signal predicate remains unchanged; only the comparison to the probability threshold gets negated. However, we can show by a simple argument that if the robustness of P(x(t) > c) > ε is (Δc, Δε), then the robustness of P(x(t) > c) ≤ ε is indeed (−Δc, −Δε), as defined in Eq. (2.10). In other words, the definition of robustness as a signed metric preserves the Boolean satisfaction semantics of StTL. Also, for the formula true, the robustness (in the value domain) is infinite, and any given valuation ε for the probability threshold is 1 − ε away from the satisfaction boundary. Given that the predicate robustness is a pair (Def. 2.8), the remaining operators in the definition above are applied element-wise to each element of the tuple.

Robustness estimate computation. Definition 2.9 gives a recursive recipe to compute the robustness estimate for a given StTL formula. We note that it suffices to compute the robustness for three operators, ¬, ∧ and U_I, as all the other Boolean and temporal operators can be expressed as functions of these.
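Since the min/sup/inf in Eq. (2.10) act element-wise on (Δc, Δε) pairs, the derived operators G_I and F_I reduce to element-wise minima and maxima over a window of predicate-robustness pairs. The sketch below assumes discrete time and a precomputed pair signal; the encoding and function names are ours.

```python
def pair_min(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]))

def pair_max(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]))

def rho_G(rho_mu, t, interval):
    # G_I: infimum (element-wise min) of the pairs over the window t ⊕ I.
    a, b = interval
    window = rho_mu[t + a : t + b + 1]
    out = window[0]
    for r in window[1:]:
        out = pair_min(out, r)
    return out

def rho_F(rho_mu, t, interval):
    # F_I: supremum (element-wise max) of the pairs over the window t ⊕ I.
    a, b = interval
    window = rho_mu[t + a : t + b + 1]
    out = window[0]
    for r in window[1:]:
        out = pair_max(out, r)
    return out

# Predicate robustness pairs (Delta c, Delta eps) at t = 0, 1, 2.
rho_mu = [(0.5, 0.1), (-0.2, 0.05), (0.3, -0.1)]
g = rho_G(rho_mu, 0, (0, 2))   # (-0.2, -0.1): G violated in both components
f = rho_F(rho_mu, 0, (0, 2))   # (0.5, 0.1): F satisfied in both components
```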
We note that the quantitative semantics define operations similar to robustness estimation algorithms for STL, and we can thus leverage the efficiency and data structures used for monitoring STL formulas [30], with the crucial difference being the computation of the predicate robustness.

Figure 2.2: Distance to Boundary and Predicate Robustness

Remark 2.1 (Spatial vs. Temporal Robustness). We observe that, by omitting time-domain parameters, our definition of the robustness signal currently includes only spatial robustness. If we include time-domain parameters, we can also recover a notion of time-robustness of the given stochastic model to an StTL formula. This notion of time-robustness would support dilations or contractions of the time intervals of temporal operators to find the satisfaction boundary. For ease of exposition, we defer a full treatment to future work.

Robustness-Based Synthesis. We now present our solution for synthesizing control parameters for a given controller. We make a crucial observation: the robustness estimate preserves Boolean satisfaction of an StTL formula ϕ, i.e., if both elements of the robustness estimate pair are positive, then ϕ is satisfied by the given stochastic process x. Thus, given a controlled stochastic model M(C), maximizing the robustness estimate of its output w.r.t. a given StTL specification ϕ (by exploring the control parameter space) pushes the model M(C) towards satisfying ϕ. Formally, let x_ν(q)(t) denote the output stochastic process; then the controller synthesis problem is formalized as follows:

ν⋆(q) = argmax_{ν(q) ∈ P_q} ρ(ϕ, x_ν(q))    (2.11)

To avoid placing restrictions on the system or the formula, we address this problem using stochastic optimization; more specifically, we adopt a Particle Swarm Optimization (PSO) approach.
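A minimal executable sketch of such a PSO loop is shown below for a one-dimensional controller parameter. The hyper-parameter values and the quadratic surrogate standing in for the robustness estimate are illustrative assumptions; in the actual method the objective would be ρ(ϕ, x_ν(q)) evaluated via simulation.

```python
import random

def pso_maximize(rho, lo, hi, S=20, iters=50, wv=0.5, wp=1.5, wg=1.5, seed=0):
    # Maximize a black-box objective `rho` over [lo, hi] with S particles.
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(S)]
    vel = [rng.uniform(-(hi - lo), hi - lo) for _ in range(S)]
    best = pos[:]                       # each particle's best known position
    gbest = max(best, key=rho)          # swarm's best known position
    for _ in range(iters):
        for i in range(S):
            rp, rg = rng.random(), rng.random()
            vel[i] = (wv * vel[i]
                      + wp * rp * (best[i] - pos[i])
                      + wg * rg * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)   # clamp to bounds
            if rho(pos[i]) > rho(best[i]):
                best[i] = pos[i]
            if rho(best[i]) > rho(gbest):
                gbest = best[i]
    return gbest

# Toy robustness surrogate with its maximum at q = 2.
q_star = pso_maximize(lambda q: -(q - 2.0) ** 2, lo=-5.0, hi=5.0)
```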
PSO is a prominent optimization method that does not require the objective function to be differentiable, which perfectly fits our problem, as the robustness signal is not differentiable in general. PSO works by keeping track of a population of S candidate controller parameters (called particles). These particles are moved around in the controller parameter space using simple formulas. The movements are guided by each particle's best known position and velocity, as well as the swarm's best known position. Algorithm 1 gives the pseudo-code of PSO for the robustness maximization problem given by Eq. (2.11). PSO is a fairly simple algorithm: we pick a certain number of particles randomly initialized to a "position". We use the function uniform_random to indicate uniform random sampling in the interval represented by its arguments. The position of each particle represents a valuation of the controller parameters. For each particle we track its best known position, i.e., the valuation for that particle which yields the best robustness estimate. Additionally, we also track the best position in the entire swarm of particles. Each particle position is updated randomly based on the hyper-parameters ω_v, ω_p and ω_g. Although PSO can deal with unbounded controller parameter spaces, in practice it is often preferable to place a lower bound ν_ℓ and an upper bound ν_u on the controller parameter valuations. Additionally, the hyper-parameters ω_v, ω_p, ω_g are design choices that control the performance and efficacy of the method; for ways to choose these parameters, the interested reader may consult [31]. Finally, the termination condition could be a maximum number of iterations or a suitable threshold ε on the robustness value.

2.4 Specification Mining for StTL

In this section, we address the problem of specification mining for a given PStTL formula.
Algorithm 1: Robustness via Particle Swarm Optimization
1  procedure PSORobustness(ρ, S, ν_ℓ, ν_u, ω_v, ω_p, ω_g):
2    numIterations ← 0
3    Initialize swarm's best position ν∗ ~ uniform_random(ν_ℓ, ν_u)
4    for each particle i = 1, …, S do
5      Initialize particle position ν_i ~ uniform_random(ν_ℓ, ν_u)
6      Initialize best particle position ν∗_i ← ν_i
7      if ρ(ϕ, x_{ν_i(q)}) > ρ(ϕ, x_{ν∗(q)}) then
8        Update swarm's best position ν∗ ← ν∗_i
9      end
10     Initialize velocity v_i ~ uniform_random(−|ν_u − ν_ℓ|, |ν_u − ν_ℓ|)
11   end
12   while ρ(ϕ, x_{ν∗(q)}) < ε and numIterations < maxIterations do
13     for each particle i = 1, …, S do
14       for each dimension d = 1, …, |q| do
15         Generate random numbers r_p, r_g ~ uniform_random(0, 1)
16         u_{i,d} ← ω_p r_p (ν∗_{i,d} − ν_{i,d}) + ω_g r_g (ν∗_d − ν_{i,d})
17         v_{i,d} ← ω_v v_{i,d} + u_{i,d}
18       end
19       ν_i ← ν_i + v_i
20       if ρ(ϕ, x_{ν_i(q)}) > ρ(ϕ, x_{ν∗_i(q)}) then
21         Update particle's best position ν∗_i ← ν_i
22         if ρ(ϕ, x_{ν∗_i(q)}) > ρ(ϕ, x_{ν∗(q)}) then
23           Update swarm's best position ν∗ ← ν∗_i
24         end
25       end
26     end
27     numIterations ← numIterations + 1
28   end
29   return ν∗

Recall that in this problem, we are given a controlled stochastic system M(C), where the controller parameters are fixed, i.e., C = C(ν∗(q)) for some fixed ν∗. Recall also that we denote the stochastic output process of the model by x_{ν∗(q)}(t), or x(t) in brief. Given a PStTL formula ϕ(π), the specification mining problem tries to infer the valuation ν⋆ for π such that the resulting StTL formula ϕ(ν⋆(π)) is marginally satisfied by x(t). This problem is closely related to finding under-approximations of the sets P_π^⊤ and P_π^⊥, and thus an over-approximation of P_π^B. A brute-force approach for finding the satisfaction boundary is to consider a finite set of points G in the parameter space (for example a grid), and define a function S that maps each valuation ν corresponding to a point in G to ⊤ if it belongs to P_π^⊤, and to ⊥ if it belongs to P_π^⊥.
Consider the following sets:

P̂_π^⊤ = { ν | ν ∈ G ∧ x ⊨ ϕ(ν(π)) }
P̂_π^⊥ = { ν | ν ∈ G ∧ x ⊭ ϕ(ν(π)) }    (2.12)

An attractive option is then to somehow seek to generalize the sets P̂_π^⊤ and P̂_π^⊥ to approximate the satisfaction boundary. However, for general PStTL formulas, given two satisfying (resp. violating) valuations, there may be convex combinations of these valuations that are not satisfying (resp. violating).

Example 2.1. Consider the formula G_[0,10] (P(x_1 > c) ≥ 0.5 ∧ P(x_2 ≤ c) ≥ 0.5). Let u(x) denote the Heaviside step function, i.e., u(x) = 1 if x ≥ 0 and 0 otherwise. Let the CDFs of x_1 and x_2 be u(x_1 − 2) and u(x_2 − 1), respectively, for all times t. Then, the formula is satisfied if ν(c) ∈ (1, 2], but is not satisfied otherwise. A sampling method that only considers valuations ν(c) ≤ 1 and ν(c) > 2 would miss the satisfying region altogether!

In addition to the unsatisfying mathematical properties of the sampling-based approach, it can be a computationally expensive procedure. For example, using a grid-based approach, where we pick M grid points along each parameter dimension, requires the satisfaction of ϕ(π) to be evaluated at M^{|π|} points. An identical problem has been identified for parameter inference in STL, where it has been shown in [14, 32] that for the monotonic fragment of Parametric STL the inference problem can be efficiently solved. Furthermore, in [33, 29], the authors have shown that if the satisfaction function is monotonic, the satisfaction boundary can be efficiently approximated to a desired degree of precision. We now show that for the monotonic fragment of PStTL the satisfaction boundary can also be computed efficiently. We also propose a new specification mining algorithm for a reduced subset of PStTL (where each parameter name is unique and there are no time-domain or probability threshold parameters), for which the satisfaction boundary can be computed exactly.

Definition 2.10 (Monotonic PStTL).
A PStTL formula ϕ is said to be monotonically increasing in parameter π_i if condition (2.13) holds for any stochastic process x, and is said to be monotonically decreasing in parameter π_i if condition (2.14) holds for every x:

ν(π_i) ≤ ν′(π_i) ∧ x ⊨ ϕ(ν(π_i)) ⟹ x ⊨ ϕ(ν′(π_i))    (2.13)
ν(π_i) ≥ ν′(π_i) ∧ x ⊨ ϕ(ν(π_i)) ⟹ x ⊨ ϕ(ν′(π_i))    (2.14)

A multi-parameter formula ϕ(π) is said to be monotonic if and only if it is monotonic in each parameter π_i ∈ π.

In the case of STL, it has been shown by several previous works [32, 14, 20, 19] that the monotonic fragment of STL captures a large number of useful properties in practice. We argue that, by extension, monotonic PStTL formulas also capture a broad class of interesting uncertainty-aware temporal specifications. We note that by restricting our attention to monotonic PStTL formulas, it is not possible to express formulas such as those that reuse a single formula parameter in several signal predicates, or interval bounds on temporal operators where both bounds of the interval are timed formula parameters. In what follows, we assume without loss of generality that the given formula is monotonically increasing with respect to each parameter. (There is no loss of generality because we can always replace a monotonically decreasing parameter π_j with a new parameter π′_j = −π_j.) We now present our specification mining algorithm, which is based on the observation that, given the monotonicity assumption, it suffices to find the parameter valuation at which the validity of the formula changes.

Least Permissive Valuation Signal. We assume that the index set of the stochastic process is some sequence of discrete time indices, i.e., T = {t_0, …, t_T}, where t_0 = 0. Our algorithm performs a recursive traversal of the PStTL formula. For each subformula, the algorithm iterates over all the time points in the horizon of the subformula to compute the least permissive valuation signal.
This signal, denoted ρ_lpv(ψ, π, t), gives the least valuation ν∗ for the parameters in π such that ψ(ν∗(π)) is satisfied by x. We include the formula parameter π in defining ρ_lpv(ψ, π, t) as a way of indicating that the ρ_lpv function will return a valuation for the specified formula parameters in π at time t. We now present the details of computing ρ_lpv(ψ, π, t). If the subformula μ is an atomic predicate μ = P(x(t) ≤ c) ≥ 1 − δ (where δ ∈ [0, 1] is some numeric constant), then ρ_lpv(μ, c, t) is the smallest valuation ν(c) such that P(x(t) ≤ ν(c)) ≥ 1 − δ is true. Such a valuation can be easily computed from the inverse CDF of x at the value 1 − δ, as shown in the first case. For the other subformulas ψ of the given formula, the ρ_lpv(ψ, π, t) signal is calculated as shown in Alg. 2. For a negation, the set of formula parameters for which ρ_lpv is returned remains unchanged, but the sign of the least permissive values is flipped. For Boolean combinations, as the formula parameters in each subformula are disjoint, ρ_lpv returns the concatenation of the least permissive valuations for both subformulas. For temporal operators, the least permissive valuation requires further explanation. For ease of exposition, we first explain the special cases of the G and F operators. Consider the formula G_I (P(x ≤ c) > 1 − δ), where δ ∈ [0, 1] and I ⊆ T are known constants. At each time, the ρ_lpv signal for the atomic predicate (μ for short) will return the smallest value of c that makes the predicate satisfied at that time. Now, to satisfy the outer G operator, we need to pick the maximum among the least permissive valuations for c in the interval I. This guarantees that the formula G_I μ holds at time 0. Similarly, for a formula of the type F_I μ, we should choose the minimum value of c over the interval I. Now, to compute the least permissive valuation for a subformula of the type ϕ_1(π_1) U_I ϕ_2(π_2), we first note that the ρ_lpv signal will consist of the concatenation of the least permissive values of π_1 and π_2.
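A hedged sketch of the G/F cases just described: for the atom P(x(t) ≤ c) ≥ 1 − δ, the least permissive value of c at each time is the (1 − δ)-quantile of x, and G_I then takes the maximum (F_I the minimum) of that signal over the interval. The per-time sample lists and the crude empirical quantile rule are our simplifications, not the thesis's implementation.

```python
def quantile(samples, p):
    # Crude empirical inverse CDF F_x^{-1}(p, t) from an ensemble of samples.
    xs = sorted(samples)
    k = min(int(p * len(xs)), len(xs) - 1)
    return xs[k]

def lpv_atom(traces_at_t, delta):
    # traces_at_t[t] holds the ensemble samples of x at time t; the result
    # is the rho_lpv signal for the atom P(x(t) <= c) >= 1 - delta.
    return [quantile(s, 1.0 - delta) for s in traces_at_t]

def lpv_G(lpv_signal, interval):
    a, b = interval
    return max(lpv_signal[a : b + 1])   # tightest c making G_I hold

def lpv_F(lpv_signal, interval):
    a, b = interval
    return min(lpv_signal[a : b + 1])   # tightest c making F_I hold

traces = [[0.0, 1.0], [0.5, 1.5], [0.2, 0.8]]   # ensemble samples at t = 0, 1, 2
sig = lpv_atom(traces, delta=0.1)               # per-time 0.9-quantiles
c_G = lpv_G(sig, (0, 2))                        # least permissive c for G_[0,2]
```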
These values can be computed from the semantics of U in a fashion similar to that for G and F.

Algorithm 2: Least Permissive Valuation Computation
procedure LPVCompute(ψ, π):
    switch ψ(π) do
        case P(f(x) < c) ≥ 1 − δ do
            ∀t : ρ_lpv(ψ, c, t) = f(F_x⁻¹(1 − δ, t))
        case ¬ϕ(π) do
            ∀t : ρ_lpv(ψ, π, t) = −ρ_lpv(ϕ, π, t)
        case ϕ_1(π_1) ∧ ϕ_2(π_2) do
            ∀t : ρ_lpv(ψ, (π_1 ∪ π_2), t) = ρ_lpv(ϕ_1, π_1, t) ∪ ρ_lpv(ϕ_2, π_2, t)
        case ϕ_1(π_1) U_I ϕ_2(π_2) do
            ∀t : v_1(t) = ρ_lpv(ϕ_1, π_1, t)
            ∀t : v_2(t) = ρ_lpv(ϕ_2, π_2, t)
            ∀t : ρ_lpv(ψ, (π_1 ∪ π_2), t) = (max_{t″ ∈ [t, t′)} v_1(t″), min_{t′ ∈ t⊕I} v_2(t′))
    return ρ_lpv(ψ, π, t)

Theorem 2.1. Let x(t) be a stochastic process, let ψ(π) be a monotonically increasing stochastic STL formula parametrized by a set of formula parameters belonging to P_π, and let [0, T] be a discretized time interval. Let ≺ denote a total order on the formula parameter space that respects the monotonicity order (i.e., ν(π) ≺ ν′(π) is equivalent to ν(π) < ν′(π) when π is monotonically increasing, and is equivalent to ν(π) > ν′(π) when it is monotonically decreasing). Algorithm 2 gives a unique partitioning of P_π defined as follows:

P_π = P_π^⊤ ∪ P_π^⊥ ∪ P_π^B    (2.15)

∀ν ∈ P_π^⊤, ∀ν′ ∈ P_π^⊥ : ν′ ≺ ν    (2.16)

P_π^B = ρ_lpv(ψ, π, 0)    (2.17)

Proof. The proof follows from the definition of the ≺ relation, and the following observation:

∀ν ∈ P_π^⊤, ∀ν′ ∈ P_π^⊥, ∀ν* ∈ P_π^B : ν′ ≺ ν* ≺ ν

The complexity of the proposed algorithm is O(|π|T), radically better than the O(M^|π|) complexity of the aforementioned grid method. Our algorithm exploits monotonicity in order to find the boundary at which the formula transitions from true to false, without redundantly searching over the entire parameter space. In addition, we should also note that in real-world applications we are primarily interested in finding the satisfaction boundary rather than any particular validating or falsifying parameter valuation.
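The recursion of Algorithm 2 can be sketched in Python for a small fragment: atomic chance constraints plus the G and F special cases. This is a minimal illustrative sketch, not the thesis implementation; the per-time marginals are assumed Gaussian here purely so that the inverse CDF F_x⁻¹(1 − δ, t) is available in closed form, and all function names are ours.

```python
from statistics import NormalDist

def lpv_atomic(means, stds, delta):
    """Least permissive valuation signal for P(x(t) <= c) >= 1 - delta:
    at each t, the smallest valid c is the (1 - delta)-quantile of x(t)."""
    return [NormalDist(m, s).inv_cdf(1 - delta) for m, s in zip(means, stds)]

def lpv_globally(sig, interval):
    """G_I: take the max over I of the per-time least permissive values."""
    lo, hi = interval
    return max(sig[lo:hi + 1])

def lpv_finally(sig, interval):
    """F_I: take the min over I (one satisfying time point suffices)."""
    lo, hi = interval
    return min(sig[lo:hi + 1])

# Toy process with drifting mean and unit standard deviation per step.
sig = lpv_atomic(means=[0.0, 1.0, 2.0, 3.0], stds=[1.0] * 4, delta=0.1)
c_G = lpv_globally(sig, (0, 3))  # boundary valuation for G_[0,3]
c_F = lpv_finally(sig, (0, 3))   # boundary valuation for F_[0,3]
```

By monotonicity, any c ≥ c_G validates the G formula and any c < c_G falsifies it, which is exactly the partition described in Theorem 2.1.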
For example, consider the formula:

ϕ = G_[0,∞](P[x < c] > 0.99)    (2.18)

where x is a stochastic process that models the velocity of a car. Obviously, by setting c = 200 mph, we would obtain a valid formula. However, such a valuation is not informative about the behavior of the car. It would be much more informative to know that the velocity never exceeds 60 mph rather than that it never exceeds 200 mph. Therefore, the most informative valuations are those that lie on the satisfaction boundary, which is exactly what Algorithm 2 returns.

Robustness and Boundary Estimation. We remark that computing the robustness w.r.t. a given parameter valuation for a PStTL formula is a straightforward implementation of the quantitative semantics given in Eq. 2.10. In other words, it computes (Δc, Δδ) for a single point (ν(c), ν(δ)) in the parameter space. However, suppose we are given a PStTL formula that effectively defines a family of StTL formulas (by changing the valuations of the formula parameters), and suppose we wish to compute the robustness estimate of the given stochastic model for each of these StTL formulas. Then, repeated application of the above procedure would be an inefficient way to obtain the robustness values. Instead, if we have full knowledge of the satisfaction boundary of the PStTL formula, then obtaining the robustness value for any single valuation in the formula parameter space becomes a constant-time lookup operation on the known boundary.

2.5 Experimental Results

To validate the above-mentioned formalism, we consider an experimental analysis consisting of two parts. In the first part, we verify the correctness of the specification mining algorithm proposed in Sec. 2.4. In the second case study, we apply our unified framework for mining specifications and designing a controller in a real-world system, more specifically in a Model Predictive Adaptive Cruise Controller. We show the block diagram of the controller in Fig. 2.4.
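Both case studies that follow check chance constraints of the form P[x ≤ c] ≥ 1 − δ against PDFs estimated from measured or simulated traces. With an ensemble of traces, such a check reduces to an empirical frequency; a minimal sketch (our own helper names, standing in for the ensemble statistics method used in the experiments):

```python
def empirical_prob(samples, predicate):
    """Estimate P[predicate(x(t))] from an ensemble of trace values
    observed at a single time step."""
    return sum(1 for s in samples if predicate(s)) / len(samples)

def atomic_holds(samples, c, delta=0.1):
    """Check the atomic chance constraint P[x(t) <= c] >= 1 - delta."""
    return empirical_prob(samples, lambda s: s <= c) >= 1 - delta

# 9 of 10 ensemble values lie below c = 9, so the constraint holds
# with delta = 0.1, but fails for the tighter threshold c = 8.
ensemble = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```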
2.5.1 Evaluation of the Specification Mining Algorithm

For the first case study, we performed experimental measurements of sensor readings, actuator inputs, driver actions and environmental conditions in a vehicle driven by one of the co-authors. We recorded several descriptive quantities of the vehicle's behavior, including position, velocity, acceleration, fuel consumption, throttle position, power and torque output. We used a standard off-the-shelf OBD (On-Board Diagnostics) port logging device to obtain this data. We illustrate our proposed specification mining algorithm using the measured vehicle speed, which is modeled as a stochastic process x(t). To obtain an approximation of the PDF p(x, t) of the vehicle's velocity, we applied the ensemble statistics method using a Gaussian function as the kernel. Following that, we define the following parametric StTL formula for the vehicle's speed:

φ_1(c_1, c_2) = G_[0,30](P[x ≥ c_1] ≥ 0.9 ⟹ F_[0,10] P[x ≤ c_2] ≥ 0.9)    (2.19)

To find the parameter valuations which validate or falsify the formula, we implemented and tested two methods:

1. Grid Method: We chose c_1 and c_2 to be in the ranges [0, 20] and [0, 45], respectively, and sampled 100 points along each dimension. Then, we created a 2-dimensional grid of points and tested the validity of the formula at each one of them. This required 100² computational steps.

2. Specification Mining (Algorithm 2): We can easily verify that the formula is monotonic with respect to both c_1 and c_2. Therefore, the assumptions of Algorithm 2 are satisfied. To test the algorithm, we discretized the time horizon (T = 30 s) into 100 time instances and applied Algorithm 2. Since the formula contains |P| = 2 parameters, this required 2 · 100 computational steps.

Figure 2.3: Comparison of Algorithm 2 against the grid method. Red denotes the satisfaction boundary.

The results of the two methods are shown in Fig. 2.3.
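For reference, the brute-force baseline can be sketched as follows. The `formula_holds` predicate below is a hypothetical stand-in for a full StTL validity check, but the sketch makes the cost explicit: the grid method evaluates the formula at every point of the Cartesian product, i.e., M^|π| evaluations for M samples per parameter.

```python
import itertools

def grid_mine(formula_holds, axes):
    """Grid method: evaluate a parametric formula at every point of the
    Cartesian product of the parameter axes (M^|pi| evaluations)."""
    return {pt: formula_holds(*pt) for pt in itertools.product(*axes)}

# Toy monotonically increasing 'formula': valid iff both thresholds are
# at least as permissive as the (hypothetical) true boundary (4, 7).
formula_holds = lambda c1, c2: c1 >= 4 and c2 >= 7
axes = [range(10), range(10)]  # M = 10 samples per parameter
result = grid_mine(formula_holds, axes)
```

Even for this toy example the grid requires 100 evaluations, whereas the mining algorithm would need only 2 · T, one sweep per parameter.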
Observe that the satisfaction boundary given by Algorithm 2 approximates the one given by the grid method almost perfectly. This experimental result validates the accuracy of our specification mining algorithm. There are a couple of apparent discrepancies between the computed satisfaction boundary and its depiction in the figure. First, in plotting the results of the grid method, we assumed the formula to be conservatively false at all points where the satisfaction value was unknown; thus, the boundary indicated by the colors underapproximates the validity domain. Second, at a few points the red line seems to lie above a point indicated as a true valuation by the grid method; this is due to the resolution of plotting.

2.5.2 MPC-based Adaptive Cruise Controller

The second case study is inspired by the challenges arising in the semi-autonomous vehicle domain. Our experimental setup consists of a lead and an ego car traveling in the same lane of a freeway. The state of the ego (resp. lead) car is the position p_e(t) (resp. p_l(t)) and the velocity v_e(t) (resp. v_l(t)). The control input to the ego car is the acceleration a_e(t), which is given by a Model Predictive Control (MPC) based Adaptive Cruise Controller (ACC). Although the actual acceleration a_l(t) of the lead vehicle is unknown, the values of the relative distance d_rel(t) and relative velocity v_rel(t) are estimated by on-board sensors installed on the ego car. These estimates are corrupted by Gaussian sensor noise of unknown statistics. For both the ego vehicle and the lead vehicle, the dynamics between acceleration and velocity are modeled by the following transfer function:

G(s) = 1 / (s(0.5s + 1))    (2.20)

which approximates the dynamics of the throttle body and vehicle inertia [34]. The ACC operates in two distinct modes:

1. If d_rel(t) > d_safe(t), the speed control mode is active. The goal is to reach and maintain a reference velocity v_ref.

2.
If d_rel(t) < d_safe(t), the spacing control mode is active. The goal is to maintain the safe distance d_safe.

The safe distance between the lead car and the ego car is a function of the ego car velocity:

d_safe(t) = d_default + t_g v_ego(t)    (2.21)

where d_default is the preset default spacing between the cars and t_g is the gap time. A common rule of thumb for the gap time is the two-second rule, which states that the ego car should ideally stay 2 seconds behind the lead car. The default spacing was set to 10 meters.

Figure 2.4: Block Diagram of the MPC-based ACC System

Figure 2.5: (a) Mined parameters c_1 and c_2 for a conservative (K = 0.1), balanced (K = 0.5) and aggressive (K = 0.9) controller. (b) Satisfaction and robustness signals for Eq. 2.23 (c_1 = 60 m, c_2 = 30 m/s). The satisfaction signals overlap, and hence only one of them is visible.

The MPC-based ACC computes optimal control actions such that the system obeys the constraints imposed by physical restrictions on the velocity and acceleration, and adapts to the driving behavior of the lead car according to the two modes of operation. The behavior of the controller C(K) is adjusted by the parameter K ∈ [0, 1], whereby smaller values produce a more conservative controller with smoother control actions and larger values
produce a more aggressive controller with a faster response time. We implemented the MPC ACC system in Simulink and extracted the PDFs of the quantities of interest (position and velocity) by simulating the model M = 100 times and using the ensemble statistics method.

Specification Mining. In this part of the experimental study, we address the specification mining problem. Since safety is a critical component of such a system, we chose to test our specification mining algorithm using the safety formula ϕ_safe given in Eq. 2.23.

ψ := P[d_rel ≤ c_1] ≥ 0.9 ⟹ F_[0,10] G_[0,∞] P[v_e ≤ c_2] ≥ 0.9    (2.22)

ϕ_safe := G_[0,80] ψ    (2.23)

This formula states that whenever the relative distance drops below the threshold c_1 with probability at least 0.9, then within the subsequent 10 s the velocity of the ego vehicle should drop below c_2 with probability at least 0.9. Note that this formula is monotonically increasing with respect to both parameters. For the purposes of this case study, we tested three different controller behaviors, namely a conservative (K = 0.1), a balanced (K = 0.5) and an aggressive (K = 0.9) controller. For the purpose of illustrating our mining algorithm, we show in Fig. 2.5a the least permissive valuations for the parameters of the sub-formula ψ given in Eq. (2.22), i.e., instead of plotting the satisfaction boundary in the c_1, c_2 parameter space, we plot the pair of parameters that lead to a valid formula over time. This implies that, at each time instance, every pair of values c_1, c_2 shown in Fig. 2.5a is a validating valuation. The relation between the controller parameter K and the formula parameters c_1 and c_2 can be observed empirically through the figures. The balanced controller is safer with respect to this specification, as for higher bounds on the relative distance it leads to lower bounds on the velocity of the ego vehicle.

Robustness Computation.
In addition to mining specifications, we also tested the robustness computation according to the semantics given by Eq. 2.10 for the three aforementioned controller parameter valuations. To this end, we fixed the parameters of the formula in Eq. (2.23) to c_1 = 60 m, c_2 = 30 m/s. These values are chosen such that they obey the two-second rule. Looking at Fig. 2.5a, we observe that the mined parameter c_1 reaches the value of 60 m in the time interval [40, 50], and in that interval the value of the mined parameter c_2 lies well below 30 m/s for all three controllers. Given that the formula is monotonically increasing, we expect that the chosen parameters will lead to a valid formula. This is indeed verified by the satisfaction and robustness signals shown in Fig. 2.5b. In addition, we verify that the balanced controller performs better with respect to this specification, as it leads to higher robustness values.

Robust Design. For the last part of the experimental results, we test the controller design method. Specifically, we solve the optimization problem stated in Eq. (2.11) using the Particle Swarm Optimization (PSO) approach. We utilize the robustness of the PStTL formula given by Eq. (2.23) in the objective function and seek the controller parameter valuation K* which maximizes the average robustness. To test the design method for different formula parameters, we discretize c_1 and c_2 by sampling uniformly 10 points for each of them in the intervals [10, 100] and [5, 30], respectively. We set the maximum tolerance on the objective function change to 10⁻⁶ and run all the simulations on an Intel Core i7-7700HQ @ 2.8 GHz with 16 GB of RAM. The run-time for all parameter values ranged from 0.2247 s to 0.2840 s. The results for the optimal controller parameter value K* are shown in Fig. 2.6.

Figure 2.6: Controller design using PSO to maximize the robustness of Eq. (2.23) for different values of c_1, c_2.
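The PSO step admits a compact sketch. The `avg_robustness` surrogate below is a hypothetical stand-in for the simulated average robustness of Eq. (2.23) (the real objective requires running the Simulink model), and the swarm constants are common textbook defaults rather than the settings used in the thesis:

```python
import random

def pso_maximize(objective, lo, hi, n_particles=20, iters=60, seed=0):
    """Minimal particle swarm over a scalar controller parameter K."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest, pval = xs[:], [objective(x) for x in xs]
    gbest = pbest[pval.index(max(pval))]
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # inertia + attraction to personal and global bests
            vs[i] = (0.7 * vs[i] + 1.5 * r1 * (pbest[i] - xs[i])
                     + 1.5 * r2 * (gbest - xs[i]))
            xs[i] = min(hi, max(lo, xs[i] + vs[i]))  # clamp to [lo, hi]
            val = objective(xs[i])
            if val > pval[i]:
                pbest[i], pval[i] = xs[i], val
        gbest = pbest[pval.index(max(pval))]
    return gbest

# Hypothetical robustness surrogate peaking at a balanced controller.
avg_robustness = lambda K: -(K - 0.5) ** 2
K_star = pso_maximize(avg_robustness, 0.0, 1.0)
```

Because PSO is a black-box method, it requires only the ability to evaluate the robustness at a candidate K, which is why it pairs naturally with the simulation-based robustness estimates above.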
Note that the points have been interpolated for ease of presentation. Fig. 2.6 shows the power of using robustness-guided control design: in spite of uncertainty in the environment, for each parameterized specification we can pick a controller parameter K* for which the system demonstrates good behavior.

2.6 Related Work and Conclusions

Signal Temporal Logic [17] has been used for expressing properties and abstractions of closed-loop cyber-physical system models [18, 19, 20, 21]. There is also growing interest in using temporal logic specifications for controller synthesis of systems with uncertainty. Under the assumption of a known stochastic dynamical model, approaches have been proposed to find optimal control policies that maximize the probability of satisfying a given temporal logic specification by planning over stochastic abstractions [35, 36, 37, 38]. When the underlying model is unknown, reinforcement learning based methods have been used for specifications given as LTL [39] and STL [40] formulas.

Our work builds on recent work to extend STL to systems with uncertainty, specifically work on Probabilistic STL (PrSTL) [22], Chance Constrained Temporal Logic (C2TL) [9], and Stochastic STL (StSTL) [23]. The atomic predicates in the logic C2TL are affine constraints on variables (or arbitrary Boolean combinations of such constraints), where the coefficients in the constraints are assumed to be random variables with a Gaussian distribution. The signal variables themselves are assumed to evolve deterministically. For this logic, the authors show how the synthesis problem with a C2TL objective can be conservatively encoded as a mixed integer linear program. In [22], the authors define PrSTL, which uses probabilistic atomic predicates parameterized with a time-varying random variable drawn from a given distribution.
The authors then synthesize a receding horizon controller that satisfies PrSTL specifications using a mixed integer semi-definite encoding of the synthesis problem. The work in [23] focuses on using StSTL specifications as contracts for verification using a similar mixed integer linear programming formulation. Each of these logics separately contains some of the elements that we desire for a stochastic temporal logic; our logic fundamentally differs in spirit. While previous logics focus on specific uncertainties in an otherwise deterministic system, the target of our proposed logic is reasoning about arbitrary stochastic dynamical systems, where the source of uncertainty could be in the environment or the system itself.

Our work heavily draws on our previous work in [24], which introduces Stochastic Temporal Logic. While our prior work introduces Boolean semantics for StTL, in this paper we propose a new quantitative semantics for StTL. While [24] also provides a parameter inference algorithm, we believe that our presentation in this paper formalizes many of the concepts. Furthermore, our paper also makes a connection between parameter inference and different notions of robust satisfaction for StTL. Finally, this paper suggests a practical use for leveraging robustness values as a heuristic for feedback control-based satisfaction of StTL specifications, which is not considered in [24].

Our work on robust satisfaction semantics for StTL draws on previous work on quantitative semantics for MTL [41] and STL [42]. The crucial difference is that our calculation of robustness is w.r.t. a stochastic process, and not a signal. This difference is crucial as it requires knowledge of the time-varying PDF (through a model based on first principles or a data-driven model), and the robustness of atomic predicates is significantly different from that for STL or MTL.
The parameter inference approach proposed in this paper was developed independently of the work that appears in [43]. However, our algorithm closely mimics the algorithm in [43]. The signal that we call the least permissive valuation signal is called the parameter validity signal in [43]. A key difference is the parameter inference for atomic predicates, which in our case are chance constraints, while that in [43] is for signal predicates. The algorithm in [43] is, however, more general than the one presented here, as it reasons about the propagation of Pareto fronts of satisfying parameter valuations, while the restrictions we impose on PStTL allow us to avoid this.

In contrast to the extensions of the logical formalism adopted by the previous works, the approach in [44] uses a different interpretation of a given STL formula to define probabilistic quantitative semantics. The key idea is to extend the notion of robustness for STL formulas to stochastic systems by considering distributions over robustness values induced by the stochasticity in the system. The authors then define quantitative semantics using notions of average and conditional average robustness.

Also of note are approaches that use Probabilistic Computation Tree Logic (PCTL), which was introduced to express properties over the realizations of Markov chains and Markov Decision Processes [45]. A significant amount of research has focused on using PCTL for verification and optimal control of Markov Decision Processes [46, 47, 48, 49]. We remark that these approaches rely on the system structure to perform model checking or to design control strategies w.r.t. a given PCTL formula.

Limitations and Future Work. Our approach for control design essentially considers only parameterized control laws where the controller parameters are unknown.
Furthermore, the optimization procedure that we use is a best-effort, black-box optimization procedure that does not provide any theoretical guarantees on convergence or error bounds. It would be interesting to consider approaches such as statistical model checking [50, 51] to prove properties of the controllers synthesized using our procedure. It would also be interesting to consider extensions to machine learning based control architectures such as [40]. Our technique for mining specifications is also currently limited to template specifications that are specified by the user. In the future, we will consider combining structure learning techniques such as [52, 53] with the formula parameter inference procedure. One direction for further investigation relates to the estimation of the PDFs, where we currently use the ensemble statistics method. The precision of estimating the PDFs is affected by the selection of the kernels. In order to bound the error on the robustness, or to determine the number of simulation traces required to attain a specific precision in estimating the PDFs and thereby an error bound on the robustness computation, we will investigate advanced probability theory techniques such as Hoeffding's inequality, concentration inequalities, or generalized central limit theorem arguments.

Conclusion. In this chapter, we present Stochastic Temporal Logic (StTL) as an amalgam of previous efforts to extend Signal Temporal Logic to stochastic systems. Our definition allows us to use StTL as a specification formalism for arbitrary controlled stochastic dynamical systems. After defining StTL, we introduce quantitative semantics to reason about the degree of satisfaction of an StTL specification, we present an algorithm for their computation, and we propose a method for designing optimal controllers by maximizing the robustness of the given specification using a Particle Swarm Optimizer.
Additionally, we develop an efficient specification mining algorithm which is based on iterating over the time horizon of each sub-formula and finding the least permissive parameter valuation that yields a valid formula. Our algorithm correctly identifies the satisfaction boundary in the parameter space where the validity of the formula changes, and it outperforms the grid method. We demonstrate and validate our framework and algorithms in two case studies from the automotive domain.

Chapter 3
Learning Topological Features

3.1 Introduction

For the following two chapters, we shift our attention to learning problems arising in CPS; the main topic of interest is the handling of topological and hierarchical features that are especially pervasive in such applications. Revisiting the Smart City example of Sec. 1.1, it is evident that the plethora of data collected by sensor and vision systems in such a CPS would be highly unstructured, characterized by recursive patterns, hierarchical features and topological representations. Therefore, in the following two chapters we set out to uncover such intricacies in CPS and to develop novel machine learning frameworks that will allow us to leverage the collected data at full capacity. We start in this chapter by addressing the issue of topological features, utilizing persistent homology as our main tool. Persistent homology is a topological data analysis tool which tracks how topological features (e.g., connected components, cycles, cavities) appear and disappear as we analyze the data at different scales or in nested sequences of subspaces [54, 55]. A nested sequence of subspaces is known as a filtration. As an informal example of a filtration, consider an image of variable brightness: as the brightness is increased, certain features (edges, texture) may become more or less prevalent.
The birth of a topological feature refers to the "time" (i.e., the brightness value) when it appears in the filtration, and the death refers to the "time" when it disappears. The lifespan of the feature is called its persistence. Persistent homology summarizes these topological characteristics in the form of a multiset called a persistence diagram, which is a highly robust and versatile descriptor of the data. Persistence diagrams enjoy the stability property, which ensures that the diagrams of two similar objects are similar [56]. Additionally, under some assumptions, one can approximately reconstruct the input space from a diagram (which is known as solving the inverse problem) [57]. However, despite their strengths, the space of persistence diagrams lacks structure, as basic operations such as addition and scalar multiplication are not well defined. The only imposed structure is induced by the Bottleneck and Wasserstein metrics, which are notoriously hard to compute, thereby preventing us from leveraging them for machine learning tasks.

Related Work. To address these issues, several vectorization methods have been proposed. Some of the earliest approaches are based on kernels, i.e., generalized products that turn persistence diagrams into elements of a Hilbert space. Kusano et al. [58] propose a persistence weighted Gaussian kernel which allows them to explicitly control the effect of persistence. Alternatively, Carrière et al. [59] leverage the sliced Wasserstein distance to define a kernel that mimics the distance between diagrams. The approaches by Bubenik [60] based on persistence landscapes, by Reininghaus et al. [61] based on scale space theory, and by Le et al. [62] based on the Fisher information metric are along the same line of work. The major drawback of kernel methods is that they suffer from scalability issues, as training scales poorly with the number of samples.
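To make birth, death and diagram distances concrete, the self-contained sketch below computes the 0-dimensional persistence diagram of a function sampled on a line (a sublevel-set sweep with the classic elder rule) and compares two tiny diagrams with a brute-force Wasserstein distance. Both are illustrative textbook implementations written for this section, not the method developed in this chapter.

```python
import itertools
import math

def persistence0(f):
    """0-dim persistence pairs (birth, death) of the sublevel-set
    filtration of f on a path graph, via union-find and the elder rule."""
    order = sorted(range(len(f)), key=lambda i: f[i])
    parent, birth, pairs = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i], birth[i] = i, f[i]
        for j in (i - 1, i + 1):
            if j in parent:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Elder rule: the younger component dies at f[i].
                    young, old = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
                    if birth[young] < f[i]:  # skip zero-persistence pairs
                        pairs.append((birth[young], f[i]))
                    parent[young] = old
    pairs.append((min(f), math.inf))  # essential feature: never dies
    return pairs

def wasserstein(D, E, p=2, q=2):
    """Brute-force p-Wasserstein distance between tiny finite diagrams:
    augment each side with diagonal projections of the other's points,
    then minimise over all bijections (exponential; illustration only)."""
    proj = lambda pt: ((pt[0] + pt[1]) / 2.0,) * 2  # nearest diagonal point
    A = list(D) + [proj(e) for e in E]
    B = list(E) + [proj(d) for d in D]
    cost = lambda a, b: (abs(a[0] - b[0]) ** q + abs(a[1] - b[1]) ** q) ** (1 / q)
    best = min(sum(cost(a, b) ** p for a, b in zip(A, perm))
               for perm in itertools.permutations(B))
    return best ** (1 / p)

# Two local minima (heights 0 and 1) merge at height 2; the component
# born at the global minimum is the essential feature.
diagram = persistence0([0.0, 2.0, 1.0, 3.0])
```

Matching a point to its diagonal projection is what lets diagrams of different cardinalities be compared; points with infinite death must be handled separately, which is precisely the difficulty with essential features discussed below.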
In another line of work, researchers have constructed finite-dimensional embeddings, i.e., transformations turning persistence diagrams into vectors in a Euclidean space. Adams et al. [63] map the diagrams to persistence images and discretize them to obtain the embedding vector. Carrière et al. [64] develop a stable vectorization method by computing pairwise distances between points in the persistence diagram. An approach based on interpreting the points in the diagram as roots of a complex polynomial is presented by Di Fabio [65]. Adcock et al. [66] identify an algebra of polynomials on the diagram space that can be used as coordinates, and the approach is extended by Kališnik in [67] to tropical functions, which guarantee stability. The common drawback of these embeddings is that the representation is pre-defined, i.e., there exist no learnable parameters; therefore, it is agnostic to the specific learning task. This is clearly sub-optimal, as the eminent success of deep learning has demonstrated that it is preferable to learn the representation.

The more recent approaches aim at learning the representation of the persistence diagram in an end-to-end fashion. Hofer et al. [68] present the first input layer based on a parameterized family of Gaussian-like functionals, with the mean and variance learned during training. They extend their method in [69], allowing a broader class of parameterized function families to be considered. It is quite common to have topological features of infinite persistence [54], i.e., features that never die. Such features are called essential and in practice are usually assigned a death time equal to the maximum filtration value. This may restrict their expressivity because it shrinks their importance relative to non-essential features.
While we may be able to increase the scale sufficiently high and end up having only one trivial essential feature (i.e., the 0-th order persistent homology group that becomes a single connected component at a sufficiently large scale), the resulting persistence diagrams may not be the ones that best summarize the data in terms of performance on the underlying learning task. This is evident in the work by Hofer et al. [68], where the authors showed that essential features offer discriminative power. The work by Carrière et al. [70], which introduces a network input layer that encompasses several vectorization methods, emphasizes the importance of essential features and is the first one to introduce a deep learning method incorporating extended persistence as a way to deal with them.

In this paper, we approach the issue of essential features from the geometric viewpoint. We are motivated by the recent success of hyperbolic geometry and the interest in extending machine learning models to hyperbolic spaces or general manifolds. We refer the reader to the review paper by Bronstein et al. [71] for an overview of geometric deep learning. Here, we review the most relevant and pivotal contributions in the field. Nickel et al. [72, 73] propose Poincaré and Lorentz embeddings for learning hierarchical representations of symbolic data and show that their representational capacity and generalization ability outperform Euclidean embeddings. Sala et al. [74] propose low-dimensional hyperbolic embeddings of hierarchical data and show competitive performance on WordNet. Ganea et al. [75] generalize neural networks to the hyperbolic space and show that hyperbolic sentence embeddings outperform their Euclidean counterparts on a range of tasks. Gulcehre et al. [76] introduce hyperbolic attention networks, which show improvements in terms of generalization on machine translation and graph learning while keeping a compact representation.
In the context of graph representation learning, hyperbolic graph neural networks [77] and hyperbolic graph convolutional neural networks [78] have been developed and shown to lead to improvements on various benchmarks. However, despite this success of geometric deep learning, little work has been done in applying these methods to topological features, such as persistence diagrams.

The main contribution of this chapter is to bridge the gap between topological data analysis and hyperbolic representation learning. We introduce a method to represent persistence diagrams on a hyperbolic space, more specifically on the Poincaré ball. We define a learnable parameterization of the Poincaré ball and leverage the vectorial structure of the tangent space to combine (in a manifold-preserving manner) the representations of individual points of the persistence diagram. Our method learns better task-specific representations than the state of the art because it does not shrink the relative importance of essential features. In fact, by allowing the representations of essential features to get infinitesimally close to the boundary of the Poincaré ball, their distance to the representations of non-essential features approaches infinity, therefore preserving their relative importance. To the best of our knowledge, this is the first approach for learning representations of persistence diagrams in non-Euclidean spaces.

3.2 Persistent Homology

In this section, we provide a brief overview of persistent homology leading up to the definition of persistence diagrams. We refer the interested reader to the papers by Edelsbrunner et al. [54, 55] for a detailed overview of persistent homology. An overview of homology can be found in the Appendix.

Persistent Homology. Let K be a simplicial complex. A filtration of K is a nested sequence of subcomplexes that starts with the empty complex and ends with K,

∅ = K_0 ⊆ K_1 ⊆ ... ⊆ K_d = K.
(3.1)

A typical way to construct a filtration is to consider sublevel sets of a real-valued function f : K → R. Let a_1 < ··· < a_d be the ordered values attained by f; taking K_i to be the subcomplex on which f does not exceed a_i yields such a filtration.

Definition 3.1 (Persistence diagram). Let Δ = {(x_1, x_2) ∈ R² : x_1 = x_2} denote the diagonal, taken with infinite multiplicity, and let R²_* = {(x_1, x_2) ∈ R² : x_2 > x_1}. Also, let n be a homology dimension and consider the sublevel set filtration induced by a function f : K → R over the complex K. Then, a persistence diagram, D_n(f), is a multiset of the form D_n(f) = {x : x ∈ R²_*} ∪ Δ constructed by inserting each point (a_i, a_j) for i < j with multiplicity μ_n^{i,j} (or μ_n^{i,∞} if it is an essential feature). We denote the space of all persistence diagrams by D.

Definition 3.2 (Wasserstein distance and stability). Let D_n(f), E_n(g) be two persistence diagrams generated by the filtrations induced by the functions f, g : K → R, respectively. We define the Wasserstein distance

w_p^q(D_n(f), E_n(g)) = inf_η ( Σ_{x ∈ D_n(f)} ‖x − η(x)‖_q^p )^{1/p},    (3.5)

where p, q ∈ N and the infimum is taken over all bijections η : D_n(f) → E_n(g). The special case p = ∞ is known as the Bottleneck distance. The persistence diagrams are stable with respect to the Wasserstein distance if and only if w_p^q(D_n(f), E_n(g)) ≤ ‖f − g‖_∞.

Note that a bijection η between persistence diagrams is guaranteed to exist because their cardinalities are equal, considering that, as per Def. 3.1, the points on the diagonal are added with infinite multiplicity.

Figure 3.1: Illustration of our method: Initially, the points are transferred via the auxiliary transformation ρ and the parameterization φ to the Poincaré ball B, where learnable parameters θ are added. Then, the logarithmic map is used for transforming the points to the tangent space T_{x_0}B. Finally, the resulting vectors are added and transformed back to the manifold via the exponential map. Note that the persistence diagram is mapped to a single point on the Poincaré ball (i.e., Φ(D, θ) ∈ B).

The strength of persistent homology stems from the above
stability definition, which essentially states that the map taking a sublevel function to the persistence diagram is Lipschitz continuous. This implies that if two objects are similar then their persistence diagrams are close.

3.3 Persistent Poincare Representations

In this section, we introduce our method (Fig. 3.1) for learning representations of persistence diagrams on the Poincare ball. We refer the reader to the Appendix for some fundamental concepts of differential geometry. The Poincare ball is an m-dimensional manifold (B, g^B_x), where B = {x ∈ R^m : ‖x‖ < 1} is the open unit ball. The space in which the ball is embedded is called the ambient space and is assumed to be equal to R^m. The Poincare ball is conformal (i.e., angle-preserving) to the Euclidean space but it does not preserve distances. The metric tensor and distance function are as follows

$$g^B_x = \lambda_x^2 g^E, \qquad \lambda_x = \frac{2}{1 - \|x\|^2}, \qquad d_B(x, y) = \operatorname{arccosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right), \quad (3.6)$$

where g^E = I_m is the Euclidean metric tensor. Eq. 3.6 highlights the benefit of using the Poincare ball for representing persistence diagrams. Contrary to Euclidean spaces, distances in the Poincare ball can approach infinity for finite points. This space is ideal for representing essential features appearing in persistence diagrams without squashing their importance relative to non-essential features. Informally, this is achieved by allowing the representations of the former to get infinitesimally close to the boundary, whereby their distances to the latter approach infinity. Fig. 3.2 provides an illustration.

We gradually construct our representation through a composition of 3 individual transformations. The first step is to transfer the points to the ambient space (i.e., R^m) of the Poincare ball. Let D¹ be a persistence diagram. We introduce the following auxiliary transformation

ρ : R²_* → R^m.
(3.7)

This auxiliary transformation is essentially a high-dimensional embedding and may contain learnable parameters. Nonetheless, our main focus is to learn a hyperbolic representation and, therefore, we assume that ρ is not learnable. Later in this section, we analyze conditions on ρ that guarantee the stability and expressiveness of the hyperbolic representation.

The second step is to transform the embedded points from the ambient space to the Poincare ball. When referring to points on a manifold, it is important to define a coordinate system. A homeomorphism ψ : B → R^m is called a coordinate chart and gives the local coordinates on the manifold. The inverse map φ : R^m → B is called a parameterization of B and gives the ambient coordinates. The main idea is to inject learnable parameters into this parameterization. The injected parameters could take any differentiable functional form that preserves the homeomorphic property. Differentiability is needed so that our representation can be fed to downstream optimization methods. In our construction, we utilize a variant of the generalized spherical coordinates. Let θ ∈ R^m be a vector of m parameters. We define the learnable parameterization φ : R^m × R^m → B as follows

$$y_1 = \frac{1}{2}\left(1 + \frac{2}{\pi}\arctan(\theta_1 r_1)\right) \quad \text{and} \quad y_i = \theta_i + \arccos\left(\frac{x_{i-1}}{r_{i-1}}\right), \quad i = 2, 3, \ldots, m, \quad (3.8)$$

where r_i² = x_m² + ⋯ + x_{i+1}² + x_i² + ε. The small positive constant ε is added to ensure that the denominator in Eq. 3.8 is not zero. Intuitively, Eq. 3.8 corresponds to scaling the radius of the point by a factor θ_1 and rotating it by θ_i radians across the angular axes. The scaling and rotation parameters are learned during training. Note that the form of y_1 ensures that the representation belongs in the unit ball for all values of θ_1. The coordinate chart is not explicitly used in our representation; it is provided in the Appendix for the sake of completeness.

¹The sublevel set function f and the homology dimension n are omitted.
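As an illustration, the parameterization of Eq. 3.8 can be sketched in NumPy. This is a minimal sketch under stated assumptions: the radial term is read as mapping into (0, 1), the conversion from generalized spherical back to ambient coordinates follows the standard convention, and the function name `phi` and the value of `eps` are ours.

```python
import numpy as np

def phi(x, theta, eps=1e-6):
    """Sketch of the learnable parameterization of Eq. 3.8: map a point
    x in R^m to the Poincare ball, with learnable scale/rotation theta."""
    m = x.shape[0]
    # r_i^2 = x_m^2 + ... + x_i^2 + eps (tail-cumulative squared norms)
    r = np.sqrt(np.cumsum((x ** 2)[::-1])[::-1] + eps)
    y = np.empty(m)
    # radial term: lies in (0, 1) for every value of theta_1
    y[0] = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(theta[0] * r[0]))
    # angular terms: rotate by theta_i across the angular axes
    for i in range(1, m):
        y[i] = theta[i] + np.arccos(np.clip(x[i - 1] / r[i - 1], -1.0, 1.0))
    # convert (radius, angles) back to ambient coordinates of the ball
    p = np.empty(m)
    s = y[0]
    for i in range(m - 1):
        p[i] = s * np.cos(y[i + 1])
        s *= np.sin(y[i + 1])
    p[m - 1] = s
    return p
```

Since the norm of the returned point equals the radius y_1 < 1, the output lies strictly inside the unit ball, matching the remark following Eq. 3.8.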
The third step is to combine the representations of each individual point of the persistence diagram into a single point in the hyperbolic space. Typically, in Euclidean spaces, this is done by concatenating or adding the corresponding representations. However, in non-Euclidean spaces such operations are not manifold-preserving. Therefore, we transform the points from the manifold to the tangent space, combine the vectors via standard vectorial addition and transform the resulting vector back to the manifold. This approach is based on the exponential and logarithmic maps

$$\exp_x : T_x B \to B \quad \text{and} \quad \log_x : B \to T_x B. \quad (3.9)$$

The exponential map allows us to transform a vector from the tangent space to the manifold, and its inverse (i.e., the logarithmic map) from the manifold to the tangent space. For a general manifold, it is hard to find these maps, as we need to solve for the minimal geodesic curve (see Appendix for more details). Fortunately, for the Poincare ball case, they have analytical expressions, given as follows

$$\exp_x(v) = x \oplus \left(\tanh\left(\frac{\lambda_x \|v\|}{2}\right)\frac{v}{\|v\|}\right), \qquad \log_x(y) = \frac{2}{\lambda_x}\tanh^{-1}\big(\|{-x} \oplus y\|\big)\frac{-x \oplus y}{\|{-x} \oplus y\|}, \quad (3.10)$$

where ⊕ denotes the Möbius addition, which is a manifold-preserving operator (i.e., for any x, y ∈ B, x ⊕ y ∈ B). Its analytical expression is given in the Appendix. The transformations given by these maps are norm-preserving: for example, the geodesic distance from x to the transformed point exp_x(v) coincides with the metric norm ‖v‖_g induced by the metric tensor g^B_x. This is an important property, as we need the distance between points (and therefore the relative importance of topological features) to be preserved when transforming to and from the tangent space. We now combine the aforementioned transformations and define the Poincare hyperbolic representation, followed by its stability theorem.

Definition 3.3 (Poincare Representation).
Let D ∈ D be the persistence diagram to be represented in an m-dimensional Poincare ball (B, g^B_x) embedded in R^m and let x_0 ∈ B be a given point. The representation of D on the manifold B is defined as follows

$$\Phi : \mathcal{D} \times \mathbb{R}^m \to B, \qquad \Phi(D, \theta) = \exp_{x_0}\Big( \sum_{x \in D} \log_{x_0}\big(\phi(\rho(x))\big) \Big), \quad (3.11)$$

where the exponential and logarithmic maps are given by Eq. 3.10 and the learnable parameterization and the auxiliary transformation by Eq. 3.8 and Eq. 3.7, respectively.

Theorem 3.1 (Stability of Hyperbolic Representation). Let D, E be two persistence diagrams and consider an auxiliary transformation ρ : R²_* → R^m that is
• Lipschitz continuous w.r.t. the induced metric norm ‖·‖_g,
• ρ(x) = 0 for all x ∈ R_Δ.
Additionally, assume that x_0 = 0. Then, the hyperbolic representation given by Eq. 3.11 is stable w.r.t. the Wasserstein distance when p = 1, i.e., there exists a constant K > 0 such that

$$d_B(\Phi(D, \theta), \Phi(E, \theta)) \le K\, w_1^g(D, E), \quad (3.12)$$

where d_B is the geodesic distance and w_1^g is the Wasserstein metric with the q-norm replaced by the induced norm ‖·‖_g (i.e., the norm induced by the metric tensor g, see Appendix B.2).

The proof of Theorem 3.1 (given in the Appendix) follows from a general stability theorem [56] and is on par with similar results for other vectorizations [63] or representations [68] of persistence diagrams. One subtle difference is that Theorem 3.1 uses the induced norm rather than the q-norm appearing in the Wasserstein distance. However, since the induced norm implicitly depends on the chosen point x_0, which, per the requirements of Theorem 3.1, is assumed to be equal to the origin, there is no substantial difference. The requirement that the auxiliary transformation ρ be zero on the diagonal is important for theoretically guaranteeing stability. Intuitively, this can be understood by recalling (Def. 3.1) that all (infinitely many) points on the diagonal are included in the persistence diagram. By mapping the diagonal to zero and taking x_0 = 0, we ensure that the summation in Eq.
3.11 collapses to zero when summing over the diagonal.

Finally, we note that the assumptions of Theorem 3.1 are not restrictive. In fact, we can easily find Lipschitz continuous transformations that are zero on the diagonal R_Δ, such as the exponential and rational transformations proposed by Hofer et al. [68]. Additionally, we note that the assumptions of Theorem 3.1 do not prohibit us from choosing an "under-powered" or degenerate ρ. For example, ρ = 0 satisfies the assumptions and therefore leads to a stable representation. However, such a representation is obviously not useful for learning tasks. An implicit requirement, which guarantees not only the stability but also the expressiveness of the resulting representation, is that ρ does not result in any information loss. This requirement is satisfied by picking a ρ that is injective, which, given that it is a higher-dimensional embedding, is a condition easy to satisfy. In practice, we use a variant of the exponential transformation by Hofer et al. [68]. The exact expression is given in the Appendix.

Figure 3.2: Left: Example graph from the IMDB-BINARY dataset. Middle: Persistence diagrams extracted using the Vietoris-Rips filtration. The dashed line denotes features of infinite persistence, which are represented by points of maximal death value equal to 90 (i.e., by points of finite persistence). Right: Equivalent representation on the 2-dimensional Poincare ball. Features of infinite persistence are mapped infinitesimally close to the boundary. Therefore, their distance to finite-persistence features approaches infinity (d ∼ ε^{-2}).

3.4 Case Studies and Results

We present experiments on diverse datasets focusing on persistence diagrams extracted from graphs and grey-scale images. The learning task is classification.
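Before describing the experimental setup, the end-to-end representation of Eq. 3.11 can be sketched in NumPy. The sketch fixes x_0 = 0, where λ_{x_0} = 2 and the Möbius addition with the origin drops out of Eq. 3.10; `rho` and `phi` below are toy stand-ins (our names, not the thesis's exact transformations) for Eq. 3.7 and Eq. 3.8, chosen only so that `rho` vanishes on the diagonal and `phi` lands in the open unit ball.

```python
import numpy as np

def exp0(v, eps=1e-9):
    """Exponential map at the origin x0 = 0 (Eq. 3.10 with lambda_0 = 2)."""
    n = np.linalg.norm(v) + eps
    return np.tanh(n) * v / n

def log0(y, eps=1e-9):
    """Logarithmic map at the origin (inverse of exp0)."""
    n = np.linalg.norm(y) + eps
    return np.arctanh(np.clip(n, 0.0, 1.0 - 1e-9)) * y / n

def rho(x):
    """Toy auxiliary transformation R^2_* -> R^3: zero on the diagonal
    (birth == death), as Theorem 3.1 requires."""
    b, d = x
    return (d - b) * np.array([1.0, b, d])

def phi(z, theta):
    """Stand-in for the learnable parameterization of Eq. 3.8: shift by
    the parameters theta, then squash into the open unit ball."""
    v = z + theta
    return v / (1.0 + np.linalg.norm(v))

def represent(diagram, theta):
    """Poincare representation Phi (Eq. 3.11): embed each point, pull the
    embeddings to the tangent space at the origin, add them, and map back."""
    tangent = sum(log0(phi(rho(x), theta)) for x in diagram)
    return exp0(tangent)
```

The round trip exp0(log0(y)) recovers y, and a diagram supported only on the diagonal maps back to the base point, which is exactly the collapse argument used in the discussion of Theorem 3.1.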
Our representation acts as an input to a neural network and the parameters are learned end-to-end via standard gradient methods. The architecture as well as other training details are discussed in the Appendix. The code to reproduce our experiments is publicly available at https://github.com/pkyriakis/permanifold/.

Ablation Study: To highlight to what extent our results are driven by the hyperbolic embedding, we perform an ablation study. In more detail, we consider three variants of our method:

1. Persistent Poincare (P-Poinc): This is the original method as presented in Sec. 3.3.
2. Persistent Hybrid (P-Hybrid): Same as P-Poinc with the Poincare ball replaced by the Euclidean space. This implies that the exponential and logarithmic maps (Eq. 3.10) reduce to the identity maps, i.e., exp_x(v) = x + v and log_x(y) = y − x. The learnable parameterization is as in Eq. 3.8.
3. Persistent Euclidean (P-Eucl): Same as P-Hybrid with Eq. 3.8 replaced with simple addition of the learnable parameters, i.e., y = x + θ.

Baseline - Essential Features Separation: To highlight the benefit of a unified Poincare representation, we design a natural baseline that treats essential and non-essential features separately. In more detail, for each point (b, d) ∈ D, we calculate its persistence d − b and then compute the histogram of the resulting persistence values. For essential features, we compute the histogram of their birth times. Then, we concatenate those histograms and feed them as input to the neural network (architecture described in the Appendix). We consider the case where the essential features are included (baseline w/ essential) and the case where they are discarded (baseline w/o essential).

Manifold Dimension and Projection Bases: Since our method essentially represents each persistence diagram on an m-dimensional Poincare ball, it may introduce substantial information compression when the number of points in the diagrams is not of the same order as m.
A trivial approach to counteract this issue is to use a high value for m. However, we experimentally observed that a high manifold dimension does not give the optimal classification performance and adds a computational overhead in the construction of the computation graph. Empirically, the best approach is to keep m at moderate values (in the range m = 3 to m = 12), replicate the representation K times and concatenate the outputs. Each replica is called a projection base, and for their number we explored values dependent on the number of points in the persistence diagram. Persistence diagrams obtained from images tend to have substantially fewer points than diagrams obtained from graphs. Therefore, for images, we explored moderate values for K, i.e., 5-10, whereas for graphs we increased K in the range 200-500. Essentially, we treat both m and K as hyper-parameters, explore their space following the aforementioned empirical rules and pick the optimal values via the validation dataset. As we increase m, it is usually prudent to decrease K to maintain similar model capacity.

3.4.1 Graph Classification

In this experiment, we consider the problem of graph classification. We evaluate our approach using social graphs from [79]. The REDDIT-BINARY dataset contains 1000 samples and the graphs correspond to online discussion threads. The task is to identify to which community a given graph belongs. The REDDIT-5K and the REDDIT-12K are larger variants of the former dataset and contain 5K and ∼12K graphs from 5 and 11 subreddits, respectively. The task is to predict to which subreddit a given discussion graph belongs. The IMDB-BINARY dataset contains 1000 ego-networks of actors that have appeared together in any movie, and the task is to identify the genre (action or romance) to which an ego-graph belongs. Finally, the IMDB-MULTI dataset contains 1500 ego-networks belonging to 3 genres (action, romance, sci-fi). We train our model using 10-fold cross-validation with an 80/20 split.
Graphs are special cases of simplicial complexes; therefore, defining filtrations is straightforward. We use two methods. The first one captures global topological properties and is based on shortest paths. In this case, the graph G = (V, E) is lifted to a metric space (V, d) using the shortest path distance d : V × V → R_{≥0} between two vertices, which can be easily proved to be a valid metric. Then, we define the Vietoris-Rips complex of G as the filtered complex VR_s(G) that contains a subset of V as a simplex if all pairwise distances in that subset are less than or equal to s, or, formally,

$$VR_s(G) = \{(v_0, v_1, \ldots, v_m) : d(v_i, v_j) \le s, \ \forall i, j\}. \quad (3.13)$$

This approach essentially interprets the vertices of the graph as a point cloud in a metric space, with the distance between points given by the shortest path between the corresponding vertices. In the case of unweighted graphs, we assign unit weight to all edges. In Fig. 3.2 we show a sample persistence diagram extracted with the Vietoris-Rips filtration and the corresponding representation of the features on the Poincare ball.

Table 3.1: Classification accuracy (mean ± std or min-max range, if available).

                   IMDB-M        IMDB-B        REDDIT-B      REDDIT-5K     REDDIT-12K
baseline w/o ess.  38.43 ± 0.98  65.78 ± 1.25  67.33 ± 1.55  39.30 ± 1.12  28.35 ± 1.11
baseline w/ ess.   38.78 ± 0.85  66.56 ± 1.13  69.33 ± 0.75  38.45 ± 0.98  30.43 ± 0.75
WL                 49.33 ± 4.75  73.40 ± 4.63  81.10 ± 1.90  49.44 ± 2.36  38.18 ± 1.31
DGK                44.5 ± 0.52   66.96 ± 0.56  78.04 ± 0.39  41.27 ± 0.18  32.22 ± 0.10
PSCN               45.23 ± 2.84  71.00 ± 2.29  86.30 ± 1.58  49.10 ± 0.70  41.32 ± 0.32
WKPI               49.5 ± 0.40   75.10 ± 1.10  n/a           59.50 ± 0.60  48.40 ± 0.50
AWE                51.54 ± 3.61  74.45 ± 5.83  87.89 ± 2.53  54.74 ± 1.91  39.20 ± 2.0
GIN-GFL            49.70 ± 2.9   74.50 ± 4.60  90.30 ± 2.60  55.70 ± 2.90  n/a
PersLay            48.8-52.2     71.2-72.6     n/a           55.6-56.5     47.7-49.1
GLR                n/a           n/a           n/a           54.5          44.5
P-Eucl             46.45 ± 4.03  67.54 ± 3.54  71.45 ± 2.98  43.15 ± 3.12  32.56 ± 3.68
P-Hybrid           47.87 ± 2.03  72.48 ± 4.57  72.87 ± 1.75  46.85 ± 2.17  36.42 ± 4.08
P-Poinc            57.31 ± 4.27  81.86 ± 4.26  79.78 ± 3.21  51.71 ± 3.01  42.16 ± 3.45

The second method captures local topological properties and is based on the vertex degree. Given a graph G = (V, E), a simplicial complex can be defined as the union of the vertex and edge sets, i.e., K = K_0 ∪ K_1, where K_0 = {v : v ∈ V} and K_1 = {(u, v) : (u, v) ∈ E}. The sublevel function is defined as follows: f(v) = deg(v) for v ∈ K_0 and f(u, v) = max{f(u), f(v)} for (u, v) ∈ K_1, where deg(v) is the degree of vertex v. Then, the filtration is given by Eq. 3.2.

We compare our method against several state-of-the-art methods for graph classification. In particular, we compare against: (1) the Weisfeiler-Lehman (WL) graph kernel [80], (2) the Deep Graph Kernel (DGK) [79], (3) the Patchy-SAN (PSCN) [81], (4) the Weighted Persistence Image Kernel (WKPI) [82], (5) the Anonymous Walk Embeddings (AWE) [83], and (6) the Graph Isomorphism Network with Graph Filtration Learning (GIN-GFL) [84]. Additionally, we compare against the PersLay input layer by Carrière et al.
[70], which utilizes extended persistence as an alternative way to deal with essential features, and the Learnable Representation based on Gaussian-like structure elements (GLR) presented by Hofer et al. [68].

We run simulations for different manifold dimensions (ranging from m = 2 to m = 12) and pick the best one using the mean (across all 10 folds) cross-validation accuracy as criterion. We report the mean and standard deviation for the best m in Table 3.1. We observe that the performance of the P-Eucl method is poor, but the addition of the learnable parameterization (i.e., the P-Hybrid method) leads to small but consistent improvements. The best performance is achieved by the fully hyperbolic representation (P-Poinc) and is on par with or exceeds the performance of the state of the art. This suggests that the hyperbolic representation is vital for obtaining good performance. Additionally, the baselines have poor performance irrespective of whether or not essential features have been included via the histogram of their birth times. This indicates that treating essential features separately and including them as additional inputs is not sufficient for good performance. These results support our initial motivation for representing persistence diagrams on the Poincare ball. Also, observe that our method outperforms all the competitor methods on the IMDB datasets, which have a small number of nodes and edges (approx. 10-20 nodes and 60-90 edges), while the REDDIT ones are bigger networks (approx. 400-500 nodes and 500-600 edges). This may indicate that the Poincare representation is best suited for smaller, less complex graphs.

3.4.2 Image Classification

Even though it is well known that convolutional neural networks have achieved unprecedented success as feature extractors for images, we present an image classification case study using topological features as a proof-of-concept for our method.
Contrary to graphs, images are not inherently equipped with the structure of a simplicial complex. In theory, we could construct Vietoris-Rips complexes, as in Eq. 3.13, by interpreting pixels as a point cloud. However, this is not the most natural representation of an image, which has a grid structure. Therefore, we exploit this structure by constructing cubical complexes, i.e., unions of cubes aligned on a 2D grid. As in the case of graphs, we use two methods. For the first method, called cubical filtration, we use the grey-scale image directly and represent each pixel as a 2-cube. Then, all of its faces are added to the complex K. We obtain a sublevel function by extending the grey-scale value I(v) of a pixel v to all cubes in K as follows:

$$f(\sigma) = \min_{\sigma \text{ face of } \tau} I(\tau), \qquad \sigma \in K. \quad (3.14)$$

Finally, a grey-scale filtration is obtained using Eq. 3.2.

In this experiment, we utilize two standardized datasets: the MNIST, which contains images of handwritten digits, and the Fashion-MNIST, which contains shape images of different types of garment (e.g., T-shirt, trouser, etc.). Each dataset contains a total of 70K (60K train, 10K validation, 10 folds) grey-scale images of size 28×28. Both datasets are balanced and categorized into 10 classes.

Table 3.2: Classification accuracy.

           MNIST   F-MNIST
PWGK       75.31   28.13
PI         94.39   31.26
GLR        93.31   56.85
P-Eucl     92.27   30.21
P-Hybrid   93.78   38.45
P-Poinc    95.91   72.28

Figure 3.3: Plotting the train loss (left) and the validation accuracy (middle) over 10 training epochs for the MNIST dataset using two different dimensions for the Poincare ball (m = 3 and m = 9).

The second method is called height filtration and uses the binarized version of the original image.
We define the height filtration by choosing a direction v ∈ R² of unit norm and by assigning to each pixel p of value 1 in the binarized image a new value equal to ⟨p, v⟩, i.e., the distance of pixel p to the plane defined by v. This creates a new grey-scale image, which is then fed to the aforementioned cubical filtration. We note that the height filtration deserves special attention in persistent homology because complexes to which it is applied can be approximately reconstructed from their persistence diagrams [57]. In practice, we use direction vectors v that are distributed uniformly across the unit circle. We choose 30 and 50 directions for the MNIST and Fashion-MNIST, respectively.

We compare our method against three baselines that encompass all classes of methods for handling persistence diagrams: (1) the Persistence Weighted Gaussian Kernel (PWGK) approach proposed by Kusano et al. [58], (2) the Persistence Images (PI) embedding developed by Adams et al. [63], and (3) the Learnable Representation based on Gaussian-like structure elements (GLR) by Hofer et al. [68]. We run our simulations for manifold dimensions equal to m = 3 and m = 9 and report the best results in Table 3.2. Our method outperforms all other methods, in some cases by a considerable margin. We also study how the manifold dimension affects the train loss and validation accuracy. The results are shown in Fig. 3.3. Observe that for m = 9 the train loss decreases more rapidly and the validation accuracy increases more rapidly in the first few epochs. This suggests that a higher manifold dimension may be slightly better for representing persistence diagrams. Additionally, we observed that the Poincare representation tends to generalize better than its Euclidean counterpart. In fact, the validation accuracy of the P-Eucl method started decreasing after the first 2-3 epochs, whereas the P-Poinc method showed a saturation rather than a decrease.
This was observed without modifying the dropout rate or any other hyper-parameters and is a strong empirical finding that demonstrates the superiority of Poincare representations.

3.5 Summary

We presented the first, to the best of our knowledge, method for learning representations of persistence diagrams in the Poincare ball. Our main motivation for introducing such a method is that persistence diagrams often contain topological features of infinite persistence (i.e., essential features), the representational capacity of which may be bottlenecked when representing them in Euclidean spaces. This stems from the fact that Euclidean spaces cannot assign infinite distance to finite points. The main benefit of using the Poincare space is that, by allowing the representations of essential features to get infinitesimally close to the boundary of the ball, their distance to non-essential features approaches infinity, therefore preserving their relative importance. Directions for future work include learning the filtration and/or scale end-to-end, as well as investigating whether or not there exists some hyperbolic trend in the distances appearing in persistence diagrams that justifies the improved performance, especially on small graphs.

Chapter 4

Learning Hierarchical Features

4.1 Introduction

In this chapter, we continue our exploration of novel representation learning methods, focusing on hierarchical features. Without loss of generality, we restrict our exposition to a specific CPS application in which hierarchical features are highly pervasive. The application of interest is neural decoding, i.e., decoding visual imagery from brain recordings, which is a key problem in neuroscience.
Functional magnetic resonance imaging (fMRI) can be used to decode the perception and schematic information of the cerebral cortex in a non-invasive manner [85]. fMRI monitors the activity state of the cerebral cortex through the blood-oxygen-level-dependent (BOLD) variation induced by brain neural activity. The fMRI spatial resolution is high enough and the recorded noise low enough to allow the development of machine learning pipelines that map voxel-based recordings to the corresponding visual stimuli. Based on the target learning task, visual decoding can be categorized into stimuli classification, stimuli identification, and stimuli reconstruction. The former two tasks aim to predict the object category of the presented stimulus or identify the stimulus from an ensemble of possible stimuli. The reconstruction task, which is the most challenging one and the main focus of this chapter, aims to construct a replica of the presented stimulus image from the fMRI recordings.

Figure 4.1: Brain Encoding and Decoding

Related Work. The proposed methods for the problem of neural imagery decoding can be broadly classified into three categories: non-deep learning methods, non-generative deep learning methods and generative deep learning methods. In the non-deep learning class belong methods that are based on primitive linear models and aim at reconstructing low-level image features [86]. Such approaches first extract handcrafted features from natural images, such as multi-scale image bases [87] or Gabor filters [88], and then learn a linear mapping from the fMRI voxel space to the extracted features. Due to their simplicity, linear models are not able to reconstruct complex natural images and thus their applicability is restricted to simple images containing only low-level features. Methods that use convolutional neural networks as well as encoder-decoder architectures belong in the non-generative deep learning class. Horikawa et al.
[89] demonstrated a homology between human and machine vision by designing an architecture with which the features extracted from convolutional neural networks can be predicted from fMRI signals. Based upon those findings, Shen et al. [90] used a pretrained VGG-19 model to extract hierarchical features from stimuli images and learned a mapping from the fMRI voxels in the low/high area to the corresponding low/high VGG-19 features. Beliy et al. [91] designed a CNN-based encoder-decoder architecture, where the encoder learns a mapping from the stimulus images to the fMRI voxels and the decoder learns the reverse mapping. By stacking the components back-to-back, the authors introduced two combined networks, E-D and D-E, whose inputs and outputs are natural images and fMRI recordings, respectively. This allowed them to train their networks using self-supervision, thereby addressing the inherent scarcity of fMRI-image pairs. Following up on that work, Gaziv et al. [92] improved the reconstruction quality by training on a perceptual similarity loss function, which is calculated by first extracting multi-layer features from both the original and reconstructed images and comparing the extracted features layer-wise. Such a perceptual loss is known to be highly effective in assessing image similarity and accounts for many nuances in human vision [93]. In the generative deep learning class we have model architectures, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), which learn a probability distribution over the input data. Shen et al. [90] extended their original method to make the reconstructions look more natural by conditioning the reconstructed images to be in the subspace of the images generated by a GAN. A similar GAN prior was used by Yves et al. in [94], where the authors also introduced unsupervised training on natural images. Fang et al.
[95] leverage the hierarchical structure of the information processing in the visual cortex to propose two decoders, which extract information from the low and high visual cortex areas, respectively. The output of those decoders is used as a conditioning variable in a GAN-based architecture. Shen et al. [96] trained a GAN using a modified loss function that includes an image-space and perceptual loss in addition to the standard adversarial loss. A line of work by Seeliger et al. [97], Mozafari et al. [98] and Qiao et al. [99] assumes that there exists a linear relationship between the brain activity and the GAN latent space. These methods use the GAN as a natural image prior to ensure that the reconstructed image has some "naturalness" properties. The work by VanRullen et al. [100] and Ren et al. [101] utilizes VAE-GANs [102], a hybrid model in which the VAE decoder and GAN generator are combined into one. In the former work, the authors use the VAE to extract meaningful representations of the data and learn a linear mapping between the latent vector and the fMRI patterns. In the latter work, the authors propose a dual-VAE architecture where both the natural images and fMRI voxels are converted into latent representations, which are then fed as conditioning variables to a GAN.

Contributions. In this chapter, we propose a novel architecture for the problem of decoding visual imagery from fMRI recordings. Motivated by the fact that the visual pathway in the human brain processes stimuli in a hierarchical manner, we postulate that such a hierarchy can be captured by the latent space of a deep generative model. More specifically, we use Hierarchical Variational Autoencoders (HVAE) in order to learn meaningful representations of stimuli images, and we train an ensemble of deep neural networks to learn mappings from the voxel space to the HVAE latent spaces.
Voxels originating from the early stages of the visual pathway (V1, V2, V3) are mapped to the earlier layers of latent variables, whereas the higher visual cortex areas (LOC, PPA, FFA) are mapped to the later stages of the latent hierarchy. Our architecture replicates the natural hierarchy of visual information processing in the latent space of a variational model. To the best of our knowledge, this is the first approach to use HVAEs in the context of neural decoding.

4.2 Visual Information Processing

In this section we give a brief overview of visual information processing in the human brain and describe the two-streams hypothesis, which we use in our experimental architecture. Visual information received from the retina of the eye is interpreted and processed in the visual cortex. The visual cortex is located in the posterior part of the brain, at the occipital lobe, and it is divided into five distinct areas (V1 to V5) depending on the function and structure of the area. Visual stimuli received from the retina travel to the lateral geniculate nucleus (LGN), located near the thalamus. The LGN is a multi-layered structure that receives input directly from both retinas and sends axons to the primary visual cortex (V1). V1 is the first and main area of the visual cortex where visual information is received, segmented, and integrated into other regions of the visual cortex.

Figure 4.2: Two-streams hypothesis of visual information processing in the human brain.

Based on the two-streams hypothesis [103], following V1, visual stimuli can take the dorsal pathway or the ventral pathway. The dorsal pathway consists of the secondary visual cortex (V2), the third visual cortex (V3), and the fifth visual cortex (V5). The dorsal stream, informally known as the "where" stream, is responsible for visually-guided behaviors and localizing objects in space. The ventral stream, also known as the "what" stream, consists of the V2 and fourth visual cortex (V4) areas
and is responsible for processing information for visual recognition and perception. Visual processing occurs hierarchically at three distinct levels. The low level includes the retina, the lateral geniculate nuclei (LGN), and the primary visual cortex (V1). Low-level processing is the initial step when interpreting an image, and it is the place where orientation, edges, and lines are perceived. Sequentially, mid-level processing consists of the secondary (V2), third (V3) and fourth (V4) visual cortices, which extract shapes, objects and colors. Finally, high-level processing consists of category-selective areas such as the fusiform face area (FFA), lateral occipital complex (LOC), parahippocampal area (PPA) and medial temporal area (MT/V5). These areas show selective responses to faces, objects/animals, places and motion, respectively.

Despite the evident hierarchical structure of visual information processing, most current methods for neural decoding fail to fully exploit that fact. Current methods take into account the hierarchy of visual information processing either by mapping the fMRI voxels to hierarchical CNN-extracted image features via regression models [90, 104] or by training an end-to-end DNN model on a feature loss function [94, 96]. The major issue with such approaches is that the hierarchy is taken into account in the feature space of a CNN model, which is, in general, a complex, high-dimensional space. In this work, we propose to take into account the aforementioned hierarchy in the latent space of a deep model. Latent spaces are known to produce compact, low-dimensional embeddings of the data and, therefore, are more appropriate for encoding the stimuli images. The intuition behind our method is that each brain area, being "responsible" for a certain set of features, is best mapped onto a compact, low-dimensional representation of those features.
For example, given that V1 is broadly responsible for encoding low-level features (e.g., edges and orientations), it is sensible to map the fMRI voxels from the V1 region onto a representation of the underlying low-level image features; and this mapping is much easier to learn in the latent space than in the feature space. 4.3 Method Leveraging the aforementioned intuition, we introduce a neural decoding method that mimics hierarchical visual information processing in the latent space. Our architecture has two main components: a Hierarchical Variational Autoencoder (HVAE) and a Neural Decoder. The HVAE is used for learning compact, hierarchical latent representations of natural images and is trained using self-supervision. The Neural Decoder is used for mapping the brain signals to the HVAE hierarchical latent space and is trained via supervision on {fMRI, Image} pairs. In this section, we describe each of the components in more detail. Our architecture is visualized in Fig. 4.3 for the special case of 2 hierarchical latent layers. 4.3.1 Hierarchical Variational Autoencoders To capture the inherent hierarchical structure of visual information processing, we propose to model natural images via a family of probabilistic models known as Hierarchical Variational Autoencoders (HVAEs). HVAEs extend the basic Variational Autoencoder (VAE) by introducing a hierarchy of latent variables. Formally, let $x$ be a natural image and $z = \{z_1, z_2, \dots, z_L\}$ be a set of $L$ latent variables. The generative distribution or decoder is Figure 4.3: Outline of our method: a) We pretrain a Hierarchical Variational Autoencoder on a large set of natural images. Two layers of latent variables $z_1, z_2$ are inserted after each encoder (EB) and decoder (DB) block. b) We train the Neural Decoder by discarding the encoder from the previous step and learning a map from the fMRI voxels to the hierarchical latent space.
The lower visual cortex (V1, V2, V3) is mapped to $z_1$ and the higher visual cortex (FFA, PPA, LOC) to $z_2$. defined as $p_\theta(x|z) = p_\theta(x|z_1) \prod_{i=1}^{L-1} p_\theta(z_{i+1}|z_i)$ and is parametrized by $\theta$. The prior distribution is defined as $p(z) = p(z_1) \prod_{i=1}^{L-1} p(z_{i+1}|z_i)$. The posterior $p(z|x)$ is approximated by the variational distribution or encoder $q_\phi(z|x) = q_\phi(z_1|x) \prod_{i=1}^{L-1} q_\phi(z_{i+1}|z_i)$, which is parametrized by $\phi$. Both the prior and the approximate posterior are represented by factorial Normal distributions. The variational principle provides a tractable lower bound on the log-likelihood, known as the Evidence Lower Bound (ELBO):

$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathcal{L}(\theta, \phi; x)$   (4.1)
$= -\mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$,   (4.2)

where KL is the Kullback-Leibler divergence. The encoder and decoder are implemented by deep neural networks and their parameters are jointly optimized using gradient descent on the ELBO criterion. As in standard VAEs, the reparametrization trick [105, 106] allows us to back-propagate the gradient through the stochastic sampling involved in the computation of Eq. 4.2. For the encoder part of our HVAE, we use a pretrained VGG-19 model [107]. This is a deep convolutional neural network with 16 convolutional layers and 3 fully connected layers. We use the weights from the model pretrained on ImageNet and discard the fully connected layers. We introduce latent variables by taking the output of a given convolutional layer, flattening it, passing it through a fully connected layer and, finally, through a variational layer that outputs the latent variable. This latent variable is re-sampled to avoid any dimension mismatch and rerouted back to the main block, where it is aggregated with the output of the convolutional layer. Depending on how many latent layers we would like to insert, their exact positions may vary.
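To make the training objective concrete, the following is a minimal sketch of a single-sample Monte Carlo estimate of the ELBO in Eq. 4.2 for a two-level hierarchy $q_\phi(z_1|x)\, q_\phi(z_2|z_1)$, using the reparametrization trick and factorial Normal distributions. It is NumPy-only and uses arbitrary callables (`enc1`, `enc2`, `dec`, `prior2`) as stand-ins for the VGG-based encoder and transposed-convolution decoder described above; all function names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I): keeps the sampling step differentiable
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    # KL(N(mu_q, var_q) || N(mu_p, var_p)) for factorial Normals, summed over dimensions
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo(x, enc1, enc2, dec, prior2):
    """Single-sample Monte Carlo ELBO for the hierarchy q(z1|x) q(z2|z1)."""
    mu1, lv1 = enc1(x)
    z1 = reparameterize(mu1, lv1)
    mu2, lv2 = enc2(z1)
    z2 = reparameterize(mu2, lv2)
    x_hat = dec(z1, z2)
    recon = -0.5 * np.sum((x - x_hat) ** 2)          # log p(x|z) up to an additive constant
    p_mu2, p_lv2 = prior2(z1)                        # conditional prior p(z2|z1)
    kl = kl_diag_gaussians(mu1, lv1, np.zeros_like(mu1), np.zeros_like(lv1)) \
       + kl_diag_gaussians(mu2, lv2, p_mu2, p_lv2)   # top prior p(z1) = N(0, I)
    return recon - kl
```

In an actual implementation the gradient of this quantity with respect to the encoder/decoder weights would be taken by an autodiff framework; the sketch only shows how the two KL terms and the reconstruction term of Eq. 4.2 combine.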
As an empirical design choice, we insert the latent layers equally spaced and after a convolutional block; a latent layer is always inserted at the output of the penultimate convolutional block. The decoder part of our HVAE transforms the hierarchical latent variables into output images and consists of 4 transposed convolutional layers. The numbers of decoder filters are [128, 64, 32, 16, 3] and all kernel sizes are set to 5. Each transposed convolutional layer is followed by 2d batch normalization and a ReLU non-linearity. The output of each transposed convolutional layer is interleaved with the latent variables. More specifically, each latent variable is first passed through a fully connected layer, re-sampled to avoid dimension mismatch, and then aggregated with the output of the corresponding transposed convolution. As in the encoder, we insert the latent variables so as to ensure symmetry, and we always insert the penultimate latent variable before the first transposed convolution. 4.3.2 Neural Decoder We now leverage the latent space of the HVAE to learn a set of maps from the fMRI voxel space to the hierarchical latent variables. In more detail, each region of interest (ROI) is mapped via a dense neural network to a specific subset of the latent space. Brain regions in the earlier stages of the visual pathway are mapped to the earlier layers of the latent hierarchy, whereas voxels from the higher visual cortex areas are mapped to the deeper layers of the latent hierarchy. We assume that the HVAE has $L$ groups of latent variables $z_1, z_2, \dots, z_L$ and that the fMRI voxels are partitioned into $L$ non-overlapping brain regions of choice, i.e., $y_1, y_2, \dots, y_L$. Formally, the Neural Decoder is a set of maps from the $i$-th brain region to the $i$-th group of latent variables. Each of these maps is represented by a deep neural network with parameters $w_i$, i.e., $z_i = \psi_{w_i}(y_i)$, $i = 1, 2, \dots, L$.
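The per-region maps $z_i = \psi_{w_i}(y_i)$ can be sketched as follows. This is a hypothetical NumPy illustration with made-up ROI voxel counts and latent sizes, and one-hidden-layer dense networks standing in for the deep networks used in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_mlp(d_in, d_hidden, d_out):
    """One-hidden-layer dense map psi_{w_i}; the two weight matrices play the role of w_i."""
    W1 = rng.standard_normal((d_in, d_hidden)) * 0.1
    W2 = rng.standard_normal((d_hidden, d_out)) * 0.1
    def psi(y):
        return np.maximum(y @ W1, 0.0) @ W2   # ReLU hidden layer, linear output
    return psi

# Hypothetical ROI voxel counts (the y_i) and latent group sizes (the z_i)
roi_dims = {"V1": 500, "V2": 400}
latent_dims = {"V1": 64, "V2": 32}

decoder_maps = {roi: make_mlp(d, 128, latent_dims[roi]) for roi, d in roi_dims.items()}

def neural_decode(voxels_by_roi):
    """Map each region's voxels to its latent group: z_i = psi_{w_i}(y_i)."""
    return [decoder_maps[roi](voxels_by_roi[roi]) for roi in roi_dims]
```

The returned list of latent vectors is what gets fed to the fixed HVAE decoder to produce the reconstructed image.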
The reconstruction $\hat{x}$ is then obtained by passing the latent variables $z = \{\psi_{w_1}(y_1), \psi_{w_2}(y_2), \dots, \psi_{w_L}(y_L)\}$ through the decoder model $p(x|z)$ defined in Sec. 4.3.1. The loss function used for training the Neural Decoder is an important design choice. Classic per-pixel measures, such as the Euclidean distance commonly used for regression problems, or the related Peak Signal-to-Noise Ratio (PSNR), are insufficient for images, as they assume pixel-wise independence. Therefore, to encourage the Neural Decoder to learn reconstructions guided by human visual perception, we use a perceptual loss. Perceptual losses are a class of loss functions that rely on the fact that CNNs extract hierarchical features. More specifically, deep features trained on supervised, self-supervised, and unsupervised objectives are an effective model of human visual perceptual similarity [93]. For a given image $x$ and its reconstruction $\hat{x}$, the perceptual loss is

$\ell(x, \hat{x}) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| b_l \odot (f^l_x - f^l_{\hat{x}}) \|_2^2$,   (4.3)

where $f^l_x, f^l_{\hat{x}}$ are the layer-wise activations of a given, pretrained CNN model and $b_l \in \mathbb{R}^{C_l}$ is a channel-wise scaling vector. Intuitively, the perceptual loss in Eq. 4.3 extracts features for both the target and the reconstructed image and then compares the features layer-wise using the Euclidean norm. To ensure that no bias is introduced during learning, it is important that the CNN used for evaluating Eq. 4.3 is different from the one used for the encoder. In our implementation we use a pretrained AlexNet as well as the code provided by Zhang et al. [93] to compute the perceptual loss. 4.3.3 Model Training We start the training process by first deciding the number and position of the latent layers. The choice is guided by the type of fMRI data that we have as well as the level of latent space coarse-graining that we can achieve.
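A minimal sketch of the layer-wise comparison in Eq. 4.3 is given below. It assumes the feature maps have already been extracted by a fixed CNN (AlexNet via the code of [93] in our case); the function name and argument layout are hypothetical.

```python
import numpy as np

def perceptual_loss(feats_x, feats_xhat, channel_weights):
    """Layer-wise perceptual distance in the style of Eq. 4.3.

    feats_x / feats_xhat: lists of activation maps, one per layer, shape (C_l, H_l, W_l)
    channel_weights:      list of per-channel scaling vectors b_l, shape (C_l,)
    """
    total = 0.0
    for fx, fxh, b in zip(feats_x, feats_xhat, channel_weights):
        c, h, w = fx.shape
        diff = b[:, None, None] * (fx - fxh)       # channel-wise scaling by b_l
        total += np.sum(diff ** 2) / (h * w)       # spatial average, accumulated over layers
    return total
```

Identical feature stacks yield a loss of exactly zero, and the loss grows with the (scaled) layer-wise feature discrepancy.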
For instance, if our fMRI data contain only the primary (V1) and secondary (V2) visual cortices, then we have two choices: a) we can consolidate all voxels into a single vector and have a single latent layer in our HVAE, or b) we can have two vectors containing the voxels from each brain area and train the HVAE so that it has two latent layers $z_1, z_2$ (example shown in Fig. 4.3). Naturally, if our fMRI data are finer-grained, we can add additional latent layers. Following this design choice, the training proceeds in two phases. In the first phase, we pretrain the HVAE via self-supervision using the ELBO loss function (Eq. 4.2) on a large ensemble of 50,000 natural images from the ImageNet database. This phase gives us meaningful latent representations and allows the HVAE decoder to adapt to the statistics of a large set of natural images. In the second phase, the HVAE encoder is discarded, the HVAE decoder is kept fixed, and the Neural Decoder is trained on supervised {fMRI, Image} pairs using the perceptual loss function (Eq. 4.3). In this phase, we essentially learn a map from the voxels of each brain area to the corresponding latent layer and then use that latent vector to reconstruct the image. 4.4 Experimental Results To evaluate the utility of our method in practice, we carry out a series of experiments. To measure the performance of our method, we use both qualitative comparisons of the reconstructions and quantitative metrics. In what follows, we give the details of the dataset used, the metrics implemented, and baseline comparisons. Dataset: We applied our pipeline to a commonly used, publicly available dataset known as Generic Object Decoding (GOD). The dataset consists of high-resolution (500×500) stimulus images and their corresponding fMRI recordings. There are 1250 (1200 train, 50 test) stimulus images selected from 200 object categories of the ImageNet database, and the fMRI recordings were obtained from 5 healthy subjects.
The train and test fMRI data consist of 1 and 35 (repeated) recordings per presented stimulus, respectively. We use the post-processed fMRI data provided by Horikawa et al. [89], which contain voxels from 7 brain areas (V1, V2, V3, V4, FFA, PPA, LOC). Ablation Study: We perform an ablation study, with the number of hierarchical layers and, consequently, the number of brain regions being the ablated parameter. Motivated by the two streams hypothesis (Sec. 4.2) for the neural processing of visual information, we consider the following variants:
1. Naive Baseline (NB): We consider only one latent layer $z_{NB}$, and all fMRI voxels are mapped to $z_{NB}$.
2. Primary-Secondary (PS): We consider 2 latent layers $z_{V1}, z_{V2}$, and the voxels from V1, V2 are mapped to the corresponding latent layers.
3. Dorsal Pathway (DP): We consider 3 latent layers $z_{V1}, z_{V2}, z_{V3}$, and the voxels from V1, V2, V3 are mapped to the corresponding latent layers.
4. Ventral Pathway (VP): We consider 4 latent layers $z_{V1}, z_{V2}, z_{V4}, z_{PF}$, and the voxels from V1, V2, V4, {FFA, PPA} are mapped to the corresponding latent layers. Note that the voxels from FFA and PPA are merged into a single area.
Metrics: The reconstruction quality is assessed both subjectively, i.e., by visual inspection of the output test images and comparison with the ground truth, and objectively. Our quantitative evaluation relies on metrics that encode spatial dependence, namely the Pearson Correlation Coefficient (PCC) and the Structural Similarity Index Measure (SSIM). Pearson Correlation Coefficient (PCC): This metric is extensively used in statistics to measure the linear dependence between variables. In the context of image similarity, PCC is computed on the flattened representations of the two images. The limitation of PCC is its sensitivity to edge intensity and misalignment, which makes the metric assign larger values to blurry images [91]. Structural Similarity Index Measure (SSIM): Wang et al.
proposed SSIM in [108] as a metric that quantifies characteristics of human vision. Given a pair of images $p, q$, SSIM is computed as a weighted combination of luminance, contrast, and structure. Assuming equal contribution of each component, SSIM is first computed locally in a common window of size $N \times N$, and the global SSIM is then obtained by averaging the SSIM over all non-overlapping windows. The image similarity metrics defined above are used to compute the correct identification rate in an $n$-way classification task. Let $M \in \{\mathrm{PCC}, \mathrm{SSIM}\}$ be a metric of choice, $\hat{p}_i$ be a reconstructed image, and $P_i$ be a set containing the ground truth $p_i$ and $n-1$ randomly selected target images. The Correct Identification Rate (CIR) is defined as follows:

$\mathrm{CIR}^n_M = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( p_i = \arg\max_{p_j \in P_i} M(\hat{p}_i, p_j) \right)$,   (4.4)

where $N$ is the total number of images and the indicator function $\mathbb{1}(\cdot)$ has the value 1 if its argument is true and 0 otherwise. The $\mathrm{CIR}^n_M$ metric is essentially the frequency at which a reconstructed image correctly identifies the ground truth among $n-1$ randomly selected additional images. The chance level is $1/n$. Main Results: We compare the performance of our method against several state-of-the-art (SOTA) methods for the problem of neural decoding. The competitor methods are: the encoder-decoder based self-supervised method by Beliy et al. [91], the end-to-end, GAN-based pipeline by Shen et al. [96], the GAN-conditioned method by Shen et al. [90], and the shape-semantic GAN by Fang et al. [95]. Figure 4.4 shows qualitative results and compares our method against the aforementioned competitors. All displayed images were reconstructed from the test fMRI dataset. To improve the signal-to-noise ratio, the test
We directly use the reconstructions reported by the authors in the respective papers. Our method tends to consistently produce more faithful reconstructions. Note that, even though the GAN-based decoders tend to produce more natural images, their reconstructions may deviate significantly from the stimulus image. This is because the GAN is introduced as a natural image prior, as noted by Beliy et al. [91]. On the contrary, our method reconstructs the stimuli more faithfully, albeit with the reconstructions appearing as noisier versions of the ground truth. The qualitative comparison thus highlights a trade-off between the naturalness of the reconstructed stimuli and the pixel-wise noise introduced in the reconstructions. To resolve the ambiguity, we perform an additional quantitative comparison using the $\mathrm{CIR}^n$ metric. For this part we compare against the method by Beliy et al. [91] as well as the two variants of the method by Shen et al. [90]. We compare directly against the results as reported by the authors of [91]. The results are shown in Fig. 4.5. Figure 4.5: Correct identification ratio for different methods. Figure 4.6: Qualitative comparison for different pathways. For our method, we report the correct identification rate obtained using the Ventral Pathway variant, averaged across the metrics ($\mathrm{CIR}^n_{PCC}$ and $\mathrm{CIR}^n_{SSIM}$). We observe that our method consistently outperforms the competitors and, particularly in the 5-way and 10-way cases, by a substantial margin. Additionally, our method shows only a small performance drop as we increase $n$, i.e., from 90% in the 2-way case to 79% in the 10-way case, whereas the performance loss for the competitor methods is substantially higher.
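The $n$-way identification protocol behind these numbers (Eq. 4.4) can be sketched as follows. This is a hypothetical NumPy implementation using PCC as the metric $M$, not the evaluation code used in the experiments.

```python
import numpy as np

def pcc(a, b):
    # Pearson correlation computed on the flattened image representations
    a, b = a.ravel(), b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

def correct_identification_rate(recons, truths, n_way, metric=pcc, seed=0):
    """CIR^n: how often a reconstruction picks its own ground truth out of n candidates."""
    rng = np.random.default_rng(seed)
    hits = 0
    for i, rec in enumerate(recons):
        distractors = [j for j in range(len(truths)) if j != i]
        cand_idx = [i] + list(rng.choice(distractors, size=n_way - 1, replace=False))
        scores = [metric(rec, truths[j]) for j in cand_idx]
        hits += int(np.argmax(scores) == 0)   # position 0 holds the true image
    return hits / len(recons)
```

Perfect reconstructions give a rate of 1.0, and random guessing converges to the chance level $1/n$.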
This performance is due to the following fact: even though our method gives noisier reconstructions than the competitors, high-level features such as color, texture, and shape are retained, and therefore the task of identifying the correct ground truth from the reconstruction is easier. In contrast, observe in Fig. 4.4 that the competitor methods may substantially alter the color or texture of the image, leading to more frequent ground-truth misidentification. In the next experiment, we evaluate the decoding performance of the different visual pathways. The results are shown in Fig. 4.6. Qualitatively, the ventral stream seems to produce the best reconstructions, which is expected from a neuroscience perspective, given that this pathway serves visual perception and contains high-level areas (FFA, PPA) for object recognition. The dorsal stream produces comparably good reconstruction results due to the pathway's function in localizing objects in space. Interestingly, even though the Naive Baseline contains all the available brain areas (V1, V2, V3, V4, FFA, PPA, LOC), its reconstruction quality is inferior, especially for the 2nd and 3rd images, which are far more complex. This is because the Naive Baseline uses all voxels to learn a single Neural Decoder map $z = \psi_w(y)$ and fails to take into account the fact that each brain area is "responsible" for specific image features. On the contrary, the remaining variants partition the brain areas into (progressively finer) segments and map the voxels from each area onto the hierarchical latent space of the HVAE decoder. This is strong evidence that the hierarchical structure of visual information processing needs to be explicitly taken into account when constructing neural decoding models, and it provides empirical validation for our method.
Also, observe that the reconstruction quality of the 1st and 4th stimulus images (which are less complex) does not differ much across the different pathways. Finally, the reconstruction of the last stimulus is considered a failure, as no pathway is able to replicate it faithfully. This may be because this stimulus image contains only very nuanced details (complex object shape, similar foreground and background colors), or because these details are encoded via the interaction between the ventral and dorsal streams or via the feedback signals sent between brain areas. The latter issue could perhaps be addressed by refining our model to take into account explicit latent space interdependences and feedback signals; this is a potential extension of our model and an interesting direction for future work. To further support our aforementioned claim, we present quantitative results in Table 4.1, which gives the $n$-way correct identification rate $\mathrm{CIR}^n$ for $n = 2, 5, 10$, for all ablated variants and for both metrics (PCC and SSIM). The results in this table validate the aforementioned qualitative observations.

      CIR^2_PCC  CIR^2_SSIM  CIR^5_PCC  CIR^5_SSIM  CIR^10_PCC  CIR^10_SSIM
  NB  0.77       0.78        0.64       0.66        0.57        0.58
  PS  0.80       0.82        0.72       0.73        0.65        0.67
  DP  0.88       0.90        0.81       0.80        0.75        0.75
  VP  0.91       0.92        0.84       0.85        0.79        0.79

Table 4.1: The n-way correct identification rate (n = 2, 5, 10) for all ablated variants, using the Pearson Correlation Coefficient (PCC) and the Structural Similarity Index Measure (SSIM) as the selection criterion. We report the mean across subjects. The inter-subject standard deviation was in the range 0.02-0.05. The chance levels are 0.5, 0.2, and 0.1, respectively. Figure 4.7: Learning curves for $\mathrm{CIR}^n_M$, $n = 2, 5, 10$ and $M \in \{\mathrm{PCC}, \mathrm{SSIM}\}$ across all subjects. The horizontal axis is the number of epochs. Subject 3 marginally outperforms the other subjects and Subject 1 gives the worst performance (figure best viewed in color).
The identification accuracy progressively increases as we partition the brain into finer areas and add hierarchical layers to the HVAE onto which the brain areas are mapped. Furthermore, we observe that SSIM gives a modest but consistent identification performance gain compared to PCC. We therefore empirically verify that SSIM should be the metric of choice when evaluating the reconstruction performance of neural decoding models, which validates the use of SSIM as a proxy for quantifying human visual perception. To verify that our method can be successfully applied to all subjects and to study potential inter-subject variation of the results, we show in Fig. 4.7 the learning curves for all 5 subjects and for all metrics $\mathrm{CIR}^n_M$, $n = 2, 5, 10$ and $M \in \{\mathrm{PCC}, \mathrm{SSIM}\}$. The metrics were calculated using the test samples and the Ventral Pathway variant. Even though the metrics appear similar across subjects, careful examination of the curves reveals some subtle discrepancies and trends. Subject 1 consistently performs approximately 5% worse across all metrics, whereas Subject 3 marginally outperforms the other subjects by 2%. The fact that Subject 3 gives the best reconstructions has been verified in previous studies [92] and is attributed to differences in the signal-to-noise ratio across subjects. Finally, Fig. 4.7 allows us to study how training progresses and to validate that no overfitting occurs. We observe that, in all cases, the metrics saturate at about 800 epochs, which gives us an empirical estimate of how many iterations our model needs to achieve good performance. 4.5 Conclusion We addressed the problem of neural decoding from fMRI recordings and proposed a novel architecture inspired by neuroscience.
More specifically, motivated by the fact that the human brain processes visual stimuli in a hierarchical fashion, we postulated that this structure can be captured by the latent space of a hierarchical variational autoencoder (HVAE). Our HVAE serves as a proxy for learning meaningful latent representations of stimulus images and can be pretrained on a large dataset of high-resolution images. Following that, we train our Neural Decoder to learn a map from the fMRI voxel space to the HVAE latent space. Our architecture replicates the visual information processing in the human brain in the sense that earlier visual cortex areas (e.g., the primary and secondary visual cortices) are mapped to the earlier latent layers, whereas voxels from the higher visual cortex (e.g., the PPA and FFA areas) are mapped to the later latent layers. We validated our approach using fMRI recordings from a visual presentation experiment involving 5 subjects and compared against other methods. Our work paves the way toward constructing better models to replicate human perception and toward understanding the nuances of human visual reconstruction, both of which could be utilized to better understand the brain, assist people with visual disabilities, and perhaps decode imagery during sleep. Chapter 5 Safe Control under Uncertainty 5.1 Introduction For the remainder of the thesis, we steer our attention to CPS control. We start by using our contributions from Chapter 2 to tackle the problem of safe control under uncertainty. More specifically, we focus on Learning from Demonstrations (LfD), an emerging paradigm for designing control policies [109]. In this paradigm, learning agents acquire new skills not by being programmed but by imitating a human expert. LfD (also known as imitation learning) is a vibrant research area, and several classes of methods have been developed, such as behavior cloning [110] and direct policy learning [111].
Inverse reinforcement learning (IRL) enables a learning agent to infer the reward function from the given demonstrations. An important difficulty is that the optimization problem is under-determined, as there may exist more than one reward function that explains a given set of demonstrations [112]. To address this issue, several approaches have been developed, such as feature expectation matching [113], Bayesian IRL [114], and Maximum Causal Entropy (MCE) IRL [115]. In a real-world setting, we often have to learn from imperfect or sub-optimal demonstrations, which may not be consistent with desired safety and performance specifications. To encode such specifications, several formal methods, such as Computation Tree Logic [116], Linear Temporal Logic [117], and Signal Temporal Logic [118], have been developed and used for expressing specifications in a variety of control and learning tasks [119, 120, 121, 122, 123]. In the realm of stochastic dynamical systems (e.g., Markov Decision Processes), Probabilistic Computation Tree Logic (PCTL) has been used to express safety specifications in the context of learning from demonstrations [124]. Recently, Signal Temporal Logic (STL) was extended to several probabilistic variants: Probabilistic STL (PrSTL) [22], Chance Constrained Temporal Logic (C2TL) [9], Stochastic STL (StSTL) [23], and Stochastic Temporal Logic (StTL) [24, 125]. Even though STL has been used in the context of learning from demonstrations [126, 120], its probabilistic extensions are yet to be explored. Contributions: In this chapter, we address the problem of learning from demonstrations subject to Stochastic Temporal Logic (StTL) specifications. Due to its popularity, we choose to extend the maximum causal entropy inverse reinforcement learning framework, and we propose a problem formulation whereby the StTL specification is imposed as a constraint in the optimization problem.
We show how to encode the StTL formula via a minimal set of mixed-integer linear constraints. We generate these constraints using cutting hyperplanes as approximations of the feasible regions of the atomic predicates and then recursively propagate these hyperplanes through the formula schematics. The benefits of our approach are two-fold: first, our encoding produces a minimal set of integer constraints, and second, by using cutting hyperplanes, we avoid having to propagate non-linear constraints. 5.2 Preliminaries Markov Decision Processes. We consider the framework of an infinite-horizon, discounted Markov Decision Process (MDP) $\mathcal{M} \stackrel{def}{=} (\mathcal{S}, \mathcal{A}, R, P, \rho, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the (finite) state and (finite) action spaces, $P$ is the transition kernel, $r(s,a) \equiv R$ is the reward, $\rho$ is an initial state distribution, and $\gamma \in (0,1)$ is a discount factor. A stationary control policy $\pi$ is a function that assigns a probability distribution over actions to each state. An MDP paired with a policy $\pi$ induces a Markov chain $S^\pi_t$, $t = 0, 1, \dots$ over the same state space. The performance of a given control policy is measured via the expected discounted reward, defined as $R(\pi) \stackrel{def}{=} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where the expectation is taken over trajectories sampled from the Markov chain $S^\pi_t$ induced by $\pi$. An optimal policy $\pi^*$ is a policy that maximizes the return, i.e., $\pi^* \in \arg\max_{\pi \in \Pi} R(\pi)$. A finite set of trajectories $\mathcal{D} = \{(s^i_0, a^i_0), (s^i_1, a^i_1), (s^i_2, a^i_2), \dots\}_{i=1,2,\dots}$ obtained by executing $\pi$ in $\mathcal{M}$ is called a set of demonstrations. 5.3 StTL-Constrained Maximum Causal Entropy In this section, we present the agent model for the StTL-constrained inverse reinforcement learning problem. Let $\mathcal{M} \setminus R$ be the reduced MDP (i.e., the MDP without the reward function), $\varphi$ an StTL formula that encodes safety and/or performance specifications, and $\mathcal{D}$ a set of demonstrations. The agent seeks to recover a policy $\pi$ that "best" imitates the given demonstrations and satisfies the specification.
One way to quantify the "quality of imitation" is to study the expected value of a reward feature vector $\phi : \mathcal{S} \times \mathcal{A} \to [0,1]^{d_\phi}$. For a policy $\pi$ and a discount factor $\gamma$, the feature expectation vector is defined as $f_r(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t)\right]$. Given a set of demonstrations $\mathcal{D} = \{(s^i_0, a^i_0), (s^i_1, a^i_1), (s^i_2, a^i_2), \dots\}_{i=1,2,\dots}$, the empirical counterpart of the feature expectation vector is given by $\hat{f}_r(\mathcal{D}) \stackrel{def}{=} \frac{1}{|\mathcal{D}|} \sum_{\mathcal{D}, t} \gamma^t \phi(s^i_t, a^i_t)$. The problem of inferring the reward and/or optimal policy is ill-posed because each policy can be optimal for many reward functions. The principle of Maximum Causal Entropy (MCE) [112, 115] resolves the ambiguity by maximizing the entropy of the distribution over trajectories subject to a feature expectation matching constraint, i.e., $f_r(\pi) = \hat{f}_r(\mathcal{D})$. Motivated by the MCE framework, we introduce the following problem:

$\max_\pi \; H_\beta(\pi) \stackrel{def}{=} \mathbb{E}\left[ -\sum_{t=0}^{\infty} \beta^t \log \pi(a_t | s_t) \right]$   (5.1a)
$f_r(\pi) = \hat{f}_r(\mathcal{D})$,   (5.1b)
$(S^\pi, 0) \models \varphi$,   (5.1c)

where $H_\beta(\pi)$ is the discounted causal entropy, $\beta \in (0,1)$ is a discount factor, and $S^\pi_t$ denotes the Markov chain induced by $\mathcal{M}$ and $\pi$. Constraints that ensure that $\pi(a_t|s_t)$ is a valid distribution are omitted for simplicity. While the original MCE framework [127] assumes that the reward is a linear function $w^T f(s,a)$ for some feature vector $f(s,a)$ and optimizes the weights $w$, we choose the variant introduced in [128], where the authors directly optimize the policy. We choose this variant as it is more convenient for imposing the StTL specification. Solving Problem 5.1 is far from trivial because the StTL specification (5.1c) is essentially a logic constraint. In what follows, we show how to encode the constraint (5.1c) via a minimal set of integer constraints, and in the next section we show how to ensure that these constraints are linear. To this end, we parametrize the policy via a vector $\theta \in \mathbb{R}^{d_\theta}$, i.e., $\pi(a|s) \stackrel{def}{=} \pi_\theta(a|s)$.
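The empirical feature expectation $\hat{f}_r(\mathcal{D})$ used in the matching constraint (5.1b) can be sketched as follows; this is a hypothetical NumPy helper where `phi` is an arbitrary feature map and each demonstration is a list of (state, action) pairs.

```python
import numpy as np

def empirical_feature_expectation(demos, phi, gamma):
    """Discounted empirical feature expectation of a demonstration set.

    demos: list of trajectories, each a list of (state, action) pairs
    phi:   feature map (state, action) -> np.ndarray in [0, 1]^d_phi
    gamma: discount factor in (0, 1)
    """
    total = None
    for traj in demos:
        for t, (s, a) in enumerate(traj):
            feat = (gamma ** t) * phi(s, a)          # gamma^t phi(s_t, a_t)
            total = feat if total is None else total + feat
    return total / len(demos)                        # average over the |D| trajectories
```

The learned policy's feature expectation $f_r(\pi)$ is then driven toward this vector via the Lagrangian relaxation introduced later in the chapter.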
Let $p_t(s)$ be the state distribution of the Markov chain $S^\pi_t$ at time $t$. We have

$p_t(s) = \sum_{s', a} p_{t-1}(s') P(s | s', a)\, \pi_\theta(a | s')$,   (5.2)

for $t = 1, 2, \dots$ and $p_0(s) = \rho(s)$. We start by encoding StTL predicates via a set of linear constraints and then recursively propagate these constraints through the StTL schematics to generate constraints for any formula. StTL Predicates. A predicate $\mu \stackrel{def}{=} \mathbb{P}(S^\pi_t \notin F) \leq \epsilon$ over $S^\pi_t$ is satisfied if and only if it holds that

$\sum_{s \notin F} p_t(s) = \sum_{s \notin F,\, s', a} p_{t-1}(s') P(s | s', a)\, \pi_\theta(a | s') \leq \epsilon$   (5.3)

for all times $t = 1, 2, \dots$. We define the following quantity:

$q_c(s', a) = \sum_{s \notin F} P(s | s', a)$.   (5.4)

By substituting Eq. (5.4) into Eq. (5.3), we obtain the following inequality:

$c^\mu_t(\theta) = \sum_{s', a} p_{t-1}(s')\, q_c(s', a)\, \pi_\theta(a | s') - \epsilon \leq 0$,   (5.5)

where the superscript $\mu$ denotes the dependence of the constraint on the predicate. Each predicate appearing in the specification $\varphi$ gives rise to such a constraint. We show how to approximate the constraint of Eq. (5.5) via cutting hyperplanes in the next section. For now, we assume that the predicate $\mu$ is encoded via the following set defined by a set of linear constraints:

$g^\mu_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : \theta^T d^\mu_{t,i} \leq 0, \; i = 1, 2, \dots\}$,   (5.6)

where $d^\mu_{t,i} \in \mathbb{R}^{d_\theta}$ is to be determined. Negation. To encode the negation of an atomic predicate, $\neg\mu \stackrel{def}{=} \mathbb{P}(S^\pi_t \notin F) \geq \epsilon$, we use similar reasoning as above and obtain a set of linear constraints on $\theta$ similar to Eq. (5.6). Conjunction. Let $\varphi_i$, $i = 1, 2, \dots, m$ be a set of formulas and consider the formula $\varphi = \bigwedge_{i=1}^m \varphi_i$ defined by their conjunction. Let $g^i_t(\theta)$, $i = 1, 2, \dots, m$ be the sets that encode the formula $\varphi_i$ at time $t$. Since conjunction requires simultaneous satisfaction of all predicates, formula $\varphi$ is encoded by the intersection of the constraints corresponding to each predicate, i.e., by the following set:

$g^\wedge_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : \theta \in \bigcap_{i=1}^m g^i_t(\theta)\}$.   (5.7)

Disjunction. Let $\varphi_i$, $i = 1, 2, \dots$
, $m$ be a set of formulas and consider the formula $\varphi = \bigvee_{i=1}^m \varphi_i$ defined by their disjunction. Let $g^i_t(\theta)$, $i = 1, 2, \dots, m$ be the sets that encode the formula $\varphi_i$ at time $t$, and let $b^i_t \in \{0, 1\}$ be a set of binary variables. Then, the disjunction operator is encoded as follows:

$g^\vee_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : \theta \in b^i_t\, g^i_t(\theta), \text{ and } \sum_{i=1}^m b^i_t \geq 1\}$.   (5.8)

The second constraint essentially ensures that at least one of the disjuncts is satisfied. Note that, in contrast to conjunction, disjunction forces us to introduce binary variables. Always and Finally. The temporal operators $G_{[a,b]}\varphi$ and $F_{[a,b]}\varphi$ can be expressed as temporal conjunction and disjunction, respectively, over all time instances $t$ in the interval $I = [a,b]$. Therefore, they can be encoded by the following constraints:

$g^{G\varphi}_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : \theta \in \bigcap_{t=a}^{b} g^\varphi_t(\theta)\}$,   (5.9)
$g^{F\varphi}_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : \theta \in b_t\, g^\varphi_t(\theta), \text{ and } \sum_{t=a}^{b} b_t \geq 1\}$,   (5.10)

where $g^\varphi_t(\theta)$ is the set that encodes the formula $\varphi$ at time $t$ and $b_t \in \{0, 1\}$. Until. Let $\varphi, \psi$ be two StTL formulas. The bounded until can be expressed in terms of the unbounded until as $\varphi\, U_{[a,b]}\, \psi = G_{[0,a]}\varphi \wedge F_{[a,b]}\psi \wedge F_{[a,a]}(\varphi\, U\, \psi)$. Therefore, it suffices to encode the unbounded until. To this end, we introduce the following set

$g^z_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : g^\varphi_t(\theta) \leq 0 \text{ and } g^{\varphi U \psi}_{t+1}(\theta) \leq 0\}$   (5.11)

and encode the unbounded until as follows:

$g^{\varphi U \psi}_t(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : b^\psi g^\psi_t(\theta) \leq 0, \; b^z g^z_t(\theta) \leq 0, \; b^\psi + b^z \geq 1\}$,   (5.12)

for $t = 1, \dots, T-1$, and $g^{\varphi U \psi}_T(\theta) = \{\theta \in \mathbb{R}^{d_\theta} : g^\psi_T(\theta) \leq 0\}$. The reader may have noticed an apparent contradiction: by rewriting the formula using only conjunctions (i.e., using De Morgan's law, $\varphi \vee \psi = \neg((\neg\varphi) \wedge (\neg\psi))$), we could seemingly free ourselves from integer variables, which, unfortunately, is not the case. Observe that we encode the negations of atomic predicates, not of arbitrary sub-formulas. The reason is that if an arbitrary formula $\varphi$ is encoded by a set $g(\theta)$, then the negation $\neg\varphi$ is not encoded by the complement of $g(\theta)$ (or by any other easy-to-extract relation).
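The predicate machinery above (Eqs. 5.2, 5.4, 5.5) is easy to make concrete for a tabular MDP. The following is a toy NumPy sketch, assuming a transition kernel `P` of shape (S, A, S) with `P[s', a, s]` $= P(s|s',a)$ and a policy array `pi` of shape (S, A); the function names are hypothetical.

```python
import numpy as np

def propagate_state_distribution(p0, P, pi, T):
    """Roll out p_t(s) = sum_{s',a} p_{t-1}(s') P(s|s',a) pi(a|s'), as in Eq. 5.2.

    p0: initial distribution rho, shape (S,); P: kernel, shape (S, A, S); pi: policy, shape (S, A).
    Returns the list [p_0, p_1, ..., p_T].
    """
    dists = [p0]
    for _ in range(T):
        joint = dists[-1][:, None] * pi               # occupancy of the pair (s', a)
        dists.append(np.einsum("xa,xas->s", joint, P))
    return dists

def predicate_slack(p_prev, P, pi, unsafe, eps):
    """c^mu_t(theta) of Eq. 5.5: next-step unsafe mass minus eps (<= 0 means satisfied)."""
    q_c = P[:, :, unsafe].sum(axis=2)                 # q_c(s', a) = sum_{s not in F} P(s|s', a)
    return float(np.sum(p_prev[:, None] * q_c * pi)) - eps
```

For example, with a two-state chain where action 0 stays and action 1 flips the state, a uniform policy started at state 0 spreads half its mass to the unsafe state in one step, so the predicate $\mathbb{P}(S^\pi_1 \notin F) \leq 0.6$ is satisfied with slack $-0.1$.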
Therefore, we consider only StTL formulas rewritten in negation normal form and treat the cases of atomic predicates and their negations separately.

Our method, given in Algorithm 3, leverages Eqs. (5.6)-(5.12) to encode an arbitrary formula by recursively generating a set of mixed-integer linear constraints on the parameters θ of the control policy. We effectively use three types of constraints: intersections of linear constraints (conjunction), unions of linear constraints (disjunction), and linear binary constraints that control which of the aforementioned linear constraints are active when taking their union. Our approach differs from the encoding presented in [128] because we encode predicates via linear constraints on the parameters of the control policy, whereas the latter enforces a constraint on an introduced binary variable. The benefit of our approach is that no binary variables are necessary to encode negations, conjunctions and the globally operator (i.e., conjunction is encoded as the intersection of constraints). Our approach requires exactly N_d + (N_u + 3)T binary variables, where N_d is the number of disjunction operators and N_u the number of until operators. This is substantially better than the O(T|P|) binary variables required by [128], where |P| is the cardinality of the predicate set. Our algorithm essentially reduces all operations to N_c conjunctions and N_d disjunctions, and introduces integer variables only for the latter. Let g_i^c(θ) and g_{jk}^d(θ) be the (linear) constraints recursively generated by Algorithm 3; the former correspond to conjunctions and the latter to disjunctions. We reformulate Problem 5.1 as follows:

max_{θ, b_jk} min_{λ≥0} L(λ; π_θ, D)   (5.13a)
subject to: θ ∈ g_i^c(θ), i = 1, ..., n_c   (5.13b)
θ ∈ ∪_{k=1}^{n_b} b_jk g_{jk}^d(θ), j = 1, ..., n_d   (5.13c)
Σ_{k=1}^{n_b} b_jk ≥ 1, j = 1, ..., n_d   (5.13d)

where L(λ; π, D) = H_β(π) − λ^T (f_r(π) − f̂_r(D)) is the Lagrangian and b_jk ∈ {0, 1}.
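The recursive constraint generation can be sketched in a few lines. The following is an illustrative Python sketch, not the thesis implementation (all names are hypothetical): formulas in negation normal form are nested tuples, conjunction merges child constraint sets without binaries, and disjunction adds one binary activation variable per branch, mirroring Eqs. (5.7)-(5.8).

```python
# Illustrative sketch of recursive constraint generation (cf. Algorithm 3).
# A "constraint set" is abstracted as a list of predicate names; we also count
# the binary variables introduced. All names here are hypothetical.

def encode(formula):
    """Return (constraints, num_binary_vars) for an NNF StTL formula."""
    op = formula[0]
    if op in ("pred", "neg"):           # atomic predicate or its negation:
        return [formula[1]], 0          # linear constraints only, no binaries
    if op == "and":                     # conjunction: intersect child sets
        cons, nb = [], 0
        for child in formula[1:]:
            c, b = encode(child)
            cons += c
            nb += b
        return cons, nb
    if op == "or":                      # disjunction: one binary per branch
        cons, nb = [], 0
        for child in formula[1:]:
            c, b = encode(child)
            cons += c
            nb += b + 1                 # b_i activates the i-th branch
        return cons, nb
    raise ValueError(f"unknown operator {op}")

phi = ("or", ("pred", "mu1"), ("and", ("pred", "mu2"), ("neg", "mu3")))
cons, nb = encode(phi)
print(cons)  # ['mu1', 'mu2', 'mu3']
print(nb)    # 2: one binary per disjunct, none for the conjunction
```

The same traversal structure extends to the temporal operators by unrolling them over t = a, ..., b, as in Eqs. (5.9)-(5.12).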
Observe that we use Lagrangian duality to relax the feature expectation matching constraint (5.1b).

Algorithm 3: Recursive Constraint Generation
Inputs: formula ϕ of maximum horizon T, ordered set P of predicates in ϕ, set {C^μ}_{μ∈P} of linear constraints
1  procedure RecConGen(ϕ, {C^μ}_{μ∈P}):
2    switch ϕ do
3      case μ_t or ¬μ_t do
4        return C^μ
5      end
6      case ∧_{i=1}^m ϕ_i do
7        g_t^i(θ) ← RecConGen(ϕ_i, {C^μ}_{μ∈P})
8        g_t^∧(θ) ← {θ ∈ R^{d_θ} : θ ∈ ∩_{i=1}^m g_t^i(θ)}
9        return g_t^∧(θ), t = 1, ..., T
10     end
11     case ∨_{i=1}^m ϕ_i do
12       g_t^i(θ) ← RecConGen(ϕ_i, {C^μ}_{μ∈P})
13       g_t^∨(θ) ← {θ ∈ R^{d_θ} : θ ∈ b_t^i g_t^i(θ), i = 1, 2, ..., m, and Σ_i b_t^i ≥ 1}
14       return g_t^∨(θ), t = 1, ..., T
15     end
16     case ϕ U_[a,b] ψ do
17       g_t^ϕ(θ) ← RecConGen(ϕ, {C^μ}_{μ∈P})
18       g_t^ψ(θ) ← RecConGen(ψ, {C^μ}_{μ∈P})
19       g_t^{ϕUψ}(θ) ← Eq. (5.12)
20       return g_t^{ϕUψ}(θ), t = 1, ..., T
21     end
22   end

This form is convenient because the resulting problem has only mixed-integer linear constraints and can be solved using standard solvers. We note that, even though our framework is not restricted to finite-horizon formulas, we impose an upper bound of T on the horizon of ϕ. The number of constraints is proportional to T; if the horizon were infinite, we would have an infinite number of linear constraints, which requires more sophisticated techniques to solve [129]. The issue is resolved by upper-bounding the time horizon. Finally, it is worth noting that for the special case of a formula ϕ consisting only of conjunctions and globally operators, our method introduces no binary variables and Problem 5.13 has only linear constraints.

Figure 5.1: Hyperplane approximation for the formula ϕ = μ_1 ∨ μ_2. The feasible region of ϕ is D_1 ∪ D_2 and is approximated by the union of the two hyperplanes θ^T(θ* − θ_1^ac) = 0 and θ^T(θ* − θ_2^ac) = 0. The union is non-convex, so a binary variable is introduced that controls which constraint is active.
5.4 Cut and Generate Method

In the previous section, we assumed that the constraint given by Eq. (5.5) is encoded via a set of linear constraints and used this assumption to recursively generate constraints for ϕ. In this section, we show how to leverage the recursive constraint generation of Algorithm 3 to generate linear constraints for any formula. The fundamental idea is to iterate between using cutting hyperplanes to approximate the feasible region of Eq. (5.5) and generating constraints for the entire formula using the obtained hyperplanes. We call this approach cut-and-generate (Algorithm 4).

We start by initializing the set of constraints C to the empty set (Line 1), along with a set C^μ that keeps track of the linear constraints generated for each predicate μ (Line 2). Then, we solve the unconstrained Problem 5.13a (i.e., the Lagrange dual with the feature expectation matching constraint relaxed) to obtain an initial solution θ* (Line 3). Next, we check whether θ* satisfies the specification ϕ. This is achieved by evaluating Eq. (5.5) at θ* for each predicate in ϕ and propagating the resulting truth values through the StTL semantics. If θ* does not satisfy ϕ, a new, more constrained problem is constructed. To this end, we use θ* and a point that belongs to the feasible space of Eq. (5.5) in order to construct a cutting hyperplane as a linear approximation of that feasible space (see Fig. 5.1 for an illustration). The hyperplane introduces a new linear constraint to the problem and eliminates a half-space from the search space. Intuitively, that point should be centered; points close to the boundary eliminate "less" space and may require more iterations. In the convex optimization literature, several centering methods have been proposed [130].
Due to its simplicity, we choose the analytic center, which is found by solving the following unconstrained minimization:

θ_t^ac = arg min_{θ∈R^{d_θ}} − log(−c_t^μ(θ)).   (5.14)

Observe that the analytic center corresponds to the minimization of the logarithmic barrier function. To find the minimum, we first calculate the gradient of Eq. (5.5):

∇c_t^μ(θ) = Σ_{s',a} q_c(s',a) [p_{t−1}(s') ∇π_θ(a|s') + π_θ(a|s') ∇p_{t−1}(s')],   (5.15)

where q_c(s',a) is given by Eq. (5.4). The gradient of p_t(s) is found by differentiating Eq. (5.2) and is given by the following recursive equation:

∇p_t(s) = Σ_{s',a} P(s|s',a) [p_{t−1}(s') ∇π_θ(a|s') + π_θ(a|s') ∇p_{t−1}(s')].   (5.16)

Even though Eq. (5.14) is not convex for practical policy parameterizations (e.g., neural networks), we leverage the gradient given by Eq. (5.15) to find a local minimum via standard gradient methods. Also, note that we need to find the analytic center only once for each predicate (i.e., we can pre-solve Eq. (5.14) for speed). Following that, we use the vector θ* − θ_t^ac to construct a cutting hyperplane and introduce this constraint to the problem (Lines 6-8, the "cut" step). Then, we use the RecConGen algorithm to recursively generate constraints for the entire formula (Line 10, the "generate" step) and add those to the problem (Line 11). Observe that the constraints defining the set Θ (Line 11) contain integer variables, which are created by the RecConGen algorithm. Finally, we solve the newly constructed problem (Line 12) and iterate. The main benefit of our approach is that we approximate the feasible region of predicates (Eq. (5.5)) via linear constraints and then propagate these linear constraints through the formula.
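As a toy illustration of the "cut" step, the sketch below minimizes the log-barrier −log(−c(θ)) by gradient descent to obtain the analytic center and then forms the cutting hyperplane θ^T(θ* − θ_ac) ≤ 0. The quadratic constraint here is a hypothetical stand-in for Eq. (5.5), not the thesis implementation.

```python
import numpy as np

# Toy predicate constraint c(θ) = ||θ||² − 1 ≤ 0 (feasible set: unit ball).
# This quadratic stands in for the MDP-based constraint of Eq. (5.5).
def c(theta):
    return theta @ theta - 1.0

def grad_c(theta):
    return 2.0 * theta

def analytic_center(theta0, lr=0.1, steps=200):
    """Minimize the log-barrier −log(−c(θ)) by gradient descent (cf. Eq. 5.14)."""
    theta = theta0.copy()
    for _ in range(steps):
        # ∇[−log(−c)] = ∇c / (−c); valid only while c(θ) < 0
        theta -= lr * grad_c(theta) / (-c(theta))
    return theta

theta_star = np.array([2.0, 0.0])            # infeasible candidate, c(θ*) > 0
theta_ac = analytic_center(np.array([0.3, 0.1]))
# Cutting hyperplane: keep only the half-space θᵀ(θ* − θ_ac) ≤ 0
normal = theta_star - theta_ac
print(np.round(theta_ac, 3))                 # ≈ [0, 0], the center of the ball
print(c(theta_ac) < 0)                       # True: the center is feasible
```

For the toy ball constraint the analytic center coincides with the geometric center; for Eq. (5.5) the same gradient-descent step applies with Eq. (5.15) supplying ∇c.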
This allows us to approximate the feasible region of the problem via mixed-integer linear constraints. Note that directly using the non-linear constraint given by Eq. (5.5) would result in mixed-integer non-linear constraints, which are far more difficult to handle.

Algorithm 4: Cut and Generate Algorithm
Inputs: MDP M\R, demonstrations D, parameterized policy π_θ(a|s), formula ϕ of maximum horizon T, ordered set P of atomic predicates in ϕ
Output: optimal policy π_θ*
1  C ← {}
2  C^μ ← {}, for all predicates μ ∈ P
3  θ* ← arg max_θ min_{λ≥0} L(λ; π_θ, D)
4  while θ* is not feasible do
5    forall predicates μ in P do
6      θ_t^ac ← arg min_θ −log(−c_t^μ(θ))
7      g_t^μ(θ) ← {θ ∈ R^{d_θ} : θ^T(θ* − θ_t^ac) ≤ 0}
8      C^μ ← {θ ∈ R^{d_θ} : θ ∈ g_t^μ(θ) ∧ θ ∈ g(θ), ∀g(θ) ∈ C^μ}
9    end
10   C ← C ∪ RecConGen(ϕ, {C^μ}_{μ∈P})
11   Θ ← {θ ∈ R^{d_θ} : g(θ), ∀g(θ) ∈ C}
12   θ* ← arg max_{θ∈Θ} min_{λ≥0} L(λ; π_θ, D)
13 end

Figure 5.2: Recovered optimal stochastic policies for L1-L4. The agent starts from (0, 0) and receives a reward equal to 1 when it reaches (4, 4). The color map of each triangle in the corresponding rectangle denotes the probability of taking the underlying action.

5.5 Experimental Evaluation

Single-Goal GridWorld (SGGW). Consider an agent moving in an N×N GridWorld environment. The agent starts deterministically from the bottom-left corner (s = (0, 0)) and receives a reward equal to 1 at the terminal state (upper-right corner, s = (N−1, N−1)).
The hole is placed on the upper-left corner (H = (0,N− 1)) and the attractor on the bottom-right (A = (N− 1, 0)). We choose a softmax policy and the remaining parameters are as follows: N = 5,T = 10, 1 = 2 = 0.1, c 1 = 1,c 2 = 0. We consider 4 different learner models: a) an unconstrained learner (L1). This is essentially the standart MCE inverse reinforcement learning framework (i.e., Problem 5.1 without the StTL constraint) and acts as a baseline, b) a learner constrained by ϕ 1 (L2): This learner must always stay away from the hole state with the given probability threshold, c) a learner constrained by ϕ 2 (L3): This learner must eventually get sufficiently close to the attractor state and d) a learner constrained by ϕ 1 ∧ϕ 2 (L4): This learner is constrained by the conjunction of ϕ 1 and ϕ 2 , i.e., it must stay away from the hole and approach the attractor. The results are shown in Fig. 5.2. Observe that L2 tends to ”push” the agent away from the upper-left corner; L3 tends to ”attract” the agent to the bottom-right corner; and L4 combines both. Multi-Goal GridWorld (MGGW). This enviroment is similar to the previous one, albeit the transitions are deterministic, there exist obstacles and multiple goal states. We consider the following formulas: 1. Obstacle avoidance: ϕ 3 = G [0,T ] P(d(s t ,obs)≤ c 3 )≤ 3 , where d(s t ,obs) is the distance of the agent from the closest obstacle. 86 Figure 5.3: The multi-goal gridworld enviroment along with the policy recovered under specification ϕ 3 ∧ϕ 4 . 2. Multi-Goal reach: The agent is required to reach either one of goal states, i.e., ϕ 4 = F [0,T ] P(d(s t ,g1) =c 4 )≥ 1− 4 ∨ F [0,T ] P(d(s t ,g2) =c 5 )≥ 1− 5 . The imposed specification is ϕ 3 ∧ϕ 4 . To generate expert trajectories, we find the optimal policy using value iteration and then run it in the enviroment. The parameters are as follows: N = 8,T = 16,c 3 = 1,c 4 =c 5 = 0, 3 = 4 = 5 = 0.05. The results are shown in Fig. 5.3. Robustness. 
To quantify the degree of satisfaction of each of the aforementioned formulas over the policies learned by L1-L4, we use the metric of StTL robustness. For a more formal treatment of StTL robustness, the reader may consult [125]. Informally, a high positive robustness indicates "better" satisfaction, negative robustness denotes that the formula is not satisfied, and a robustness value equal to zero denotes marginal satisfaction. For the multi-goal GridWorld case, we consider 4 different learners, similarly to the SGGW. The results are shown in Tables 5.1 and 5.2. Observe that each formula attains the highest robustness over the policy trained for that formula, e.g., ϕ_1 attains the highest robustness under learner L2, which is trained with ϕ_1. Notice that the cross terms (e.g., ϕ_1 under L3) attain slightly positive robustness values. This is explainable by the nature of the specifications and the environment: for instance, if the agent avoids the hole, then it is more inclined to be closer to the attractor. Finally, observe that L1 attains negative robustness for all formulas, which indicates that standard MCE does not suffice to satisfy the specifications.

Table 5.1: Single-Goal GridWorld
            L1 (MCE)   L2     L3     L4
ϕ_1         −0.44      1.98   0.10   0.15
ϕ_2         −0.25      0.07   1.44   0.28
ϕ_1 ∧ ϕ_2   −0.23      0.12   0.25   1.07

Table 5.2: Multi-Goal GridWorld
            L1 (MCE)   L2     L3     L4
ϕ_3         −0.14      1.24   0.18   0.25
ϕ_4         −0.35      0.15   1.58   0.13
ϕ_3 ∧ ϕ_4   −0.25      0.17   0.24   1.21

Robustness of the underlying StTL formulas with respect to the policies learned by learners L1-L4. Higher robustness is better; negative robustness implies that the formula is not satisfied.

Scalability and Time Complexity. To demonstrate the applicability of our method to problems with larger state spaces, we perform a scalability analysis. In more detail, we define the metric "overhead" as the run-time of Algorithm 4 minus the run-time of the MCE algorithm (Line 12 of Algorithm 4).
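The probabilistic satisfaction checks underlying such robustness scores can be approximated by Monte-Carlo rollouts of the learned policy. The following is a hedged illustration (hypothetical helper, not the evaluation code used in the thesis) of checking a G_[0,T]-type predicate, P(s_t ∈ bad-region) ≤ ε at every time step:

```python
import random

# Monte-Carlo check of an "always" predicate over policy rollouts:
# at every time step, the empirical probability of being in `bad` must be ≤ eps.
def satisfies_always(rollouts, bad, eps):
    """True iff, at each t, the fraction of rollouts whose state is in `bad` is ≤ eps."""
    T = len(rollouts[0])
    for t in range(T):
        p_bad = sum(traj[t] in bad for traj in rollouts) / len(rollouts)
        if p_bad > eps:
            return False
    return True

random.seed(0)
# 100 rollouts of length 5 over states 0..24 of a 5×5 grid, never visiting state 4
rollouts = [[random.choice([0, 1, 2, 3]) for _ in range(5)] for _ in range(100)]
print(satisfies_always(rollouts, bad={4}, eps=0.1))  # True: state 4 never visited
```

A "finally" predicate is checked analogously by requiring the threshold to hold at some time step rather than at all of them; the robustness value itself refines these Boolean checks into a signed margin, as formalized in [125].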
Figure 5.4: Empirical estimation of the time complexity: overhead (in seconds) versus grid size for SGGW and MGGW.

The reason for defining this metric is that our method essentially builds on top of the MCE algorithm, which we use as a black box and whose run-time we have no influence over. Therefore, this metric essentially captures the number of successive refinements in the hyperplane approximation. We compute the overhead for both SGGW and MGGW for different grid sizes. In the MGGW case, the size of the obstacle increases with the size of the grid. We consider the formulas ϕ_1 ∧ ϕ_2 and ϕ_3 ∧ ϕ_4, respectively, and the time horizon is set to 2N, where N is the size of the grid. The results, shown in Fig. 5.4, indicate a linear dependence of the time overhead on the size of the grid for both environments, which is a tolerable overhead given the problem complexity.

5.6 Summary

We address the problem of learning from demonstrations when the agent must satisfy a set of specifications expressed as Stochastic Temporal Logic (StTL) formulas. Building on the maximum causal entropy inverse reinforcement learning framework, our method uses cutting hyperplanes to approximate the feasible region of StTL predicates and propagates these hyperplanes through the StTL semantics to generate constraints for arbitrary formulas. This results in a set of mixed-integer linear constraints that encode the satisfaction of the specification. We validated the practical usability of our approach in different environments and under different specifications.

Chapter 6
Control of Complex Networks

6.1 Fundamentals of Complex Network Control

Recent advances in sensing and actuating technologies have led to unprecedented data streams from diverse large-scale physical, technological and social systems.
This includes many important application domains, such as efficient and reliable power delivery in smart grids, throughput maximization in Internet and transportation networks, and understanding and controlling processes like disease or rumor spreading in social networks. This trend has given rise to an increased interest in applying control-theoretic tools to the analysis of complex dynamical networks. Given the large-scale nature of these systems, in most practical cases it is unrealistic to expect that we would be able to interact with and/or measure the state of each individual component. As such, one of the most important problems is that of effectively placing actuators and/or sensors such that certain properties are satisfied. A prominent example in this field was presented by Liu et al. in [131], where the authors used Kalman's rank condition and the idea of structural controllability [132] to solve the minimum actuator placement (AP) problem by solving a maximum bipartite matching. Even though several variants of that problem have subsequently been considered [133, 134, 135], these approaches are limited by the fact that they focus exclusively on structural controllability, which may be a crude metric in some cases, as suggested by empirical findings in the field of cellular reprogramming [136]. On the other hand, control-energy (i.e., Gramian) based metrics [137] have been extensively used [138, 139, 140] to quantify the controllability of a complex network, as they are related to the energy required to move the system in the state space. An additional benefit of using Gramians is that some of the corresponding metrics exhibit the submodularity property [141], which allows us to solve the AP problem for large-scale systems using greedy methods that are efficient and come with sub-optimality guarantees. In this line of work, the proposed approaches include convex relaxations under rank constraints [142] and upper bounds on the control effort [143].
Unfortunately, the latter approaches rely on the submodularity of the average control energy, which was initially thought to be submodular but was later proved to be non-submodular [144]. Additionally, these approaches are restricted by the fact that they focus only on the number of placed actuators. This implicitly assumes a homogeneous AP cost, which may be misleading, as the cost to actuate the network may depend on where the actuators are placed, either due to device cost or due to the location/properties of the actuation site. Additionally, most work on actuator placement has been carried out under the assumption that systems are inherently described by integer-order dynamics, and not much is known for the case where the dynamics are characterized by long-term memory. Long-term memory is captured by the notion of the fractional derivative [145], and many authors have shown that it is a better-suited operator to describe the dynamics observed in real-world systems such as brain networks [146, 147], power networks [148], and physiological and biological networks [149]. Even though classical control-theoretic notions have been extended to fractional systems [145], energy-based methods for AP under cost constraints have not been proposed.

Contributions: In this chapter, we aim to address the aforementioned limitations and provide a unified framework for AP in complex networks that takes into account a wide spectrum of temporal dynamics under possibly heterogeneous actuation placement costs. In more detail, we propose a linear, discrete-time, fractional-order dynamical system as a model for complex dynamical networks with long-term memory. Furthermore, we consider the cost-efficient AP problem for this system, where we assume heterogeneous AP costs and upper-bound the total cost via a knapsack-like constraint.
Building on recent advances in theoretical computer science on non-submodular optimization under knapsack constraints, we present a greedy algorithm for the AP problem with approximation guarantees that depend on quantities measuring how far the Gramian metric is from being submodular. Our method is validated in an extensive set of simulations.

6.2 Network Model and Problem Statement

We denote the set of natural numbers as N, the set of real numbers as R, and let [n] = {1, 2, ..., n} for all n ∈ N. Given a set V, |V| denotes its cardinality and 2^V its power set. Matrices are represented by capital letters and vectors by lower-case letters. The function 1(·) is the indicator function, which takes the value 1 when the statement in its argument is true and the value 0 otherwise. We write A ≻ 0 (A ⪰ 0) to denote a positive definite (positive semi-definite) matrix. We denote by I_n the identity matrix of size n; when the subscript is omitted, the dimension shall be inferred from the context. For a vector s ∈ R^n, S = diag(s) denotes the n×n diagonal matrix such that S_ii = s_i. A weighted, directed simple graph is represented by the triplet G = (V, E, w), where V denotes the set of n = |V| nodes, E ⊆ V×V the set of edges, and w : E → R a function that assigns a weight to each edge. Two nodes v_i, v_j ∈ V, where i, j ∈ [n], are called neighbors iff (v_i, v_j) ∈ E; if, in addition, (v_j, v_i) ∈ E for every such pair, the graph is said to be undirected. Additionally, A ∈ R^{n×n} denotes the (weighted) adjacency matrix such that A_ij = w(v_i, v_j). Let G = (V, E, w) be a graph and A its adjacency matrix. Each node v_i ∈ V is associated with a dynamical state x_i[k] ∈ R, which evolves in discrete time k ∈ N_+ influenced by the interactions of its neighbors.
We model the temporal evolution of the entire network via the following fractional-order complex dynamical network:

Δ^α x[k+1] = A x[k] + B_S u[k],   (6.1)

where x[k] = (x_1[k], x_2[k], ..., x_n[k])^T ∈ R^n is a vector containing the states of all the nodes in the network at time k, x[0] = x_0 is a given initial state, u[k] ∈ R^m is the value of the m-dimensional control input signal injected into the network at time k, and α ∈ (0, 1]^n is a vector of fractional-order exponents. The matrix B_S ∈ R^{n×m} is the input matrix, which identifies those nodes that are actuated by a control input signal. The operator Δ^α = [Δ^{α_1}, ..., Δ^{α_n}]^T denotes the Grünwald-Letnikov discretized derivative of order α, which models the effects of temporal, long-term memory. Given a vector δ ∈ {0, 1}^n, we define the input matrix as

B_S = diag(δ).   (6.2)

Each δ_i = 1 indicates that the corresponding node is being actuated, whereas δ_i = 0 indicates that the node receives no actuator. In what follows, we use the following definition:

Definition 6.1 (Actuator Set). Given a vector δ ∈ {0, 1}^n, define

S = {v_i ∈ V : δ_i = 1}.   (6.3)

The subset S ⊆ V is called the actuator set and each node s_i ∈ S is called an actuator.

We consider the problem of finding a minimal actuator set such that some specified metric of controllability is optimized. Given an actuator set S, let f(S) : 2^V → R be one such metric. We define a cost function c : V → R_{≥0}, which associates a non-negative cost c(v) with each node v ∈ V in the network. Let C be the maximum allowed value of the AP cost. We define the following problem:

maximize_{S⊆V} f(S)  subject to  Σ_{s∈S} c(s) ≤ C.   (AP)

This formulation is essentially a knapsack-constrained AP problem and generalizes the cardinality-constrained problem, which has been extensively considered in the literature. In fact, by assuming unit cost for all nodes, the constraint of Problem (AP) reduces to a cardinality constraint.
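Under the simplifying assumption of a commensurate order (a single scalar α for all nodes), the dynamics of Eq. (6.1) can be simulated directly from the Grünwald-Letnikov expansion, x[k+1] = A x[k] + B u[k] − Σ_{j=1}^{k+1} (−1)^j C(α, j) x[k+1−j]. The following is an illustrative sketch, not the thesis code; the thesis also allows node-wise (non-commensurate) orders:

```python
import numpy as np

# Sketch: simulate Δ^α x[k+1] = A x[k] + B u[k] (cf. Eq. 6.1) for a scalar
# commensurate order α, using the Grünwald-Letnikov memory expansion.

def gl_coeffs(alpha, K):
    """psi[j] = (-1)^j * binom(alpha, j), via the standard term ratio recurrence."""
    psi = np.zeros(K + 1)
    psi[0] = 1.0
    for j in range(1, K + 1):
        psi[j] = psi[j - 1] * (j - 1 - alpha) / j
    return psi

def simulate(A, B, u, x0, alpha, K):
    n = len(x0)
    psi = gl_coeffs(alpha, K + 1)
    X = np.zeros((K + 1, n))
    X[0] = x0
    for k in range(K):
        # the "memory tail": all past states contribute, weighted by psi
        mem = sum(psi[j] * X[k + 1 - j] for j in range(1, k + 2))
        X[k + 1] = A @ X[k] + B @ u[k] - mem
    return X

A = np.array([[0.0, 0.1], [0.1, 0.0]])
B = np.eye(2)
u = np.zeros((10, 2))
X = simulate(A, B, u, x0=np.array([1.0, 0.0]), alpha=1.0, K=10)
# Sanity check: for α = 1 the memory vanishes (psi[j] = 0 for j > 1) and the
# model reduces to the integer-order system x[k+1] = (A + I) x[k].
```

For α < 1 the coefficients psi[j] decay slowly, so every past state keeps influencing the future, which is exactly the long-term memory effect discussed around Eq. (6.6).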
Before tackling this problem, we study in the next section the controllability properties and the minimum-energy state transfer problem for the fractional-order model proposed in Eq. (6.1).

6.3 Minimum Energy State Transfer

In this section, we gain further insight into the model given by Eq. (6.1) and study the minimum-control-energy state transfer problem. Each element of the Grünwald-Letnikov operator is given by

Δ^{α_i} x[k] = (1/h^{α_i}) Σ_{j=0}^k (−1)^j C(α_i, j) x[k−j],   (6.4)

for all α_i, i = 1, ..., n. The sampling period h ∈ N_+ is taken as unity. The generalized binomial coefficient is given by

C(α_i, j) = 1(j = 0) + [Γ(α_i + 1) / (Γ(j + 1) Γ(α_i − j + 1))] 1(j > 0),   (6.5)

where Γ(·) is the Euler Gamma function. If all fractional-order coefficients are equal, the system is referred to as a commensurate-order system; otherwise, it is a non-commensurate-order system. When the fractional-order exponents approach 1, the system tends to become memoryless, and when they are equal to 1 the model reduces to the commonly used one for integer-order complex dynamical networks [131]. On the contrary, when these coefficients approach zero, the system exhibits far more pronounced memory, meaning that past states have a greater effect on future ones. By expanding the Grünwald-Letnikov operator and defining A_0 = A − (c_1 + 1)I_n and A_j = diag{c_{j+1}^i, i = 1, 2, ..., n}, where c_j^i = −(−1)^j C(α_i, j), we can write Eq. (6.1) as follows:

x[k+1] = Σ_{j=0}^k A_j x[k−j] + B_S u[k].   (6.6)

This equation elucidates the fact that the system described by Eq. (6.1) exhibits long-term memory. The model described by Eq. (6.6) can be characterized as a discrete-time system with infinite time delay in the state. To extract the solution, we define the following matrices:

G_k = 1(k = 0) I_n + 1(k ≥ 1) Σ_{j=0}^{k−1} A_j G_{k−j−1}.   (6.7)

Note that the definition is recursive and that G_k can be expressed as a recurrent sum of products of A_j G_{k−j−1} or, equivalently, as a recurrent sum of products of the matrices A_j.

Lemma 6.1. The solution to Eq.
(6.1) is given by

x[k] = G_k x[0] + Σ_{j=0}^{k−1} G_{k−j−1} B_S u[j].   (6.8)

A fundamental question for the system given in Eq. (6.1) is whether it is possible to transfer the state of the system from a given initial value to any other state. To address this question, we introduce the following notion.

Definition 6.2. Given a terminal time K ∈ N_+, the model described by Eq. (6.1) is called K-controllable if for all initial states x_0 ∈ R^n and all final states x_f ∈ R^n there exists a control input u[k], k = 0, 1, ..., K−1, that transfers the system from x[0] = x_0 to x[K] = x_f.

While this definition appears similar to the well-known definition of controllability, a subtle difference lies in the fact that in the K-controllable case the terminal time is given. K-controllability is a fundamental property of fractional-order systems and can be viewed as a structural attribute that depends on the topology of the graph G via its adjacency matrix A and on the structure of the matrix B_S.

Definition 6.3 ([145]). Given a terminal time k ∈ N_+, we define the k-controllability matrix as

C(0, k) = [G_0 B_S   G_1 B_S   G_2 B_S   ···   G_k B_S]   (6.9)

and the K-controllability Gramian as

W_S(0, k) = Σ_{j=0}^{k−1} G_j B_S B_S^T G_j^T.   (6.10)

It is easy to show that W_S(0, k) = W_S^T(0, k), W_S(0, k) ⪰ 0, and W_S(0, k) = C(0, k) C^T(0, k). It is worth noting that, due to the recursive structure of the matrices G_j, no Lyapunov equation can be extracted for the K-controllability Gramian. However, this poses no restriction, as the Gramian can easily be calculated via the summation in Eq. (6.10). In fact, an additional benefit of using Eq. (6.10) is that the system is not required to be stable.

Theorem 6.1. The system given by Eq. (6.1) is K-controllable if and only if rank(W_S(0, K)) = n, or, equivalently, if and only if rank(C(0, K)) = n.
Additionally, the input sequence given by

U_K = C^T(0, K) W_S^{−1}(0, K) (x_f − G_K x_0)   (6.11)

solves the following minimum-input-energy state transfer problem:

minimize_{u[k]} J_u(0, K)
subject to Δ^α x[k+1] = A x[k] + B_S u[k], x[0] = x_0, x[K] = x_f,   (P1)

where J_u(0, K) = Σ_{k=0}^{K−1} u[k]^T u[k] is the cost functional. This implies that U_K transfers the system from the initial state x[0] = x_0 to the final state x[K] = x_f while minimizing the control energy. The minimal value of the cost functional is

J_u*(0, K) = (x_f − G_K x_0)^T W_S^{−1}(0, K) (x_f − G_K x_0).   (6.12)

An important remark is that, in the integer-order case, it is well known that the rank of C(0, k) cannot increase for k ≥ n; therefore, it suffices to examine C(0, n). On the contrary, for the general case of system (6.1), the rank of C(0, k) may increase for values of k higher than n [145]. Hence, examining the controllability (in the standard sense) of system (6.1) is inconclusive: if C(0, k) is not of full row rank for some k ≥ n, there exists no guarantee that the same holds for k+1. The notion of K-controllability circumvents this problem by fixing the time horizon over which controllability is studied. Similar statements can be made for other scalar metrics based on the K-controllability Gramian.

6.4 Efficient Actuator Selection

In the AP literature, several scalar metrics have been proposed, such as the trace of the Gramian, the trace of its inverse, or the logarithmic determinant. Even though such metrics have been used extensively, they do not necessarily capture the minimum input energy when the problem under consideration is to transfer the system from one given state to another. In such cases, it is better to consider the optimal objective value of Problem (P1). Hence, we propose the following objective function for Problem (AP):

f(S) = −u^T W_S^{−1} u,   (6.13)

where u = x_f − G_K x_0 and we have omitted the arguments from the K-controllability Gramian.
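The recursion for G_k, the K-controllability Gramian, and the objective of Eq. (6.13) are straightforward to compute numerically. A hedged sketch follows (illustrative only; the A_j memory matrices of Eq. (6.6) are assumed given, here with zero memory so that G_k reduces to the integer-order case G_k = A_0^k):

```python
import numpy as np

# Sketch: build G_k (Eq. 6.7), the K-controllability Gramian W_S (Eq. 6.10),
# and the energy objective f(S) = -u^T W_S^{-1} u (Eq. 6.13).

def g_matrices(A_list, K):
    """G_0 = I, G_k = sum_{j=0}^{k-1} A_j G_{k-j-1} (Eq. 6.7)."""
    n = A_list[0].shape[0]
    G = [np.eye(n)]
    for k in range(1, K + 1):
        G.append(sum(A_list[j] @ G[k - j - 1] for j in range(k)))
    return G

def gramian(G, S, n, K):
    B = np.diag([1.0 if i in S else 0.0 for i in range(n)])   # B_S = diag(δ)
    return sum(G[j] @ B @ B.T @ G[j].T for j in range(K))     # Eq. (6.10)

def f(S, A_list, u, K):
    n = A_list[0].shape[0]
    W = gramian(g_matrices(A_list, K), S, n, K)
    return -u @ np.linalg.solve(W, u)                         # -u^T W_S^{-1} u

# Integer-order sanity case: A_1 = A_2 = ... = 0, so G_k = A_0^k.
A0 = np.array([[0.5, 0.2], [0.3, 0.1]])
A_list = [A0] + [np.zeros((2, 2))] * 9
u = np.array([1.0, 1.0])
# Adding an actuator adds a PSD term to W_S, so f is monotone non-decreasing
# (cf. Theorem 6.2):
print(f({0, 1}, A_list, u, K=4) >= f({0}, A_list, u, K=4))    # True
```

Note that np.linalg.solve requires W_S to be non-singular, i.e., the chosen actuator set must render the system K-controllable (Theorem 6.1); otherwise the transfer u is unreachable and the objective is taken as −∞.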
Since the trace of the inverse of the Gramian is not submodular [144], we expect the same for Eq. (6.13). In what follows, we review some notions associated with submodularity that are required for stating our results in the context of Problem (AP).

Definition 6.4. The set function f : 2^V → R is called monotone increasing if for all subsets S ⊆ T ⊆ V it holds that f(S) ≤ f(T), and monotone decreasing if for all subsets S ⊆ T ⊆ V it holds that f(S) ≥ f(T). The definition extends to monotone non-increasing/non-decreasing in the standard fashion.

Definition 6.5. The set function f : 2^V → R is called submodular if for all subsets S ⊆ T ⊆ V and all elements ω ∉ T it holds that f(S ∪ {ω}) − f(S) ≥ f(T ∪ {ω}) − f(T), or, equivalently, if for all subsets S, T ⊆ V it holds that f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T). If the reversed inequalities hold, the function is called supermodular, and it is called modular if it is both submodular and supermodular.

Intuitively, submodularity is a diminishing-returns property, as it implies that adding an element to a smaller set gives a larger gain than adding it to a larger set. The function Δ_f(ω|S) = f(S ∪ {ω}) − f(S) gives the marginal gain of inserting node ω into the set S. The condition for submodularity can be expressed in terms of the marginal gains as Δ_f(ω|S) ≥ Δ_f(ω|T) for all S ⊆ T ⊆ V and all elements ω ∉ T.

Lemma 6.2. The set function f : 2^V → R is monotone increasing if and only if Δ_f(ω|S) > 0 for all ω ∈ V, S ⊆ V, and monotone decreasing if the reverse inequality holds. If the inequalities are not strict, the function is monotone non-decreasing and monotone non-increasing, respectively.

Definition 6.6. Let f : 2^V → R be a set function. The diminishing-returns ratio (DR-ratio) is defined as the largest γ ∈ [0, 1] such that

γ Δ_f(ω|T) ≤ Δ_f(ω|S)   (6.14)

for any S ⊆ T ⊆ V and ω ∈ V \ T. The submodularity ratio is defined as the largest β ∈ [0, 1] such that

β Δ_f(T|S) ≤ Σ_{ω∈T\S} Δ_f(ω|S)   (6.15)

for all S, T ⊆ V.
Finally, the curvature is defined as the smallest a ∈ [0, 1] such that

Δ_f(ω | (S \ {ω}) ∪ T) ≤ (1 − a) Δ_f(ω | S \ {ω})   (6.16)

for all T, S ⊆ V and ω ∈ S \ T.

For any non-decreasing set function it holds that γ ≤ β, and the function is submodular if and only if γ = β = 1. Additionally, the function is supermodular if and only if a = 0 [150]. These quantities measure the "distance" of a set function from being submodular or supermodular and allow us to derive approximation guarantees for greedy algorithms.

Theorem 6.2. The set function f(S) = −u^T W_S^{−1} u is monotone non-decreasing. Additionally, define the matrices

Q_±(ω|S) = W_S (W_ω^{−1} ± W_S^{−1}) W_S,   (6.17)

let λ_1^±(ω|S) ≤ ... ≤ λ_n^±(ω|S) be their eigenvalues and y_i^±(ω|S), i = 1, ..., n, the corresponding eigenvectors. The DR-ratio γ, submodularity ratio β and curvature a of the set function f(S) = −u^T W_S^{−1} u are bounded as follows:

1 ≥ β, γ ≥ [u^T Q_+^{−1}(∅|V) u · min_{ω∈V} λ_n^+(ω|ω)] / [n · max_{i,ω,T} (u^T y_i^+(ω|T))^2],   (6.18)

0 ≤ a ≤ 1 − [min_{ω∈V} λ_n^−(∅|ω) · min_{i,ω,S,T} (u^T y_i^−(ω|S∪T))^2] / [λ_1^−(∅|V) · max_{i,ω,T} (u^T y_i^−(ω|T))^2],   (6.19)

where the minima and maxima are taken over all i = 1, ..., n, ω ∈ V, and S, T ⊆ V.

Proof. To prove that f(S) is monotone non-decreasing, we utilize Lemma 6.2. For all ω ∈ V, S ⊆ V, we have

Δ_f(ω|S) = −u^T (W_{S∪{ω}}^{−1} − W_S^{−1}) u = −u^T [(W_S + W_ω)^{−1} − W_S^{−1}] u = u^T [W_S (W_S^{−1} + W_ω^{−1}) W_S]^{−1} u = u^T Q_+^{−1}(ω|S) u ≥ 0,   (6.20)

where we used the Woodbury matrix identity (A + B)^{−1} = A^{−1} − A^{−1}(A^{−1} + B^{−1})^{−1} A^{−1} and the fact that if A, B ≻ 0 then A(A^{−1} + B^{−1})B ≻ 0. The proof of the second part of the theorem builds upon the one in [151] and is derived by appropriately upper- and lower-bounding Eqs. (6.14)-(6.16). Initially, it is easy to see that Q_+(∅|V) ⪰ Q_+(ω|S) for all ω ∈ V and S ⊆ V. Taking into account Eq. (6.20), we have

Δ_f(ω|S) = u^T Q_+^{−1}(ω|S) u ≥ u^T Q_+^{−1}(∅|V) u.   (6.21)

Additionally, for the right-hand side of Eq. (6.15), we have

Σ_{ω∈T\S} Δ_f(ω|S) ≥ |T\S| · u^T Q_+^{−1}(∅|V) u ≥ u^T Q_+^{−1}(∅|V) u.
(6.22)
For the upper bound, we have
$$\Delta_f(\omega|S) = u^T Q_+^{-1}(\omega|S)\, u = \sum_{i=1}^n \frac{\left( u^T y_i^+(\omega|S) \right)^2}{\lambda_i^+(\omega|S)} \leq \frac{n \cdot \max_{i,\omega,T} \left( u^T y_i^+(\omega|T) \right)^2}{\min_{\omega \in V} \lambda_n^+(\omega|\omega)}, \qquad (6.23)$$
where the maximum is taken over all $i = 1, \ldots, n$, $\omega \in V$, $T \subseteq V$. By combining Eq. (6.21)-(6.23) we obtain Eq. (6.18). The proof of Eq. (6.19) is similar.

These bounds allow us to propose a greedy algorithm for Problem (AP) (Algorithm 5). The algorithm builds upon the work of Khuller et al. on the budgeted maximum coverage problem [152], where a partial enumeration scheme over all feasible solutions was introduced. The proposed algorithm follows a similar logic and consists of two phases. In the first phase, we compute the set $S_1$ of maximal objective value among all feasible subsets of size $|S| \leq P$, where $P = \left\lceil \frac{a}{\gamma(a - 1 + e^{-a\beta})} \right\rceil$. In the second phase, we restrict to feasible solutions of size $|S| = P$. In each iteration $t$ of the second phase we pick the node $v_t$ that maximizes the ratio $\rho(v)$ of the marginal gain to the cost associated with actuating $v$. If the cost constraint is not violated, then $v_t$ is added to the set $S_t$; otherwise it is discarded. The second phase terminates by picking from all sets $S_t$ the one that maximizes the objective value. Finally, the optimal set $S^*$ is chosen between $S_1$ and $S_2$. The approximation guarantee is stated below; the proof can be found in [153]. Let $S_{greedy}$ be the actuator set given by Algorithm 5

Algorithm 5: Greedy for Non-Submodular Optimization
Input: set function $f : 2^V \to \mathbb{R}$
1: $P \leftarrow \left\lceil \frac{a}{\gamma(a - 1 + e^{-a\beta})} \right\rceil$
2: $S_1 \leftarrow \arg\max_S \{ f(S) : S \subseteq V, |S| \leq P, c(S) \leq C \}$
3: for all $Y \subseteq V$ with $|Y| = P$ and $c(Y) \leq C$ do
4:   $S_0 \leftarrow Y$, $V_0 \leftarrow V$, $t \leftarrow 1$
5:   repeat
6:     $\rho(v) \leftarrow \Delta_f(v \mid S_{t-1}) / c(v)$
7:     $v_t \leftarrow \arg\max_{v \in V_{t-1} \setminus S_{t-1}} \rho(v)$
8:     if $\sum_{\tau=1}^{t} c(v_\tau) + \sum_{y \in Y} c(y) \leq C$ then
9:       $S_t \leftarrow S_{t-1} \cup \{v_t\}$, $V_t \leftarrow V_{t-1}$
10:    else
11:      $S_t \leftarrow S_{t-1}$, $V_t \leftarrow V_{t-1} \setminus \{v_t\}$
12:    $t \leftarrow t + 1$
13:  until $V_t \setminus S_t = \emptyset$
14:   $S_2^Y \leftarrow \arg\max_t f(S_t)$
15: end
16: $S_2 \leftarrow \arg\max_Y f(S_2^Y)$
17: $S^* \leftarrow \arg\max_{S \in \{S_1, S_2\}} f(S)$
18: return $S^*$

for Problem (AP) using Eq. (6.13) as the objective function, and let $S_{opt}$ be the corresponding optimal solution. The following approximation guarantee holds:
$$f(S_{greedy}) \geq \frac{1}{a} \left( 1 - e^{-a\beta} \right) f(S_{opt}), \qquad (6.24)$$
where $a$ is the curvature and $\beta$ is the submodularity ratio of $f(S)$, as given in Theorem 6.2. Since this approximation guarantee is tight for the problem of non-submodular optimization under a cardinality constraint, which is a special case of Problem (AP), it is also tight for Problem (AP) [150]. Additionally, the presented greedy algorithm is not restricted to the proposed objective function given by Eq. (6.13); it is applicable to any submodular or non-submodular objective, such as those mentioned at the beginning of this section. More specifically, the submodularity ratio and curvature for non-submodular objectives such as the trace of the inverse and the minimal eigenvalue have been derived by other researchers in [151]. However, to the best of our knowledge, this is the first work to provide approximation guarantees for the objective function given by Eq. (6.13), especially in the context of fractional order dynamics.

We emphasize that computing the exact bounds for the DR-ratio, submodularity ratio and curvature is computationally intractable for large networks. This is due to the fact that the maximization in Eq. (6.18) and the minimization in Eq. (6.16) are taken over all possible subsets of the set $V$. Nonetheless, we can easily obtain a good empirical approximation by sampling random subsets from the power set of $V$ and taking the maximization over those samples. Alternatively, we can estimate these quantities by randomly generating subsets according to the definitions given by Eq.
(6.14)-(6.16) and finding the largest values of $\beta, \gamma$ and the smallest value of $a$ such that the respective inequalities are satisfied. An additional computational issue that we need to address is the non-invertibility of the Gramian, which arises when the actuator set does not render the system controllable. A simple remedy is to assume that there is a pre-existing set of actuators that renders the system controllable. In practice, this assumption is consistent with perturbing the Gramian by $\epsilon I$ for a small $\epsilon > 0$ to ensure that no numerical issues arise.

6.5 Case Studies

We present experimental results of our proposed method for random networks generated from the Erdős–Rényi and Barabási–Albert models. Initially, we generate unweighted random networks, the structure of which defines the non-zero entries of the network dynamics matrix (i.e., the matrix $A$ in Eq. (6.1)). Then, we assign weights to the non-zero entries of this matrix by drawing samples from the standard normal distribution, and costs to the nodes by drawing uniform random numbers in $[0,1]$. We use identical fractional order exponents equal to 0.5 for all the nodes of the network. The time horizon was set to $K = n$ and the upper bound on the cost was $C = 5$. Both the initial and final states were randomly chosen vectors in $\mathbb{R}^n$.

Figure 6.1: (a) Erdős–Rényi ($p = 0.4$) network of $n = 30$ nodes. Smaller nodes have smaller AP cost. Algorithm 5 selected 12 nodes to be actuated (shown in red). (b) Barabási–Albert ($m_0 = 5$, $m = 4$) network of $n = 30$ nodes. Smaller nodes have smaller AP cost. Algorithm 5 selected 14 nodes to be actuated (shown in red).

Figure 6.1a shows an instance of the Erdős–Rényi model where the edge connection probability was $p = 0.4$, higher than the connectivity threshold $\ln(n)/n$ [154]. Figure 6.1b shows an instance of a Barabási–Albert network, whose edge structure is generated with a preferential attachment mechanism that produces power-law degree distributions [155].
The initial network consists of $m_0 = 5$ nodes, and each newly added node is connected to $m = 4$ existing nodes with a probability proportional to the number of links the existing nodes already have. Observe that in both cases the greedy algorithm tends to select nodes that have lower AP cost. Nonetheless, there exist nodes of lower cost that were not selected. This indicates that the ratio $\rho(v)$ (line 6) of the marginal gain over the cost for such nodes is low, and it suggests that it may be preferable to actuate high-cost nodes if the benefit (in terms of the marginal gain on the objective function) is high. In addition, we studied the effect of the time horizon $K$ on the resulting actuator set by using two different time horizons, $K = n$ and $K = 3n$. We observed that the time horizon has virtually no effect on the resulting actuator set. This indicates that, even though the controllability properties of fractional order systems can, in theory, change for $K > n$, in practice setting a time horizon equal to the system dimension is sufficient.

To assess the performance of Algorithm 5 against the optimal actuator set, we restrict ourselves to small networks such that enumeration of all possible actuator sets is computationally tractable. In more detail, we generate 50 instances from both network models of size $n = 12$ nodes, with all other parameters as before. We find that the greedy algorithm achieves on average more than 95% of the globally optimal objective value for both network models. In fact, the greedy algorithm recovered the exact optimal actuator set for approximately 35% of the instances of the Erdős–Rényi model and for 25% of the Barabási–Albert model. Given that we used diverse network models, randomly generated AP costs, and multiple instances sampled from each model, these findings constitute strong empirical evidence for the effectiveness of Algorithm 5.
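The cost-benefit rule at the heart of the second phase can be sketched in a few lines. The following is a simplified, single-pass illustration (not the full two-phase algorithm with partial enumeration over seed sets $Y$); the function names and the toy Gramian-style objective, regularized by $\epsilon I$ as discussed above so that the inverse always exists, are ours, not from the thesis.

```python
import numpy as np

def greedy_budgeted(f, V, cost, budget):
    """Cost-benefit greedy: repeatedly pick the node with the best
    marginal-gain-to-cost ratio rho(v); discard nodes that exceed the budget."""
    S, remaining, spent = [], set(V), 0.0
    while remaining:
        base = f(S)
        v = max(remaining, key=lambda w: (f(S + [w]) - base) / cost[w])
        remaining.discard(v)
        if spent + cost[v] <= budget and f(S + [v]) > base:
            S.append(v)
            spent += cost[v]
    return S

# Toy objective f(S) = -u^T W_S^{-1} u, with W_S a sum of per-node rank-one
# contributions plus eps*I (the regularization discussed in the text).
rng = np.random.default_rng(0)
n, eps = 8, 1e-3
B = rng.standard_normal((n, n))
contrib = [np.outer(B[:, i], B[:, i]) for i in range(n)]
u = rng.standard_normal(n)

def f(S):
    W = eps * np.eye(n) + sum(contrib[i] for i in S)
    return -u @ np.linalg.solve(W, u)

cost = rng.uniform(0.1, 1.0, size=n)
S = greedy_budgeted(f, range(n), cost, budget=2.0)
print(S, round(sum(cost[i] for i in S), 3))
```

As in the experiments above, a cheap node with a small marginal gain may be skipped in favor of a costlier node whose ratio $\rho(v)$ is higher.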
6.6 Summary

This part of the thesis studies the actuator placement problem under a cost constraint for complex dynamical networks with long-term memory, for which we introduced a fractional order model and studied the minimum energy state transfer problem. Using the optimal value of the cost functional for the latter problem, we proposed a Gramian-based set function for the actuator placement problem and derived bounds on its DR-ratio, submodularity ratio and curvature. We leveraged these bounds to propose a greedy algorithm with approximation guarantees that takes the actuator placement cost constraint into account. We validated our proposed method using randomly generated networks from the Erdős–Rényi and Barabási–Albert models. We compared the performance of the greedy algorithm against the optimal solution for networks that are small enough to permit brute-force search and observed that it achieves on average 95% of the globally optimal objective value.

Chapter 7
Control under Multiple Objectives

7.1 Introduction

In this final chapter of the thesis we address the problem of control under multiple objectives. The standing assumption is that no model is available but a simulator of the system may be queried arbitrarily many times. Such an assumption is essentially the building premise of Reinforcement Learning and has played a pivotal role in solving several CPS control problems of practical interest that would otherwise be intractable, such as robotic locomotion [156] and the Atari [157] and Go games [158], to name a few. While such approaches have focused on the scalar-reward setting, many real-world problems have multiple, conflicting objectives and therefore cannot be addressed by current reinforcement learning tools. In such scenarios, it is challenging to find an optimal control policy, as the trade-offs (preferences) among the objectives may not be precisely known at training time or may differ from user to user.
The paradigm of Multi-Objective Reinforcement Learning (MORL) provides a generalized framework for dealing with multi-dimensional reward signals and has been identified as one of the main challenges of real-world reinforcement learning [159]. However, learning optimal policies for MORL has proven to be quite challenging, because most strategies either require access to (or the ability to infer) a quantification of the relative importance of the objectives, or they perform sophisticated searches in the value or policy space aiming to find an ensemble of policies that are non-inferior to each other. The former methods lack adaptability and generalizability, as their performance is tied to the given or inferred preferences, while the latter suffer from scalability issues, because they learn multiple policies, and are usually complicated to implement. In this work, we aim to strike a balance between these two families of methods.

Related Work: Multi-Objective Optimization (MOO) provides the fundamental tools for MORL. In MOO, there are two common solution concepts: scalarization and Pareto optimality. The former derives a scalar objective and uses standard single-objective optimization techniques [160, 161, 162, 163, 164]. The latter is based on the concept of Pareto dominance and considers the set of all non-inferior solutions [165, 166]. Multiple gradient methods that leverage first-order, necessary conditions for Pareto optimality have also been developed [167, 168, 169].

Like MOO, existing work in MORL can be roughly divided into two main categories. Single-policy methods aim to maximize a single, scalarized reward. These methods essentially transform the problem into a single-objective MDP and differ mostly in the way they determine and express the preferences. Scalarization is usually performed using a weighted sum of the reward vector [170, 171, 172] or, less commonly, using linear mappings [173].
Other single-policy methods are based on thresholds or lexicographic ordering [174], or on different classes of preferences [175, 176]. More recently, a scalarized Q-learning algorithm has been developed [177] which uses the concept of corner weights to infer and optimize over preferences; it was extended to handle dynamic preferences in [178] and to utilize the convex envelope of the Pareto front in [179]. Finally, a scale-invariant supervised learning approach for encoding the preferences was developed in [180].

On the other hand, multiple-policy methods use the vector-valued reward signal and learn a set of policies that approximates the Pareto front. The Pareto optimal set is the subset of non-inferior and equally good alternative policies among all distributions in the policy space, and multiple-policy methods differ mainly in the approximation scheme for the Pareto front. One common approach is to perform multiple runs [171, 170] of a single scalarized reward function over a set of preferences. Unfortunately, such methods do not scale to high-dimensional rewards. Other approaches leverage convex approximations of the Pareto front [181] or linear approximations of the value function [182, 183] to learn optimal deterministic policies. Multi-objective fitted Q-iteration [184, 185] enables learning policies for all linear preferences by encapsulating the preference vector as an input to the Q-function approximator. A policy gradient method based on discrete approximations of the Pareto front was proposed in [186] and enhanced to utilize continuous approximations in [187].

Single-policy methods have the advantage of learning a single policy and being simple to implement. However, they suffer from instability issues, as a small change in the preferences may lead to performance collapse [188], and they rely on heuristics to infer the preferences when these are not known.
On the other hand, multiple-policy methods enjoy the advantage of being able to adapt to changing preferences, because the solution (being an approximation to the Pareto front) encapsulates all trade-offs between the objectives. However, this benefit comes at a higher computational cost, as we typically have to learn and store multiple policies. In this work, we aim to bridge the gap between single- and multiple-policy MORL methods: our method learns a single policy along with the underlying preference vectors and is able to adapt the inferred preferences to any preference distribution. Our preference vectors are inferred not via heuristics but by approximating the Pareto front via a first-order condition (see Fig. 7.1 for an example). Our method is inspired by the multiple gradient descent algorithm for MOO introduced in [168], as well as its application to multi-task supervised learning presented in [189].

Contributions: In this chapter, we propose a policy gradient method for multi-objective reinforcement learning under unknown, linear preferences. In the first part, we present Pareto stationarity as a necessary, first-order condition for Pareto optimality and develop a multi-objective policy gradient algorithm which uses a projected gradient descent solver to search for, and take steps in, a common ascent direction for all objectives. In the second part of the chapter, we tackle the problem of adapting the policy gradient to any given preference distribution. We utilize our method from the first part and introduce Pareto Policy Adaptation (PPA), a loss function that penalizes deviations between the given preference distribution and the recovered preferences. Using implicit differentiation, we are able to back-propagate the gradient of PPA while bypassing the operations of the projected gradient descent solver, which makes our method applicable to real-world problems. We evaluate our method in a series of synthetic and real-world domains.
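The common ascent direction mentioned in the contributions is obtained by minimizing the norm of a convex combination of the per-objective gradients over the unit simplex (this is developed formally later in the chapter). A minimal numpy sketch of the idea follows; the function names are ours, and the sorted-threshold simplex projection is used here as a stand-in for the Newton-based solve of the dual variable described later.

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection onto the unit simplex: find the threshold tau
    with sum(max(w - tau, 0)) = 1, then clip (sorted-threshold method)."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(w)) + 1) > 0)[0].max()
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(w - tau, 0.0)

def min_norm_weights(G, steps=500, lr=0.1):
    """Projected gradient descent for min_w ||sum_m w_m g_m||^2 over the
    simplex; G is an (M, d) array of per-objective policy gradients."""
    M = G.shape[0]
    w = np.full(M, 1.0 / M)
    GG = G @ G.T  # Gram matrix of the gradients
    for _ in range(steps):
        w = project_simplex(w - lr * 2.0 * GG @ w)
    return w

# Two directly opposing gradients: the minimizing combination is (0.5, 0.5),
# for which the combined direction vanishes (a Pareto-stationary point).
G = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(min_norm_weights(G))  # [0.5 0.5]
```

When the minimal norm is nonzero, the weighted combination of the gradients is a direction along which all objectives can be improved simultaneously.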
7.2 Fundamentals of Reinforcement Learning

7.2.1 Markov Decision Processes

Figure 7.1: True and identified Pareto front for the Deep Sea Treasure environment (time penalty vs. treasure value).

We consider the framework of a Markov Decision Process (MDP) $\stackrel{\text{def}}{=} (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P})$, where $\mathcal{S}$ and $\mathcal{A}$ denote the (finite) state and (continuous) action spaces, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition kernel representing the environment dynamics, i.e., $\mathcal{P}(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ when taking action $a$, and $r(s,a) \equiv \mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a function mapping state-action tuples to rewards. A stationary control policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the set of probability distributions over the Borel sets of $\mathcal{A}$, is a function that assigns a probability distribution over actions to every state. We denote by $\Pi$ the space of all stationary, stochastic policies. The value $v^\pi : \mathcal{S} \to \mathbb{R}$ of a policy $\pi$ is defined at state $s \in \mathcal{S}$ as $v^\pi(s) \stackrel{\text{def}}{=} \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s \right]$, where $\gamma \in (0,1)$ is a discount factor and the expectation is taken under the distribution over trajectories $\tau = (s_0, a_0, s_1, a_1, s_2, \ldots)$ obtained by starting from $s$ and following $a_t \sim \pi(\cdot \mid s_t)$ thereafter. The goal of reinforcement learning is to find a policy $\pi^* \in \Pi$ that maximizes an objective, typically taken to be the expectation of the value function under the initial state distribution, i.e., $\pi^* \in \arg\max_{\pi \in \Pi} \mathbb{E}_{s \sim \beta}[v^\pi(s)]$, where $\beta$ is the initial state distribution. The discounted state-action occupation measure under policy $\pi$ is defined as $d^\pi(s,a) \stackrel{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t P(s_t = s, a_t = a \mid \pi)$ for all states $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$, and it allows us to write the value function compactly as $v^\pi(s) = \mathbb{E}_{d^\pi}[r(s,a) \mid s_0 = s]$.
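The discounted-sum definition of the value function above has a direct operational reading: a Monte Carlo estimate of $v^\pi(s)$ averages the discounted return over simulated trajectories. A small illustration, where the `sample_episode` argument is a placeholder for a user-supplied environment/policy rollout (an assumption for this sketch, not something defined in the thesis):

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory's reward sequence."""
    g, total = 1.0, 0.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total

def mc_value(sample_episode, s, gamma, n_episodes=1000):
    """Monte Carlo estimate of v_pi(s): average the discounted return over
    episodes generated by the (user-supplied) simulator sample_episode."""
    returns = [discounted_return(sample_episode(s), gamma) for _ in range(n_episodes)]
    return sum(returns) / n_episodes

# A constant reward of 1 gives the geometric series 1/(1 - gamma):
print(discounted_return([1.0] * 200, 0.5))  # 2.0 (up to floating point)
```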
7.2.2 Multi-Objective Markov Decision Processes

A Multi-Objective Markov Decision Process (MOMDP) $\stackrel{\text{def}}{=} (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \beta, \gamma, \Omega, f_\Omega)$ is the same as an MDP, with the reward $r(s,a) \equiv \mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^M$ being a vector-valued function and the tuple augmented with a space of preferences $\Omega$ and a preference function $f_\omega : \mathbb{R}^M \to \mathbb{R}$ which produces a scalar utility using a preference $\omega \in \Omega$. In this work, we consider linear preferences, $f_\omega(r(s,a)) = \omega^T r(s,a)$. Analogously to MDPs, the value $v^\pi : \mathcal{S} \to \mathbb{R}^M$ of a policy $\pi$ is defined as $v^\pi(s) \stackrel{\text{def}}{=} \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s \right]$. We observe that a fixed preference vector $\omega \in \Omega$ allows us to directly compare the vectorized reward and value functions by comparing their scalarized utilities, and, therefore, the MOMDP collapses to a standard MDP. On the other hand, if we consider all possible returns $\tilde{r} \stackrel{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$, we define the Reward Space Pareto Coverage Set (RS-PCS), or Pareto frontier, as $\mathcal{F}_r^* \stackrel{\text{def}}{=} \{\tilde{r} : \tilde{r} \in \mathbb{R}^M, \nexists \tilde{r}' \succ \tilde{r}\}$, where the symbol $\succ$ denotes Pareto dominance: greater or equal in all objectives and strictly greater in at least one objective. For all possible preferences in $\Omega$ and under the linear preference assumption, we define the Reward Space Convex Coverage Set (RS-CCS) as
$$\mathcal{C}_{R,\Omega}^* \stackrel{\text{def}}{=} \{\tilde{r} : \tilde{r} \in \mathcal{R},\ \exists \omega \in \Omega \text{ s.t. } \omega^T \tilde{r} \geq \omega^T \tilde{r}',\ \forall \tilde{r}' \in \mathbb{R}^M\}. \qquad (7.1)$$
The set $\mathcal{C}_{R,\Omega}^*$ contains all discounted reward vectors that could be optimal for some preference vector in $\Omega$. Anything in $\mathcal{F}_r^*$ but not in $\mathcal{C}_{R,\Omega}^*$ cannot be useful under linear preferences; it therefore suffices to restrict our analysis to $\mathcal{C}_{R,\Omega}^*$. Also, we assume, without loss of generality, that $\sum_m \omega_m = 1$, $\omega_m \geq 0$, $m = 1, 2, \ldots, M$ for all $\omega \in \Omega$, i.e., the scalarized utility is a convex combination of the objectives.

7.3 Learning with Unknown Preferences

In this section, we propose a new policy gradient algorithm for MORL under unknown preferences.
The key idea of our approach is to infer the preference vector $\omega$, use it as the scalarization coefficient for the multi-objective criterion, and ascend along the gradient of the scalarized value. We start by extending the definition of Pareto optimality from the reward space (given in Sec. 7.2.2) to the policy space.

Definition 7.1. A policy $\pi$ dominates a policy $\pi'$ if and only if $v^\pi \succ v^{\pi'}$. A policy $\pi^*$ is called Pareto optimal if there exists no other policy $\pi$ that dominates $\pi^*$. The set $\Pi_P^*$ of all Pareto optimal policies is called the Pareto policy set, and its image $\mathcal{F}_\pi^* \stackrel{\text{def}}{=} \{v^\pi\}_{\pi \in \Pi_P^*}$ is called the Pareto policy frontier. The set
$$\mathcal{C}_{\Pi,\Omega}^* \stackrel{\text{def}}{=} \{\pi : \pi \in \Pi,\ \exists \omega \in \Omega \text{ s.t. } \omega^T v^\pi \geq \omega^T v^{\pi'},\ \forall \pi' \in \Pi\} \qquad (7.2)$$
is called the Policy Space Convex Coverage Set (PS-CCS).

Lemma 7.1. Let $\pi \in \mathcal{C}_{\Pi,\Omega}^*$ be a policy. Then we have $\mathbb{E}_\pi[\tilde{r}] \in \mathcal{C}_{R,\Omega}^*$, i.e., the expected return vector under policy $\pi$ belongs to the RS-CCS.

Proof. Consider a policy $\pi \in \mathcal{C}_{\Pi,\Omega}^*$ and a policy $\pi' \in \Pi$. Then, for each $\omega \in \Omega$, we can write
$$\omega^T v^{\pi'} = \sum_{s,a} d^{\pi'}(s,a)\, \omega^T r(s,a) = \sum_{s,a} d^\pi(s,a) \frac{d^{\pi'}(s,a)}{d^\pi(s,a)}\, \omega^T r(s,a) = \mathbb{E}_{d^\pi}\left[ \omega^T r'(s,a) \right], \qquad (7.3)$$
where we used the importance sampling identity and defined $r'(s,a) \stackrel{\text{def}}{=} \frac{d^{\pi'}(s,a)}{d^\pi(s,a)} r(s,a)$. Per Definition 7.1, there exists $\omega \in \Omega$ such that $\omega^T v^\pi \geq \omega^T v^{\pi'}$. Leveraging the fact that the value
By finding a policy π∈C ∗ Π,Ω , Lemma 7.1 guarantees that the expected return vector under π belongs inC ∗ R,Ω and, therefore, maps the problem of reward space exploration to policy space exploration. Additionally, Lemma 7.1 imposes no restrictions on the class of policies and allows us to adopt the common practice of parameterizing the policy by high-dimensional vector θ∈ Θ (e.g., a neural network). The definition of PS-CCS given Eq. (7.4) carries over to this parametric class as is, i.e., we can define the PS-CCS in the policy parameter space as follows C ∗ Θ,Ω def ={θ :θ∈ Θ,∃ω∈ Ω s.t.ω T v π θ ≥ω T v π θ 0 ,∀θ 0 ∈ Θ}. (7.4) To search for a Pareto optimal policy, we leverage a fist-order condition known as Pareto stationarity. Definition 7.2 (Pareto stationarity). A policy π θ is called Pareto stationary if and only if P M m=1 ω m ∇ θ E d π θ [r m (s,a)] = 0 for some ω m ∈ [0, 1], m = 1, 2...M. Pareto stationarity implies that there exists a vanishing convex combination of the ob- jectives’ gradients and is a necessary condition for Pareto optimality [168]. The following theorem connect the PS-CCS with Pareto stationarity. Theorem 7.1. Let π θ ∈C ∗ Θ,Ω be a policy in the PS-CCS. Then, π θ is Pareto stationary. Proof. Letπ θ ∈C ∗ Θ,Ω be a policy andω the corresponding preferences, i.e., we haveω T v π θ ≥ ω T v π θ 0 , ∀θ 0 ∈ Θ. The last relation can be written as ω T E π θ [˜ r]≥ ω T E π θ 0 [˜ r], ∀θ 0 ∈ Θ. 111 This implies that θ is a global maximizer ofω T E π θ [˜ r] or, equivalently, that its gradient at θ vanishes. Therefore, we have ∇ θ ω T E π θ [˜ r] = M X m=1 ω m ∇ θ E d π θ [r m (s,a)] = 0, (7.5) which corresponds to the necessary condition for Pareto stationarity as per Def. 7.2. Theorem 7.1 essentially states that Pareto stationarity is a necessary condition for a policy to belong to PS-CCS. 
Additionally, as stated in the proof, the convex combination of the objectives that leads to vanishing gradients coincides with the preferences that make the underlying policy part of the PS-CCS. This presents us with a strong necessary condition for recovering the unknown preferences and for identifying an optimal policy, because the problem of finding a Pareto stationary policy is tractable via gradient methods. In fact, if a policy is not Pareto stationary, then there exists a common ascent direction for all objectives. To find this ascent direction, we use the following theorem.

Theorem 7.2 ([168], Theorem 1). Consider the following quadratic optimization problem:
$$\min_{\omega_1, \ldots, \omega_M} \Big\| \sum_{m=1}^M \omega_m \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_m(s,a)] \Big\|_2^2 \quad \text{s.t.} \quad \sum_{m=1}^M \omega_m = 1,\ \omega_m \geq 0,\ \forall m. \qquad (P_A)$$
Let $\omega^*$ be the optimal solution of Problem $P_A$ and $\mu^*$ the underlying minimal-norm objective value. Then:
1. either $\mu^* = 0$ and $\pi_\theta$ is a Pareto stationary policy,
2. or $\sum_{m=1}^M \omega_m^* \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_m(s,a)]$ is a common ascent direction for all objectives.

Problem $P_A$ is essentially the formulation of the well-known Minimum-Norm Point (MNP) problem, which arises in many fields, such as optimal control [190], submodular minimization [191] and portfolio optimization [192]. The MNP problem has been well studied, and several combinatorial and recursive algorithms have been proposed [193, 194, 195]. Such methods search for the exact solution and typically have exponential run-time [196], which prevents their application to high-dimensional control policies. Therefore, we adopt a convex optimization approach for Problem $P_A$. We start by noting that the constraint set $E = \{\omega \in \mathbb{R}^M : \sum_{m=1}^M \omega_m = 1,\ \omega_m \geq 0,\ \forall m = 1, 2, \ldots, M\}$ is the unit simplex, so the projection problem, i.e., $\Pi_E(\omega) = \arg\min_{\delta \in E} \|\omega - \delta\|$, can be efficiently solved. In fact, there exists a unique dual variable $\tau \in \mathbb{R}$ such that
$$\Pi_E(\omega) = (\omega - \tau)_+, \qquad (7.6)$$
To find the dual variable, we enforce the equality constraint, i.e., P m (ω m −τ) + = 1. This non-linear equation of a single variable can be solved very efficiently using the Newton method. This suggests a projected gradient descent solver for solving Problem P A and recovering the unknown preferences. Our method (Algorithm 6) extends the classical REINFORCE algorithm [197] to take gradient steps in the direction given by the weighted combination of the multi-objective gradient, with the weights being the preferences recovered by the projected gradient descent solver. Lemma 7.1, Theorem 7.1 and 7.2 essentially state that Algorithm 6 learns arbi- trary policies in the Convex Coverage Set along with their underlying preference vectors and provide theoretical justification for our method. Algorithm 6 is simple to implement, com- patible with all policy gradient and actor-critic methods and provides a practical algorithm for learning Pareto optimal policies under unknown preferences. However, it suffers from a subtle drawback: we have no control over the identified preferences and they cannot be optimized to account for any given preference distribution. This is the problem of policy adaptation which we address in the next section. It is worth comparing our projected gradient descent solver with [189], where the authors used the Frank-Wolfe method for Problem P A , albeit in the context of supervised learn- ing. Both algorithms have the same convergence rate ofO(1/N) and since the projection 113 Algorithm 6: Multi-Objective Policy Gradient Inputs: Initial parameters θ 0 , learning rate α 1 foreach k = 0, 1,... 
do
2:   Collect trajectories $\mathcal{D} = \{\tau_i\}$, $\tau_i \sim \pi_{\theta_k}$; compute rewards-to-go $\hat{r}_t$ and per-objective gradients $g_m \leftarrow \frac{1}{|\mathcal{D}|} \sum_{\tau,t} \hat{r}_{m,t} \nabla_\theta \log \pi_{\theta_k}(a_t \mid s_t)$
3:   $(\omega_1, \ldots, \omega_M) \leftarrow \text{PGDSolver}(g_1, \ldots, g_M, \alpha)$
4:   $\theta_{k+1} \leftarrow \theta_k + \sum_{m=1}^M \omega_m g_m$
5: end
6: procedure PGDSolver($g_1, \ldots, g_M$, $\alpha$):
7:   initialize a feasible $\omega = (\omega_1, \ldots, \omega_M)$
8:   while not converged do
9:     $\omega_m \leftarrow \omega_m - \alpha\, g_m^T \sum_n \omega_n g_n$, $m = 1, \ldots, M$
10:    find $\tau$ solving $\sum_m (\omega_m - \tau)_+ = 1$
11:    $\omega \leftarrow (\omega - \tau)_+$
12:  end
13:  return $(\omega_1, \ldots, \omega_M)$

to the constraint set can be easily calculated, we opted for the projected gradient descent optimizer. The overhead added by solving for the dual variable $\tau$ is minimal, because Newton's method enjoys a quadratic convergence rate and is very efficient in practice. The trade-off of using Frank-Wolfe is that it requires obtaining the feasible minimizer of the linear approximation of the objective around the current iterate, which is typically solved approximately and worsens the convergence rate by a multiplicative constant [198]. Nonetheless, we expect that both methods can be used interchangeably in practice.

7.4 Pareto Policy Adaptation

The policy gradient algorithm presented in the previous section finds policies that are optimal (i.e., Pareto stationary) for the inferred preferences. In this section, we show how to adapt the policy so that it is optimal for a given distribution $\mathcal{D}_\omega$ over preferences. We start by interpreting the optimal solution of Problem $P_A$ as a function of the policy parameters:
$$\omega^* \stackrel{\text{def}}{=} \omega^*(\theta) = (\omega_1(\theta), \omega_2(\theta), \ldots, \omega_M(\theta)). \qquad (7.7)$$

Figure 7.2: (a) The Convex Coverage Set (CCS) for a 2-dimensional value function $(v_1^\pi, v_2^\pi)$. Algorithm 6 identifies arbitrary solutions on the Pareto front.
(b) Illustration of the PPA loss function for a single preference vector $\omega^0$: by minimizing the distance $\|\omega^*(\theta) - \omega^0\|$, the solution converges to the optimal solution corresponding to preference $\omega^0$. (c) The PPA loss function for two preference vectors $\omega^0, \omega^1$: the minimizer of $\|\omega^*(\theta) - \omega^0\| + \|\omega^*(\theta) - \omega^1\|$ (also called the geometric median) gives the optimal solution because it maximizes the projections onto the preference vectors.

We introduce the Pareto Policy Adaptation (PPA) loss function as follows:
$$L(\theta) = \mathbb{E}_{d^{\pi_\theta},\, \omega} \Big[ \lambda \sum_{m=1}^M \omega_m^*(\theta)\, r_m(s,a) - (1 - \lambda) \|\omega^*(\theta) - \omega\|_2^2 \Big], \qquad (7.8)$$
where $\lambda \in [0,1]$ is a weight that trades off between the two terms. The first term of PPA is the scalarized reward, with the scalarization coefficient being the inferred preferences. The second term penalizes the expected deviation between the given and inferred preferences. To further motivate the PPA loss, consider the special case of a single given preference vector $\omega^0$. In this case, the MOMDP collapses to a standard MDP, with $\omega^0$ being the scalarization coefficient for the reward and value functions. Additionally, assume that the policy $\pi_\theta$ has been adapted to that preference vector, i.e., it achieves $\omega^*(\theta) = \omega^0$ for some parameter vector $\theta$. Then, our loss function in Eq. (7.8) reduces to
$$L(\theta) = \mathbb{E}_{d^{\pi_\theta}} \Big[ \sum_{m=1}^M \omega_m^0\, r_m(s,a) \Big]. \qquad (7.9)$$
Maximization of Eq. (7.9) corresponds to taking ascent steps along $\sum_m \omega_m^0 \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_m(s,a)]$, which, per Theorem 7.2, is an ascent direction guaranteed to increase all objectives. This simplified scenario essentially states that if we exactly match the preferences learned from Problem $P_A$ to the given preference vector $\omega^0$, then not only does the PPA loss function reduce to the loss function for the scalarized reward, but ascent steps along its gradient also lead to concurrent improvements of all objectives. We show a geometric illustration of the PPA loss function in Fig. 7.2.
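A numpy sketch of the PPA loss, under the simplifying assumptions that $\omega^*(\theta)$ has already been obtained from the solver and that the expectations in Eq. (7.8) are replaced by sample means over a batch of rewards and sampled preferences; the function name is ours.

```python
import numpy as np

def ppa_loss(omega_star, rewards, omega_samples, lam=0.5):
    """Sample-based version of Eq. (7.8): lambda * scalarized reward under the
    inferred preferences omega_star, minus (1 - lambda) times the mean squared
    distance between omega_star and preferences sampled from D_omega."""
    scalarized = lam * float(omega_star @ rewards.mean(axis=0))
    penalty = (1.0 - lam) * float(np.mean(np.sum((omega_samples - omega_star) ** 2, axis=1)))
    return scalarized - penalty

omega_star = np.array([0.7, 0.3])               # preferences inferred from problem P_A
rewards = np.array([[1.0, 0.0], [0.5, 0.5]])    # batch of 2-objective rewards
# If every sampled preference equals omega_star, the penalty vanishes and the
# loss reduces to the scalarized reward, as in Eq. (7.9):
loss = ppa_loss(omega_star, rewards, np.tile(omega_star, (4, 1)))
print(round(loss, 6))  # lambda * (0.7*0.75 + 0.3*0.25) = 0.3
```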
Despite its simplicity, one important obstacle to using the PPA loss to train agents in practice is that the optimal solution of Problem $P_A$ is analytically known only for $M = 2$ (the solution is presented in the Appendix). For $M > 2$, it can only be approximated by the iterative projected gradient descent solver introduced in Algorithm 6. This makes the backward pass challenging, because the gradients need to be back-propagated through the (perhaps thousands of) iterations of the projected gradient descent solver. Since back-propagation is a very expensive operation compared to the forward pass, this could make our approach inefficient for practical problems. The problem of back-propagating through the solutions of numerical solvers arises in several deep learning architectures, such as deep equilibrium models [199] and optimization-based neural network layers [200]. Luckily, the implicit function theorem allows us to back-propagate the gradient while bypassing the operations of the projected gradient descent solver.

Theorem 7.3. Let $l \stackrel{\text{def}}{=} l(\omega^*(\theta))$ be a generic loss function that depends explicitly on the optimal solution $\omega^*(\theta)$ of Problem $P_A$. Then, the gradient of $l$ with respect to the policy parameters is
$$\nabla_\theta l = \begin{bmatrix} H^T(\theta) & 0 & 0 \end{bmatrix} \begin{bmatrix} G(\theta) & -I_M & e \\ -D(\mu^*(\theta)) & -D(\omega^*(\theta)) & 0 \\ e^T & 0 & 0 \end{bmatrix}^{-T} \begin{bmatrix} \frac{dl}{d\omega^*} \\ 0 \\ 0 \end{bmatrix}, \qquad (7.10)$$
where $[G(\theta)]_{m,n} = 2 g_m^T(\theta) g_n(\theta)$, $[H(\theta)]_{m,\cdot} = -\sum_n \omega_n^*(\theta) \nabla_\theta^T \left( g_n^T(\theta) g_m(\theta) \right)$, $\mu^*(\theta)$ is the optimal dual variable, $e$ and $0$ are the all-ones and zero vectors of appropriate dimensions, $I_M$ is the identity matrix of dimension $M$, and $D(\cdot)$ creates a diagonal matrix from a vector. When
Observe that the matrix H(θ) is of dimension M × d_θ, where d_θ is the dimension of θ. This dependence on d_θ makes a naive solution of Eq. (7.10) inefficient, as we would need to compute and store a matrix of size (2M + 1) × d_θ. However, in practice we are not interested in the full gradient of ω*(θ), but rather in back-propagating the gradient of the loss function. Eq. (7.10) has an appealing form that allows us to use vector-Jacobian products (VJPs) to evaluate the gradient ∇_θ l one row at a time. This frees us from having to store the matrix H(θ) and compute its product with another matrix. We also observe that the solution of Eq. (7.10) requires knowledge of the optimal dual variable μ*(θ), which is not directly available. However, the optimal primal solution ω*(θ) given by the projected gradient descent solver (Algorithm 6) can be utilized to obtain μ*(θ) by substituting it into the KKT optimality conditions of Problem P_A (details in the Appendix) and solving the resulting linear system. That system has M + 1 variables, so its solution adds minimal computational overhead.

7.5 Experimental Evaluation

In this section, we evaluate the performance of our proposed method. In our implementation, we extended the popular Proximal Policy Optimization (PPO) [201] algorithm so that it can handle our PPA loss function. We implemented all experiments in PyTorch, using modified versions of OpenAI Gym environments. We give the details of our experimental setup in the Appendix.
Domains: We evaluate on four environments (details given in the Appendix): (a) Multi-Objective GridWorld (MOGW): a variant of the classical gridworld; (b) Deep Sea Treasure (DST): the classical multi-objective reinforcement learning environment; (c) Multi-Objective SuperMario (MOSM): a modified, multi-objective variant of the popular video game that has a 5-dimensional reward signal; and (d) Multi-Objective MuJoCo (MOMU): a modified version of the MuJoCo physics simulator, focusing on locomotion tasks. The first two are simplified control tasks where the Pareto frontier is easy to identify (see example in Fig. 7.1) and act as a proof-of-concept for our method. With the remaining two environments, we aim to show that our method is effective for handling vector reward signals in real-world problems.

Ablation Study and Baselines: To highlight the extent to which our results are driven by the PPA loss function, we perform an ablation study. In more detail, we consider the following three agents: (a) Pareto Policy Adaptation (PPA): our original loss function as introduced in Sec. 7.4. (b) Non-adaptive PPA (na-PPA): same as PPA but with static preferences, i.e., the policy is not adapted to the preference distribution; this agent is essentially Algorithm 6 without the PPA loss function. (c) Fixed Preferences: same as na-PPA but the preferences are given, which corresponds to optimizing the scalarized objective. Additionally, we compare the performance of our method with a range of state-of-the-art baselines. In particular, we compare against Multi-Objective Fitted Q-Iteration (MOFQI) [185], the Conditioned Network with Optimistic Linear Support (CN-OLS) [177, 178], the Conditioned Network with Diverse Experience Replay (CN-DER) [178], and Envelope Q-Learning (EQL) [179].

Evaluation Metrics: In the MORL literature, single- and multiple-policy algorithms have different evaluation metrics.
Since our method lies at the intersection of single- and multiple-policy algorithms, we use metrics inspired by both classes of methods:

1. Pareto Dominance (PD): Let r̃¹_t, r̃²_t be the return vectors of trajectories sampled from agents 1 and 2, respectively. We define the Pareto Dominance as PD_{1,2} := (1/T) Σ_{t=0}^{T} 1(r̃¹_t ≻ r̃²_t), where 1(·) is the indicator function. This is essentially the fraction of time when the return vector under agent 1 Pareto dominates the return vector under agent 2. Note that PD_{1,2} + PD_{2,1} ≤ 1.

2. Utility (UT): For a return vector r̃_t sampled from any algorithm, we define the Utility metric as UT := (1/T) Σ_{t=0}^{T} E_ω[ω^T r̃_t], where the expected value is evaluated by sampling uniformly from the preference space Ω or from the preference distribution D_ω (if available).

The PD metric is useful for comparing agents pair-wise and is simple to calculate. On the other hand, UT can compare multiple agents at the expense of computing an expectation over preferences, which may become challenging for higher-dimensional rewards.

[Figure 7.3: Results on the MOGW and DST environments: episode reward and episode time penalty over 500,000 frames for PPA, na-PPA, and the fixed-preference agents with ω₁ ∈ {0.25, 0.5, 0.75}.]

Main Results: Figure 7.3 shows the results of the ablation study conducted on the MOGW and DST domains. We compare against the non-adaptive and the fixed-preference agents. For the latter case, we pick three different preference vectors: (ω₁, ω₂) = (0.5, 0.5), (ω₁, ω₂) = (0.75, 0.25) and (ω₁, ω₂) = (0.25, 0.75).
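Before discussing the results, the two evaluation metrics defined above can be made concrete with a short NumPy sketch. The function names are ours, and we adopt the common convention that r¹ ≻ r² means no worse in every objective and strictly better in at least one:

```python
import numpy as np

def pareto_dominance(r1, r2):
    """PD_{1,2}: fraction of timesteps where agent 1's return vector Pareto
    dominates agent 2's. r1, r2: (T, M) arrays of return vectors."""
    geq = np.all(r1 >= r2, axis=1)   # no worse in every objective
    gt = np.any(r1 > r2, axis=1)     # strictly better in at least one
    return np.mean(geq & gt)

def utility(returns, prefs):
    """UT: average scalarized return, with the expectation over preferences
    approximated by the sampled rows of prefs ((K, M), on the unit simplex)."""
    return np.mean(returns @ prefs.T)
```

Note that with this convention PD_{1,2} + PD_{2,1} ≤ 1 holds automatically, since the two dominance events cannot occur at the same timestep.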
These plots facilitate the visual identification of Pareto-dominant agents. Focusing on MOGW (upper row), we observe that our PPA agent converges to a policy that yields a higher reward and a lower time penalty than any other agent, i.e., PPA Pareto dominates all other agents. While a similar statement cannot be made for the DST domain, we note that PPA is not Pareto dominated by any other agent, and the only agent that it does not dominate is the fixed-preference one for (ω₁, ω₂) = (0.5, 0.5). On the other hand, the non-adaptive PPA (na-PPA) performs relatively poorly. However, there is some merit in using it because, in contrast to the fixed-preference agents, na-PPA does not rely on being given a preference vector. It is also worth noting that in the DST domain the preference vector (ω₁, ω₂) = (0.25, 0.75) gives very poor results, which implies that hand-picking the preferences might lead to performance collapse. Overall, these results suggest that PPA steadily achieves better performance and is superior to the other agents. It is interesting to observe that the reward and time-penalty curves are quite similar in the MOGW domain. This agrees with our intuition, because in the MOGW domain the reward and time penalty have an "inverse" relationship (i.e., the sooner we reach the bottom right, the higher the reward).

Method    MOGW PD     MOGW UT      DST PD      DST UT       MOSM PD     MOSM UT
MOFQI     0.35 ±0.07  0.07 ±0.04   0.38 ±0.03  −0.18 ±0.07  0.35 ±0.10  0.10 ±0.09
CN-OLS    0.51 ±0.06  0.19 ±0.06   0.55 ±0.05  0.14 ±0.04   0.47 ±0.05  0.15 ±0.10
CN-DER    0.46 ±0.05  −0.12 ±0.05  0.39 ±0.08  −0.11 ±0.03  0.41 ±0.02  −0.07 ±0.07
EQL       0.53 ±0.04  0.18 ±0.02   0.48 ±0.07  0.20 ±0.06   0.50 ±0.08  0.24 ±0.06
na-PPA    0.33 ±0.06  0.08 ±0.04   0.38 ±0.07  0.12 ±0.04   0.42 ±0.04  0.14 ±0.05
PPA       0.65 ±0.09  0.28 ±0.04   0.75 ±0.06  0.21 ±0.05   0.64 ±0.03  0.27 ±0.04

Table 7.1: Comparison of MORL agents using the Pareto Dominance (PD) and Average Utility (UT) metrics.
Possible values for each metric lie in the ranges [0, 1] and [−1, 1], respectively. In both cases, higher values indicate better performance. We evaluate the agents using 10 random seeds and report the first- and second-order statistics.

In the next series of experiments, we compare our method with state-of-the-art MORL algorithms (Table 7.1). In this comparison, we use both the Pareto Dominance (PD) and Average Utility (UT) metrics. Since PD is a pair-wise metric, we first compute the pair-wise PD_{i,j} values for all agents and then average out the j component to obtain a single value for each agent. Observe that this metric is specific to the ensemble of agents that we consider and needs to be re-calculated if we compare a different set of agents. The UT metric is calculated by sampling preferences from the preference space of each domain. We assumed no prior knowledge of the preferences, so we performed uniform sampling of 1000 points. In the MOGW and DST cases, where the preference space is one-dimensional (recall that preferences are assumed to lie in the unit simplex), this sampling poses no computational challenge. In the MOSM case, however, the preferences are 4-dimensional, so we sample a modest number of 100 preferences along each preference dimension. Additionally, given that the preferences lie in the unit simplex, we scale the rewards to the range [−1, 1] across each dimension. This scaling is important when calculating the average utility but has no influence on the Pareto dominance metric. Observe that some agents have negative utility, which is indeed possible given the aforementioned reward scaling. Our method systematically outperforms all other baselines across all domains, which is another strong empirical finding for its superior performance.

In the last line of experiments, we evaluate the performance of PPA in continuous control tasks. We modified the MuJoCo Gym interface to output 2-dimensional rewards on each step.
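The two evaluation conventions described above, collapsing the pair-wise PD matrix by averaging out the j component and sampling preferences uniformly from the unit simplex, can be sketched as follows. The function names are ours; we use the fact that Dirichlet(1, …, 1) is the uniform distribution on the simplex:

```python
import numpy as np

def average_pd(pd_matrix):
    """Collapse pair-wise PD_{i,j} values into one score per agent by
    averaging over j != i (the diagonal is excluded)."""
    P = np.asarray(pd_matrix, dtype=float)
    n = P.shape[0]
    off = ~np.eye(n, dtype=bool)
    return np.array([P[i, off[i]].mean() for i in range(n)])

def sample_simplex(num, dim, rng=None):
    """Draw `num` preference vectors uniformly from the (dim-1)-dimensional
    unit simplex via Dirichlet(1, ..., 1) sampling."""
    rng = np.random.default_rng(rng)
    return rng.dirichlet(np.ones(dim), size=num)
```

For example, `sample_simplex(1000, 2)` realizes the 1000-point uniform sampling used for MOGW and DST, where the simplex is effectively one-dimensional.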
The first reward is the one commonly used in such environments, i.e., the difference between two sequential positions of the robot. The second rewards the robot for moving in circular paths. We show the results in Fig. 7.4, where we compare against the ablated agents using the Pareto Dominance (PD) and Average Utility (UT) metrics. The only case where our method does not attain the best performance is the Hopper-v3 environment under the UT metric, where the performance of the fixed-preferences agent with (ω₁, ω₂) = (0.75, 0.25) slightly surpasses that of PPA. However, observe that the same agent attains very poor performance in the HalfCheetah-v3 environment. This suggests that, given the complex inter-dependence between the two rewards, simply hand-crafting a preference vector is far from optimal: a preference vector that yields good performance in one environment may perform poorly in a different environment or when judged under a different metric. In contrast, PPA shows consistently better performance across all environments and under both metrics.

[Figure 7.4: PPA on the MuJoCo environments Humanoid-v3, Hopper-v3, HalfCheetah-v3 and Ant-v3: Pareto Dominance (PD) and Average Utility (UT) over 10⁶ frames. Experiments are run for 10 random seeds.]
We compare against the non-adaptive PPA and the fixed-preferences agents.

7.6 Summary and Outlook

We introduced a policy gradient algorithm for Multi-Objective Reinforcement Learning (MORL) under linear preferences. By approximating the Pareto front via a first-order necessary condition, our method learns a single policy that can be adapted to any preference distribution. The two pillars of our method are a projected gradient descent solver that searches for a common ascent direction for all objectives and a novel Pareto Policy Adaptation (PPA) loss function that leverages the solutions of that solver to adapt the policy to any preference distribution. Our method enjoys the advantages of both single- and multiple-policy MORL algorithms while being simple to implement and compatible with all policy gradient and actor-critic methods. We demonstrated the effectiveness of our method in several reinforcement learning tasks. One limitation of our current work is the linear preference assumption, which may be over-simplifying for some problems. In the case of non-linear preferences, our method can still be applied but should be seen as a heuristic. Furthermore, the development of our method relies on Pareto stationarity, which is a necessary but not sufficient condition for Pareto optimality; that is, there is no guarantee of Pareto optimality, only of stationarity. Nonetheless, Pareto optimality is hard to achieve in general, and we often need to resort to first-order approximations.

Chapter 8

Summary and Future Research

This thesis lays the foundation of modern CPS modeling, analysis, learning, and control by identifying a range of novel problems and by addressing them using an ensemble of tools from formal verification, topological data analysis, persistent homology, variational inference, complex networks, non-Markovian dynamics and multi-objective optimization.
We make multiple, distinct and independent contributions in the fields of verification, learning and control. In the field of verification, we identify systemic and environmental uncertainty as a major challenge in modern, complex CPS, and we introduce Stochastic Temporal Logic (StTL) as a formal language for expressing specifications over stochastic dynamical systems. We leverage our formalism for specification mining and robust design under uncertainty, as well as for designing safe control behaviors from imperfect demonstrations. In the field of learning, we develop methods to learn structured feature representations from CPS sensory and vision data. On the one hand, we use persistent homology as a tool for extracting topological features and we learn to represent such features in hyperbolic spaces. On the other hand, we use variational models to capture the hierarchical structure of the latent feature space. We apply both methods to several learning tasks, such as image classification, graph classification and image reconstruction. Our results show a substantial improvement in several performance metrics when compared with the state of the art, and indicate that the topological and hierarchical views of the feature space are the main drivers of that performance. Finally, in the field of CPS control, we address the challenges of non-Markovian modeling assumptions and multi-criteria, data-driven control. For the former, we model non-Markovianity via fractional-order models and study how the topological structure affects the controllability properties. For the latter, we explore gradient descent methods for the data-driven optimization of multi-objective control criteria. We implement and test all of our algorithms and methods to show their effectiveness in a wide spectrum of experimental setups and case studies, and we compare against state-of-the-art approaches. This work paves the way for different research problems that could be explored.
One potential research direction is to further combine StTL with reinforcement learning. In more detail, we could consider an StTL-constrained reinforcement learning agent and develop policy gradient methods that can design an optimal policy that maximizes the discounted expected reward subject to satisfying an StTL constrain. Such an method is of pivotal importance in providing formal guarantees for the safety of reinforcement learning systems. Additionally, a different research direction is the multi-agent extension of our multi-objective policy gradient method. Our PPA algorithm is a single-agent algorithm, i.e., it finds a single policy that trades-off between the different objectives. In the multi-agent case, we optimize over different policies and possible conflicting or competing objectives. Most successful ap- plications of reinforcement learning, i.e., video games, robotics, autonomous driving, involve the interaction of more that one agents. Therefore, there is a great merit in considering the multi-agent extension of PPA, designing optimal policies for each agent under differ- ent assumptions (e.g., complete or incomplete information) and studying the equilibrium properties of the game. 125 Bibliography [1] J. V. Deshmukh, P. Kyriakis, and P. Bogdan, “Stochastic temporal logic abstractions: Challenges and opportunities,” in Formal Modeling and Analysis of Timed Systems - 16th International Conference, FORMATS 2018, Beijing, China, September 4-6, 2018, Proceedings, pp. 3–16, 2018. [2] P. Kyriakis, J. V. Deshmukh, and P. Bogdan, “Specification mining and robust de- sign under uncertainty: A stochastic temporal logic approach,” ACM Trans. Embed. Comput. Syst., vol. 18, pp. 96:1–96:21, Oct. 2019. [3] P. Kyriakis, J. V. Deshmukh, and P. Bogdan, “Learning from demonstrations under stochastic temporal logic specifications,” in 2022 American Control Conference (ACC), 2022. [4] P. Kyriakis, I. Fostiropoulos, and P. 
Bogdan, “Learning hyperbolic representations of topological features,” in International Conference on Learning Representations, 2021. [5] E. Miliotou, P. Kyriakis, J. D. Hinman, A. Irimia, and P. Bogdan, “Neural decoding of visual imagery via hierarchical variational autoencoders,” in Neural Information Processing Symposium, 2022. Submitted. [6] P. Kyriakis, S. Pequito, and P. Bogdan, “On the effects of memory and topology on the controllability of complex dynamical networks,” Scientific Reports, vol. 10, p. 17346, Oct 2020. [7] P. Kyriakis, S. Pequito, and P. Bogdan, “Actuator placement for heterogeneous com- plex dynamical networks with long-term memory,” in 2020 American Control Confer- ence (ACC), pp. 4671–4676, 2020. [8] P. Kyriakis, J. Deshmukh, and P. Bogdan, “Pareto policy adaptation,” in International Conference on Learning Representations, 2022. [9] S. Jha, V. Raman, D. Sadigh, and S. A. Seshia, “Safe autonomy under perception un- certainty using chance-constrained temporal logic,” J. Autom. Reason., vol. 60, pp. 43– 62, Jan. 2018. [10] J. Ding, M. Kamgarpour, S. Summers, A. Abate, J. Lygeros, and C. Tomlin, “A stochastic games framework for verification and control of discrete time stochastic hybrid systems,” Automatica, vol. 49, no. 9, pp. 2665–2674, 2013. 126 [11] M. Althoff, O. Stursberg, and M. Buss, “Model-based probabilistic collision detection in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, pp. 299–310, 2009. [12] M. Ghorbani and P. Bogdan, “Reducing risk of closed loop control of blood glucose in artificial pancreas using fractional calculus,” in Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, pp. 4839– 4842, IEEE, 2014. [13] C. Chatfield, The analysis of time series: an introduction. CRC press, 2016. [14] X. Jin, A. Donz´ e, J. V. Deshmukh, and S. A. Seshia, “Mining requirements from closed- loop control models,” IEEE Trans. Comp. 
Aided Design, vol. 34, no. 11, pp. 1704–1717, 2015. [15] B. Hoxha, A. Dokhanchi, and G. Fainekos, “Mining parametric temporal logic prop- erties in model-based design for cyber-physical systems,” International Journal on Software Tools for Technology Transfer, vol. 20, no. 1, pp. 79–93, 2018. [16] G. Chen, Z. Sabato, and Z. Kong, “Active learning based requirement mining for cyber- physical systems,” in Decision and Control (CDC), 2016 IEEE 55th Conference on, pp. 4586–4593, 2016. [17] O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” in Proceedings of FORMATS/FTRTFT, pp. 152–166, 2004. [18] X. Jin, J. V. Deshmukh, J. Kapinski, K. Ueda, and K. Butts, “Powertrain control verification benchmark,” in Proceedings of the 17th international conference on Hybrid systems: computation and control, pp. 253–262, ACM, 2014. [19] H. Roehm, R. Gmehlich, T. Heinz, J. Oehlerking, and M. Woehrle, “Industrial exam- ples of formal specifications for test case generation,” in Workshop on Applied veRifi- cation for Continuous and Hybrid Systems, ARCH@CPSWeek 2015, pp. 80–88, 2015. [20] B. Hoxha, H. Abbas, and G. Fainekos, “Benchmarks for temporal logic requirements for automotive systems,” in ARCH14-15. 1st and 2nd International Workshop on Applied veRification for Continuous and Hybrid Systems (G. Frehse and M. Althoff, eds.), vol. 34 of EPiC Series in Computing, pp. 25–30, EasyChair, 2015. [21] J. Kapinski, X. Jin, J. Deshmukh, A. Donz´ e, T. Yamaguchi, H. Ito, T. Kaga, S. Kobuna, and S. A. Seshia, “ST-Lib: A Library for Specifying and Classifying Model Behaviors,” in SAE Technical Paper, SAE, 2016. [22] D. Sadigh and A. Kapoor, “Safe control under uncertainty with probabilistic signal temporal logic,” in Robotics Science and Systems, 2016. [23] J. Li, P. Nuzzo, A. Sangiovanni-Vincentelli, Y. Xi, and D. Li, “Stochastic contracts for cyber-physical system design under probabilistic requirements,” in ACM/IEEE Int. Conf. 
on Formal Methods and Models for System Design, 2017. 127 [24] J. V. Deshmukh, P. Kyriakis, and P. Bogdan, “Stochastic temporal logic abstractions: Challenges and opportunities,” in International Conference on Formal Modeling and Analysis of Timed Systems, pp. 3–16, Springer, 2018. [25] B. W. Silverman, Density estimation for statistics and data analysis, vol. 26. CRC press, 1986. [26] Y. Xue and P. Bogdan, “Constructing compact causal mathematical models for com- plex dynamics,” in Proceedings of the 8th International Conference on Cyber-Physical Systems, ICCPS ’17, pp. 97–107, 2017. [27] E. Asarin and O. Maler, “Achilles and the tortoise climbing up the arithmetical hier- archy,” Journal of Computer and System Sciences, vol. 57, no. 3, pp. 389–398, 1998. [28] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004. [29] O. Maler, “Learning Monotone Partitions of Partially-Ordered Domains (Work in Progress).” working paper or preprint, July 2017. [30] A. Donz´ e, T. Ferr` ere, and O. Maler, “Efficient robust monitoring for stl,” in Proceedings of the 25th International Conference on Computer Aided Verification, Proceedings of Computer-Aided Verification’13, (Berlin, Heidelberg), pp. 264–279, Springer-Verlag, 2013. [31] Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimization,” in Evolutionary Programming VII (V. W. Porto, N. Saravanan, D. Waagen, and A. E. Eiben, eds.), (Berlin, Heidelberg), pp. 591–600, Springer Berlin Heidelberg, 1998. [32] E. Asarin, A. Donz´ e, O. Maler, and D. Nickovic, “Parametric identification of temporal properties,” in Proceedings of Runtime Verification, pp. 147–160, 2011. [33] J. Legriel, C. Le Guernic, S. Cotton, and O. Maler, “Approximating the pareto front of multi-criteria optimization problems,” in Proceedings of TACAS, pp. 69–83, 2010. [34] R. Verma, D. D. Vecchio, and H. K. 
Fathy, “Development of a scaled vehicle with longitudinal dynamics of an hmmwv for an its testbed,” IEEE/ASME Transactions on Mechatronics, vol. 13, pp. 46–57, Feb 2008. [35] A. A. Julius and G. J. Pappas, “Approximations of stochastic hybrid systems,” IEEE Transactions on Automatic Control, vol. 54, pp. 1193–1203, June 2009. [36] A. Abate, A. D’Innocenzo, and M. D. D. Benedetto, “Approximate abstractions of stochastic hybrid systems,” IEEE Transactions on Automatic Control, vol. 56, pp. 2688–2694, Nov 2011. [37] M. Kamgarpour, J. Ding, S. Summers, A. Abate, J. Lygeros, and C. Tomlin, “Discrete time stochastic hybrid dynamical games: Verification amp; controller synthesis,” in 2011 50th IEEE Conference on Decision and Control and European Control Confer- ence, pp. 6122–6127, Dec 2011. 128 [38] J. Fu and U. Topcu, “Computational methods for stochastic control with metric in- terval temporal logic specifications,” in Proceedings of the International Conference on Decision and Control (CDC), pp. 7440–7447, IEEE, 2015. [39] A. Rizk, G. Batt, F. Fages, and S. Soliman, “On a continuous degree of satisfaction of temporal logic formulae with applications to systems biology,” in Computational Meth- ods in Systems Biology (M. Heiner and A. M. Uhrmacher, eds.), (Berlin, Heidelberg), pp. 251–268, Springer Berlin Heidelberg, 2008. [40] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-learning for robust sat- isfaction of signal temporal logic specifications,” in IEEE 55th Conference on Decision and Control (CDC), pp. 6565–6570, Dec 2016. [41] G. E. Fainekos and G. J. Pappas, “Robustness of temporal logic specifications for continuous-time signals,” Theoretical Computer Science, vol. 410, no. 42, pp. 4262 – 4291, 2009. [42] A. Donz´ e, T. Ferr` ere, and O. Maler, “Efficient robust monitoring for STL,” in Pro- ceedings of Computer-Aided Verification, pp. 264–279, 2013. [43] A. Bakhirkin, T. Ferr` ere, and O. 
Maler, “Efficient parametric identification for stl,” in Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control, pp. 177–186, ACM, 2018. [44] E. Bartocci, L. Bortolussi, L. Nenzi, and G. Sanguinetti, “On the robustness of tem- poral properties for stochastic models,” in Proceedings of the Second International Workshop on Hybrid Systems and Biology, HSB 2013, Taormina, Italy, 2nd Septem- ber 2013., pp. 3–19, 2013. [45] H. Hansson and B. Jonsson, “A logic for reasoning about time and reliability,” Formal Aspects of Computing, vol. 6, pp. 512–535, Sep 1994. [46] M. Lahijanian, S. B. Andersson, and C. Belta, “Control of markov decision processes from pctl specifications,” in Proceedings of the 2011 American Control Conference, pp. 311–316, June 2011. [47] T. Br´ azdil, K. Chatterjee, M. Chmel´ ık, V. Forejt, J. Kˇ ret´ ınsk´ y, M. Kwiatkowska, D. Parker, and M. Ujma, “Verification of markov decision processes using learning algorithms,” in Automated Technology for Verification and Analysis, pp. 98–114, 2014. [48] C. Baier, M. Gr¨ oßer, M. Leucker, B. Bollig, and F. Ciesinski, “Controller synthesis for probabilistic systems,” in Exploring New Frontiers of Theoretical Informatics, pp. 493– 506, Springer, 2004. [49] X. Zhang, B. Wu, and H. Lin, “Learning based supervisor synthesis of pomdp for pctl specifications,” in Decision and Control (CDC), 2015 IEEE 54th Annual Conference on, pp. 7470–7475, IEEE, 2015. 129 [50] A. Legay, B. Delahaye, and S. Bensalem, “Statistical model checking: An overview,” in International conference on Runtime Verification, pp. 122–135, Springer, 2010. [51] E. M. Clarke and P. Zuliani, “Statistical model checking for cyber-physical systems,” in International Symposium on Automated Technology for Verification and Analysis, pp. 1–12, Springer, 2011. [52] Z. Kong, A. Jones, A. Medina Ayala, E. Aydin Gol, and C. 
Belta, “Temporal logic inference for classification and prediction from data,” in Proceedings of the 17th inter- national conference on Hybrid systems: computation and control, pp. 273–282, ACM, 2014. [53] A. Jones, Z. Kong, and C. Belta, “Anomaly detection in cyber-physical systems: A formal methods approach,” in 53rd IEEE Conference on Decision and Control, pp. 848– 853, IEEE, 2014. [54] H. Edelsbrunner and J. Harer, “Persistent homology—a survey,” Discrete & Compu- tational Geometry - DCG, vol. 453, 01 2008. [55] H. Edelsbrunner and D. Morozov, “Persistent homology: theory and practice,” Euro- pean Congress of Mathematics, pp. 31–50, 01 2014. [56] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, “Stability of persistence diagrams,” Discrete & Computational Geometry, vol. 37, no. 1, pp. 103–120, 2007. [57] K. Turner, S. Mukherjee, and D. M. Boyer, “Persistent homology transform for mod- eling shapes and surfaces,” Information and Inference: A Journal of the IMA, vol. 3, pp. 310–344, Dec 2014. [58] G. Kusano, K. Fukumizu, and Y. Hiraoka, “Persistence weighted gaussian kernel for topological data analysis,” in Proceedings of the 33rd International Conference on In- ternational Conference on Machine Learning - Volume 48, ICML’16, p. 2004–2013, JMLR.org, 2016. [59] M. Carri` ere, M. Cuturi, and S. Oudot, “Sliced wasserstein kernel for persistence di- agrams,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, p. 664–673, JMLR.org, 2017. [60] P. Bubenik, “Statistical topological data analysis using persistence landscapes,” Jour- nal of Machine Learning Research, vol. 16, no. 3, pp. 77–102, 2015. [61] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt, “A stable multi-scale kernel for topological machine learning,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4741–4748, 2015. [62] T. Le and M. 
Yamada, “Persistence fisher kernel: A riemannian manifold kernel for per- sistence diagrams,” in Advances in Neural Information Processing Systems 31 (S. Ben- gio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), pp. 10007–10018, Curran Associates, Inc., 2018. 130 [63] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepush- tanova, E. Hanson, F. Motta, and L. Ziegelmeier, “Persistence images: A stable vector representation of persistent homology,” Journal of Machine Learning Research, vol. 18, no. 8, pp. 1–35, 2017. [64] F. Chazal, B. T. Fasy, F. Lecci, A. Rinaldo, and L. Wasserman, “Stochastic con- vergence of persistence landscapes and silhouettes,” in Proceedings of the Thirtieth Annual Symposium on Computational Geometry, SOCG’14, (New York, NY, USA), p. 474–483, Association for Computing Machinery, 2014. [65] B. Di Fabio and M. Ferri, “Comparing persistence diagrams through complex vectors,” in Image Analysis and Processing — ICIAP 2015 (V. Murino and E. Puppo, eds.), (Cham), pp. 294–305, Springer International Publishing, 2015. [66] A. Adcock, E. Carlsson, and G. Carlsson, “The ring of algebraic functions on persis- tence bar codes,” Homology, Homotopy and Applications, vol. 18, 04 2013. [67] S. Kalisnik, “Tropical coordinates on the space of persistence barcodes,” Foundations of Computational Mathematics, vol. 19, no. 1, pp. 101–129, 2019. [68] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl, “Deep learning with topological signatures,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, (Red Hook, NY, USA), p. 1633–1643, Curran Associates Inc., 2017. [69] C. D. Hofer, R. Kwitt, and M. Niethammer, “Learning representations of persistence barcodes,” Journal of Machine Learning Research, vol. 20, no. 126, pp. 1–45, 2019. [70] M. Carri` ere, F. Chazal, Y. Ike, T. Lacombe, M. Royer, and Y. 
Umeda, “Perslay: A neural network layer for persistence diagrams and new graph topological signatures,” in The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy] (S. Chiappa and R. Calandra, eds.), vol. 108 of Proceedings of Machine Learning Research, pp. 2786–2796, PMLR, 2020. [71] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geomet- ric deep learning: Going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017. [72] M. Nickel and D. Kiela, “Poincar´ e embeddings for learning hierarchical representa- tions,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 6338–6347, Curran Associates, Inc., 2017. [73] M. Nickel and D. Kiela, “Learning continuous hierarchies in the lorentz model of hyperbolic geometry,” in Proceedings of the 35th International Conference on Ma- chine Learning, ICML 2018, Stockholmsm¨ assan, Stockholm, Sweden, July 10-15, 2018 (J. G. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, pp. 3776–3785, PMLR, 2018. 131 [74] F. Sala, C. De Sa, A. Gu, and C. Re, “Representation tradeoffs for hyperbolic em- beddings,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, (Stockholmsm¨ assan, Stockholm Sweden), pp. 4460–4469, PMLR, 10–15 Jul 2018. [75] O. Ganea, G. Becigneul, and T. Hofmann, “Hyperbolic neural networks,” in Advances in Neural Information Processing Systems 31 (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), pp. 5345–5355, Curran Asso- ciates, Inc., 2018. [76] C. Gulcehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. Battaglia, V. Bapst, D. Raposo, A. Santoro, and N. 
de Freitas, “Hyperbolic attention networks,” in International Conference on Learning Representations, 2019.
[77] Q. Liu, M. Nickel, and D. Kiela, “Hyperbolic graph neural networks,” in Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 8230–8241, Curran Associates, Inc., 2019.
[78] I. Chami, Z. Ying, C. Ré, and J. Leskovec, “Hyperbolic graph convolutional neural networks,” in Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 4868–4879, Curran Associates, Inc., 2019.
[79] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, (New York, NY, USA), pp. 1365–1374, Association for Computing Machinery, 2015.
[80] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. 77, pp. 2539–2561, 2011.
[81] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 2014–2023, JMLR.org, 2016.
[82] Q. Zhao and Y. Wang, “Learning metrics for persistence-based summaries and applications for graph classification,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada (H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, eds.), pp. 9855–9866, 2019.
[83] S. Ivanov and E. Burnaev, “Anonymous walk embeddings,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol.
80 of Proceedings of Machine Learning Research, (Stockholmsmässan, Stockholm, Sweden), pp. 2186–2195, PMLR, 10–15 Jul 2018.
[84] C. Hofer, F. Graf, B. Rieck, M. Niethammer, and R. Kwitt, “Graph filtration learning,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, (Virtual), pp. 4314–4323, PMLR, 13–18 Jul 2020.
[85] J.-D. Haynes and G. Rees, “Decoding mental states from brain activity in humans,” Nature Reviews Neuroscience, vol. 7, pp. 523–534, Jul 2006.
[86] K. N. Kay, T. Naselaris, R. J. Prenger, and J. L. Gallant, “Identifying natural images from human brain activity,” Nature, vol. 452, pp. 352–355, Mar 2008.
[87] Y. Miyawaki, H. Uchida, O. Yamashita, M.-A. Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani, “Visual image reconstruction from human brain activity using a combination of multiscale local image decoders,” Neuron, vol. 60, pp. 915–929, Dec. 2008.
[88] T. Yoshida and K. Ohki, “Natural images are reliably represented by sparse and variable populations of neurons in visual cortex,” Nature Communications, vol. 11, p. 872, Feb 2020.
[89] T. Horikawa and Y. Kamitani, “Generic decoding of seen and imagined objects using hierarchical visual features,” Nature Communications, vol. 8, p. 15037, May 2017.
[90] G. Shen, T. Horikawa, K. Majima, and Y. Kamitani, “Deep image reconstruction from human brain activity,” PLOS Computational Biology, vol. 15, pp. 1–23, 01 2019.
[91] R. Beliy, G. Gaziv, A. Hoogi, F. Strappini, T. Golan, and M. Irani, “From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI,” 2019.
[92] G. Gaziv, R. Beliy, N. Granot, A. Hoogi, F. Strappini, T. Golan, and M. Irani, “Self-supervised natural image reconstruction and rich semantic classification from brain activity,” bioRxiv, 2020.
[93] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O.
Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” CoRR, vol. abs/1801.03924, 2018.
[94] G. St-Yves and T. Naselaris, “Generative adversarial networks conditioned on brain activity reconstruct seen images,” in 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1054–1061, 2018.
[95] T. Fang, Y. Qi, and G. Pan, “Reconstructing perceptive images from brain activity by shape-semantic GAN,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 13038–13048, Curran Associates, Inc., 2020.
[96] G. Shen, K. Dwivedi, K. Majima, T. Horikawa, and Y. Kamitani, “End-to-end deep image reconstruction from human brain activity,” Frontiers in Computational Neuroscience, vol. 13, 2019.
[97] K. Seeliger, U. Güçlü, L. Ambrogioni, Y. Güçlütürk, and M. van Gerven, “Generative adversarial networks for reconstructing natural images from brain activity,” NeuroImage, vol. 181, pp. 775–785, 2018.
[98] M. Mozafari, L. Reddy, and R. VanRullen, “Reconstructing natural scenes from fMRI patterns using BigBiGAN,” CoRR, vol. abs/2001.11761, 2020.
[99] K. Qiao, J. Chen, L. Wang, C. Zhang, L. Tong, and B. Yan, “BigGAN-based Bayesian reconstruction of natural images from human brain activity,” CoRR, vol. abs/2003.06105, 2020.
[100] R. VanRullen and L. Reddy, “Reconstructing faces from fMRI patterns using deep generative neural networks,” Communications Biology, vol. 2, p. 193, May 2019.
[101] Z. Ren, J. Li, X. Xue, X. Li, F. Yang, Z. Jiao, and X. Gao, “Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning,” NeuroImage, vol. 228, p. 117602, 2021.
[102] A. B. L. Larsen, S. K. Sønderby, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” CoRR, vol. abs/1512.09300, 2015.
[103] M. A. Goodale and A. D.
Milner, “Separate visual pathways for perception and action,” Trends Neurosci., vol. 15, pp. 20–25, Jan. 1992.
[104] H. Wen, J. Shi, Y. Zhang, K.-H. Lu, J. Cao, and Z. Liu, “Neural encoding and decoding with deep learning for dynamic natural vision,” Cereb. Cortex, vol. 28, pp. 4136–4160, Dec. 2018.
[105] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, 2014.
[106] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of the 31st International Conference on Machine Learning (E. P. Xing and T. Jebara, eds.), vol. 32 of Proceedings of Machine Learning Research, (Bejing, China), pp. 1278–1286, PMLR, 22–24 Jun 2014.
[107] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2015.
[108] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[109] H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, “Recent advances in robot learning from demonstration,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, no. 1, pp. 297–330, 2020.
[110] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pp. 4950–4957, AAAI Press, 2018.
[111] S. Ross, G. J. Gordon, and J. A. Bagnell, “No-regret reductions for imitation learning and structured prediction,” CoRR, vol. abs/1011.0686, 2010.
[112] B. D.
Ziebart, Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, USA, 2010.
[113] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proc. 17th International Conf. on Machine Learning, pp. 663–670, Morgan Kaufmann, 2000.
[114] D. Ramachandran and E. Amir, “Bayesian inverse reinforcement learning,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, (San Francisco, CA, USA), pp. 2586–2591, Morgan Kaufmann Publishers Inc., 2007.
[115] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pp. 1433–1438, AAAI Press, 2008.
[116] J. V. Leeuwen, Warwick, A. R. Meyer, and M. Nival, Handbook of Theoretical Computer Science: Algorithms and Complexity. Cambridge, MA, USA: MIT Press, 1990.
[117] C. Fritz, “Constructing Büchi automata from linear temporal logic using simulation relations for alternating Büchi automata,” in CIAA, 2003.
[118] A. Donzé, “On signal temporal logic,” in Runtime Verification (A. Legay and S. Bensalem, eds.), (Berlin, Heidelberg), pp. 382–383, Springer Berlin Heidelberg, 2013.
[119] V. Raman, A. Donzé, D. Sadigh, R. M. Murray, and S. A. Seshia, “Reactive synthesis from signal temporal logic specifications,” in Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, HSCC ’15, (New York, NY, USA), pp. 239–248, Association for Computing Machinery, 2015.
[120] D. Kasenberg and M. Scheutz, “Interpretable apprenticeship learning with temporal logic specifications,” CoRR, vol. abs/1710.10532, 2017.
[121] F. Memarian, Z. Xu, B. Wu, M. Wen, and U. Topcu, “Active task-inference-guided deep inverse reinforcement learning,” in 2020 59th IEEE Conference on Decision and Control (CDC), pp. 1932–1938, 2020.
[122] Q. Gao, M. Pajic, and M. M.
Zavlanos, “Deep imitative reinforcement learning for temporal logic robot motion planning with noisy semantic observations,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 8490–8496, 2020.
[123] M. Wen, I. Papusha, and U. Topcu, “Learning from demonstrations with high-level side information,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3055–3061, 2017.
[124] W. Zhou and W. Li, “Safety-aware apprenticeship learning,” 2018.
[125] P. Kyriakis, J. V. Deshmukh, and P. Bogdan, “Specification mining and robust design under uncertainty: A stochastic temporal logic approach,” ACM Trans. Embed. Comput. Syst., vol. 18, Oct. 2019.
[126] A. G. Puranic, J. V. Deshmukh, and S. Nikolaidis, “Learning from demonstrations using signal temporal logic,” CoRR, vol. abs/2102.07730, 2021.
[127] M. Bloem and N. Bambos, “Infinite time horizon maximum causal entropy inverse reinforcement learning,” in 53rd IEEE Conference on Decision and Control, pp. 4911–4916, 2014.
[128] V. Raman, A. Donzé, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, and S. A. Seshia, “Model predictive control with signal temporal logic specifications,” in 53rd IEEE Conference on Decision and Control, pp. 81–87, 2014.
[129] G. Still, “Optimization problems with infinitely many constraints,” Buletinul ştiinţific al Universităţii Baia Mare, Seria B, Fascicola matematică-informatică, vol. 18, no. 2, pp. 343–354, 2002.
[130] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[131] Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási, “Controllability of complex networks,” Nature, vol. 473, pp. 167–173, 05 2011.
[132] C.-T. Lin, “Structural controllability,” IEEE Transactions on Automatic Control, vol. 19, pp. 201–208, June 1974.
[133] S. Pequito, S. Kar, and A. P. Aguiar, “Minimum cost input/output design for large-scale linear structural systems,” Automatica, vol. 68, pp.
384–391, 2016.
[134] W.-X. Wang, X. Ni, Y.-C. Lai, and C. Grebogi, “Optimizing controllability of complex networks by minimum structural perturbations,” Physical Review E, vol. 85, p. 026115, 02 2012.
[135] S. Pequito, A. N. Khambhati, G. J. Pappas, D. D. Šiljak, D. Bassett, and B. Litt, “Structural analysis and design of dynamic-flow networks: Implications in the brain dynamics,” in 2016 American Control Conference (ACC), pp. 5758–5764, July 2016.
[136] F. J. Mueller and A. Schuppert, “Few inputs can reprogram biological networks,” Nature, vol. 478, pp. E4–E4, 2011.
[137] C.-T. Chen, Linear System Theory and Design. New York, NY, USA: Oxford University Press, Inc., 2nd ed., 1995.
[138] G. Yan, J. Ren, Y.-C. Lai, C.-H. Lai, and B. Li, “Controlling complex networks: How much energy is needed?,” Physical Review Letters, vol. 108, 04 2012.
[139] J. Sun and A. Motter, “Controllability transition and nonlocality in network control,” Physical Review Letters, vol. 110, 05 2013.
[140] F. Pasqualetti, S. Zampieri, and F. Bullo, “Controllability metrics, limitations and algorithms for complex networks,” in 2014 American Control Conference, pp. 3287–3292, June 2014.
[141] T. Summers, F. L. Cortesi, and J. Lygeros, “On submodularity and controllability in complex dynamical networks,” IEEE Transactions on Control of Network Systems, vol. 3, 04 2014.
[142] T. Summers and I. Shames, “Convex relaxations and Gramian rank constraints for sensor and actuator selection in networks,” in 2016 IEEE International Symposium on Intelligent Control (ISIC), pp. 1–6, Sep. 2016.
[143] V. Tzoumas, M. A. Rahimian, G. J. Pappas, and A. Jadbabaie, “Minimal actuator placement with bounds on control effort,” IEEE Transactions on Control of Network Systems, vol. 3, pp. 67–78, March 2016.
[144] A. Olshevsky, “On (non)supermodularity of average control energy,” IEEE Transactions on Control of Network Systems, vol. 5, pp. 1177–1181, Sep.
2018.
[145] R. Herrmann, Fractional Calculus – An Introduction for Physicists, 3rd revised and extended ed. Singapore: World Scientific Publishing, 2018.
[146] G. Gupta, S. Pequito, and P. Bogdan, “Re-thinking EEG-based non-invasive brain interfaces: Modeling and analysis,” in Proceedings of the 9th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS ’18, (Piscataway, NJ, USA), pp. 275–286, IEEE Press, 2018.
[147] Y. Xue, S. Rodriguez, and P. Bogdan, “A spatio-temporal fractal model for a CPS approach to brain-machine-body interfaces,” in 2016 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 642–647, March 2016.
[148] J. Sia, E. Jonckheere, L. Shalalfeh, and P. Bogdan, “PMU change point detection of imminent voltage collapse and stealthy attacks,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 6812–6817, Dec 2018.
[149] Y. Xue, S. Pequito, J. R. Coelho, P. Bogdan, and G. J. Pappas, “Minimum number of sensors to ensure observability of physiological systems: A case study,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1181–1188, Sep. 2016.
[150] A. A. Bian, J. M. Buhmann, A. Krause, and S. Tschiatschek, “Guarantees for greedy maximization of non-submodular functions with applications,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 498–507, JMLR.org, 2017.
[151] T. Summers and M. Kamgarpour, “Performance guarantees for greedy maximization of non-submodular set functions in systems and control,” arXiv e-prints, vol. 1712, 2017.
[152] S. Khuller, A. Moss, and J. S. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.
[153] Z. Zhang, B. Liu, Y. Wang, D. Xu, and D. Zhang, “Greedy algorithm for maximization of non-submodular functions subject to knapsack constraint,” in Computing and Combinatorics (D.-Z.
Du, Z. Duan, and C. Tian, eds.), (Cham), pp. 651–662, Springer International Publishing, 2019.
[154] P. Erdős and A. Rényi, “On random graphs,” Publicationes Mathematicae Debrecen, vol. 6, p. 290, 1959.
[155] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[156] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
[157] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang, “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning,” in Advances in Neural Information Processing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, eds.), vol. 27, Curran Associates, Inc., 2014.
[158] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–489, Jan 2016.
[159] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, “Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis,” Machine Learning, Apr 2021.
[160] I. Y. Kim and O. L. de Weck, “Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation,” Structural and Multidisciplinary Optimization, vol. 31, pp. 105–116, Feb 2006.
[161] T. W. Athan and P. Y. Papalambros, “A note on weighted criteria methods for compromise solutions in multi-objective optimization,” Engineering Optimization, vol. 27, no. 2, pp. 155–176, 1996.
[162] P. L. Yu and G.
Leitmann, “Compromise solutions, domination structures, and Salukvadze’s solution,” Journal of Optimization Theory and Applications, vol. 13, pp. 362–378, Mar 1974.
[163] R. E. Steuer and E.-U. Choo, “An interactive weighted Tchebycheff procedure for multiple objective programming,” Mathematical Programming, vol. 26, pp. 326–344, Oct 1983.
[164] F. Waltz, “An engineering approach: Hierarchical optimization criteria,” IEEE Transactions on Automatic Control, vol. 12, no. 2, pp. 179–180, 1967.
[165] I. Das and J. E. Dennis, “A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems,” Structural Optimization, vol. 14, pp. 63–69, Aug 1997.
[166] A. Messac and A. Ismail-Yahaya, “Multiobjective robust design using physical programming,” Structural and Multidisciplinary Optimization, vol. 23, pp. 357–371, Jun 2002.
[167] J. Fliege and B. F. Svaiter, “Steepest descent methods for multicriteria optimization,” Mathematical Methods of Operations Research, vol. 51, pp. 479–494, Aug 2000.
[168] J.-A. Désidéri, “Multiple-gradient descent algorithm (MGDA) for multiobjective optimization,” Comptes Rendus Mathematique, vol. 350, no. 5, pp. 313–318, 2012.
[169] S. Schäffler, R. Schultz, and K. Weinzierl, “Stochastic method for the solution of unconstrained vector optimization problems,” Journal of Optimization Theory and Applications, vol. 114, pp. 209–222, Jul 2002.
[170] K. Van Moffaert, M. M. Drugan, and A. Nowé, “Scalarized multi-objective reinforcement learning: Novel design techniques,” in 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199, 2013.
[171] S. Natarajan and P. Tadepalli, “Dynamic preferences in multi-criteria reinforcement learning,” in Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, (New York, NY, USA), pp. 601–608, Association for Computing Machinery, 2005.
[172] A. Castelletti, G.
Corani, A.-E. Rizzoli, R. Soncini-Sessa, and E. Weber, Reinforcement learning in the operational management of a water system, pp. 325–, 01 2002.
[173] G. Tesauro, R. Das, H. Chan, J. Kephart, D. Levine, F. Rawson, and C. Lefurgy, “Managing power consumption and performance of computing systems using reinforcement learning,” in Advances in Neural Information Processing Systems (J. Platt, D. Koller, Y. Singer, and S. Roweis, eds.), vol. 20, Curran Associates, Inc., 2008.
[174] Z. Gábor, Z. Kalmár, and C. Szepesvári, “Multi-criteria reinforcement learning,” in Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, (San Francisco, CA, USA), pp. 197–205, Morgan Kaufmann Publishers Inc., 1998.
[175] S. Mannor and N. Shimkin, “The steering approach for multi-criteria reinforcement learning,” in Advances in Neural Information Processing Systems (T. Dietterich, S. Becker, and Z. Ghahramani, eds.), vol. 14, MIT Press, 2002.
[176] S. Mannor and N. Shimkin, “A geometric approach to multi-criterion reinforcement learning,” J. Mach. Learn. Res., vol. 5, pp. 325–360, Dec. 2004.
[177] H. Mossalam, Y. M. Assael, D. M. Roijers, and S. Whiteson, “Multi-objective deep reinforcement learning,” CoRR, vol. abs/1610.02707, 2016.
[178] A. Abels, D. Roijers, T. Lenaerts, A. Nowé, and D. Steckelmacher, “Dynamic weights in multi-objective deep reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning (K. Chaudhuri and R. Salakhutdinov, eds.), vol. 97 of Proceedings of Machine Learning Research, pp. 11–20, PMLR, 09–15 Jun 2019.
[179] R. Yang, X. Sun, and K. Narasimhan, “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.
[180] A. Abdolmaleki, S. Huang, L. Hasenclever, M. Neunert, F. Song, M.
Zambelli, M. Martins, N. Heess, R. Hadsell, and M. Riedmiller, “A distributional view on multi-objective policy optimization,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, pp. 11–22, PMLR, 13–18 Jul 2020.
[181] L. Barrett and S. Narayanan, “Learning all optimal policies with multiple criteria,” in Proceedings of the 25th International Conference on Machine Learning, ICML ’08, (New York, NY, USA), pp. 41–47, Association for Computing Machinery, 2008.
[182] D. J. Lizotte, M. Bowling, and S. A. Murphy, “Linear fitted-Q iteration with multiple reward functions,” Journal of Machine Learning Research, vol. 13, no. 105, pp. 3253–3295, 2012.
[183] D. J. Lizotte, M. Bowling, and S. A. Murphy, “Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, (Madison, WI, USA), pp. 695–702, Omnipress, 2010.
[184] A. Castelletti, F. Pianosi, and M. Restelli, “Tree-based fitted Q-iteration for multi-objective Markov decision problems,” in The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, 2012.
[185] A. Castelletti, F. Pianosi, and M. Restelli, “Multi-objective fitted Q-iteration: Pareto frontier approximation in one single run,” in 2011 International Conference on Networking, Sensing and Control, pp. 260–265, 2011.
[186] S. Parisi, M. Pirotta, N. Smacchia, L. Bascetta, and M. Restelli, “Policy gradient approaches for multi-objective sequential decision making,” in 2014 International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330, 2014.
[187] S. Parisi, M. Pirotta, and M. Restelli, “Multi-objective reinforcement learning through continuous Pareto manifold approximation,” J. Artif. Int. Res., vol. 57, pp. 187–227, Sept. 2016.
[188] D. M. Roijers, P. Vamplew, S.
Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,” J. Artif. Int. Res., vol. 48, pp. 67–113, Oct. 2013.
[189] O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,” in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), vol. 31, Curran Associates, Inc., 2018.
[190] H. Jiao and Y. Chen, “A global optimization algorithm for generalized quadratic programming,” Journal of Applied Mathematics, vol. 2013, p. 215312, Oct 2013.
[191] S. Iwata, “Submodular function minimization,” Mathematical Programming, vol. 112, p. 45, Jan 2007.
[192] S. Fujishige, H. Sato, and P. Zhan, “An algorithm for finding the minimum-norm point in the intersection of a convex polyhedron and a hyperplane,” Japan Journal of Industrial and Applied Mathematics, vol. 11, p. 245, Jun 1994.
[193] P. Wolfe, “Finding the nearest point in a polytope,” Mathematical Programming, vol. 11, pp. 128–149, 1976.
[194] S. Fujishige and P. Zhan, “A dual algorithm for finding the minimum-norm point in a polytope,” Journal of the Operations Research Society of Japan, vol. 33, 01 1990.
[195] K. Sekitani and Y. Yamamoto, “A recursive algorithm for finding the minimum norm point in a polytope and a pair of closest points in two polytopes,” Mathematical Programming, vol. 61, pp. 233–249, Aug 1993.
[196] J. A. De Loera, J. Haddock, and L. Rademacher, “The minimum Euclidean-norm point in a convex polytope: Wolfe’s combinatorial algorithm is exponential,” in Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, (New York, NY, USA), pp. 545–553, Association for Computing Machinery, 2018.
[197] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, pp. 229–256, May 1992.
[198] M.
Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex optimization,” in Proceedings of the 30th International Conference on Machine Learning (S. Dasgupta and D. McAllester, eds.), vol. 28 of Proceedings of Machine Learning Research, (Atlanta, Georgia, USA), pp. 427–435, PMLR, 17–19 Jun 2013.
[199] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada (H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, eds.), pp. 688–699, 2019.
[200] B. Amos and J. Z. Kolter, “OptNet: Differentiable optimization as a layer in neural networks,” in Proceedings of the 34th International Conference on Machine Learning (D. Precup and Y. W. Teh, eds.), vol. 70 of Proceedings of Machine Learning Research, pp. 136–145, PMLR, 06–11 Aug 2017.
[201] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[202] M. P. do Carmo, Riemannian Geometry. Birkhäuser, 1992.
[203] P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, “Empirical evaluation methods for multiobjective reinforcement learning algorithms,” Machine Learning, vol. 84, pp. 51–80, Jul 2011.

Appendices

A Verification and Design under Uncertainty

A.1 Timed Traces and Master Equation

Timed traces are measurements obtained from the physical system under consideration, whereas a stochastic process is a mathematical model that summarizes the timed traces. Our objective is to map the traces to the random process that describes the system. For this purpose, we introduce a process that describes the random state transitions of a system in an $n$-dimensional Hilbert space. We assume the system is at state $x(t) \in \mathbb{R}^n$ at time $t$.
Subsequently, it waits $dt$ seconds and then makes a transition to state $x(t) + dx$. The inter-event time $dt$ and the magnitude increment $dx$ are random variables drawn from the joint PDF $\lambda(x,t)$. Using these definitions, we enumerate all possible transitions that could have led to the system being at state $x$ at time $t$ and, using simple probabilistic arguments, we obtain the following generalized master equation (GME):

$$p(x,t) = \Psi(t)\,\delta(x) + \int_0^t d\tau \int_{\mathbb{R}^n} p(y,\tau)\,\lambda(x-y,\, t-\tau)\, dy \quad (1)$$

where $p(x,t)$ is the probability of finding the system at state $x$ at time $t$, $\delta(x)$ is the multivariate Dirac delta function and $\Psi(t)$ is the probability of making no transitions until time $t$. The above equation provides the desired stochastic process that describes the system. In this paper, we construct a data-driven solution to the above GME using a black-box, simulation-based method, whereby we assume access to a computational model of the system, which we can stimulate and whose output we can measure. We implement a method that relies on ensemble statistics. Using the computational model, we obtain $M$ timed traces $\{(t_0, x_0^i), \ldots, (t_n, x_n^i)\}_{i=1}^M$ of the stochastic process described by Eq. (1). Subsequently, an estimate of the time-dependent probability distribution $p(x,t)$ is obtained by calculating the empirical PDF at each time instant; namely, for each time $t_k$ we estimate the empirical PDF of the samples $\{x_k^i\}_{i=1}^M$. Naturally, as $M$ increases, so does the accuracy of the estimation. This approach makes no assumption about the system dynamics or the type of stochasticity; given enough realizations, any complicated distribution can be estimated.

B Hyperbolic Representation Learning

B.1 Homology

Homology is a general method for associating a chain of algebraic objects (e.g., abelian groups) to other mathematical objects such as topological spaces.
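The ensemble-statistics estimate of $p(x,t)$ described in Appendix A.1 above can be sketched in a few lines. This is only a hypothetical illustration: a 1-D random walk stands in for the black-box computational model, and the trace count $M$ and the histogram binning are arbitrary choices, not the thesis settings.

```python
import numpy as np

# Hypothetical sketch of the ensemble-statistics method: estimate the
# time-dependent PDF p(x, t) by histogramming M traces at each time step.
rng = np.random.default_rng(0)
M, n_steps = 500, 50

# M realizations of a 1-D random walk standing in for the black-box model.
traces = np.cumsum(rng.normal(size=(M, n_steps)), axis=1)

def empirical_pdf(samples, bins=20):
    """Normalized histogram of the ensemble {x_k^i} at one time instant."""
    counts, edges = np.histogram(samples, bins=bins, density=True)
    return counts, edges

# One empirical PDF per time instant t_k.
pdfs = [empirical_pdf(traces[:, k]) for k in range(n_steps)]

# Each estimated density integrates to one over its support.
counts, edges = pdfs[0]
print(round(float(np.sum(counts * np.diff(edges))), 6))  # → 1.0
```

Refining the estimate then amounts to increasing $M$ (more realizations) or the number of bins, exactly as the text notes.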
The construction begins with a chain group, $C_n$, whose elements are the $n$-chains, which for a given complex are formal sums of the $n$-dimensional cells. The boundary homomorphism, $\partial_n : C_n \to C_{n-1}$, maps each $n$-chain to the sum of the $(n-1)$-dimensional faces of its $n$-cells, which is an $(n-1)$-chain. Combining the chain groups and the boundary maps in a sequence, we get the chain complex:

$$\cdots \xrightarrow{\partial_{n+2}} C_{n+1} \xrightarrow{\partial_{n+1}} C_n \xrightarrow{\partial_n} \cdots \xrightarrow{\partial_2} C_1 \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0. \quad (2)$$

The kernels, $Z_n = \ker(\partial_n)$, and the images, $B_n = \mathrm{im}(\partial_{n+1})$, of the boundary homomorphisms are called cycle and boundary groups, respectively. A fundamental property of the boundary homomorphism is that its square is zero, $\partial_n \partial_{n+1} = 0$, which implies that, for every $n$, the boundaries form a subgroup of the cycles, i.e., $\mathrm{im}(\partial_{n+1}) \subseteq \ker(\partial_n)$. Therefore, we can form the quotient group

$$H_n := \ker(\partial_n)/\mathrm{im}(\partial_{n+1}) = Z_n/B_n, \quad (3)$$

which is called the $n$-th homology group and whose elements are called homology classes. The rank, $\beta_n = \mathrm{rank}(H_n)$, of this group is known as the $n$-th Betti number. Even though homology can be applied to any topological space, we focus on simplicial homology, i.e., homology groups generated by simplicial complexes. Let $K$ be a simplicial complex and $K_n$ its $n$-skeleton. The chain group $C_n(K)$ is the free abelian group whose generators are the $n$-dimensional simplexes of $K$. For a simplex $\sigma = [x_0, \ldots, x_n] \in K_n$, we define the boundary homomorphism as

$$\partial_n(\sigma) = \sum_{i=0}^{n} (-1)^i\, [x_0, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n] \quad (4)$$

and extend it linearly to $C_n(K)$, i.e., $\partial_n(\sum \sigma_i) = \sum \partial_n(\sigma_i)$.

B.2 Riemannian Manifolds

An $m$-dimensional manifold can be seen as a generalization of a 2D surface and is a space that can be locally approximated by $\mathbb{R}^m$. The space in which the manifold is embedded is called the ambient space. We assume that the ambient space is the standard Euclidean space of dimension equal to the dimension of the manifold, i.e., $\mathbb{R}^m$.
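The defining property of the simplicial boundary map of Eq. (4) in B.1 above, $\partial_{n-1} \circ \partial_n = 0$, is easy to check by direct computation. The following minimal sketch (not the thesis code) applies the signed boundary formula with integer coefficients to a single 2-simplex:

```python
from collections import defaultdict

# Minimal sketch (not the thesis code): the signed simplicial boundary map
# over integer coefficients, checked against the identity d(d(sigma)) = 0.
def boundary(chain):
    """Apply the boundary homomorphism to a formal sum of simplices.

    `chain` maps each simplex (a tuple of vertices) to an integer coefficient.
    """
    out = defaultdict(int)
    for simplex, coeff in chain.items():
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # drop vertex x_i
            out[face] += coeff * (-1) ** i        # alternating sign
    return {s: c for s, c in out.items() if c != 0}

triangle = {(0, 1, 2): 1}      # a single 2-simplex
edges = boundary(triangle)     # its three 1-dimensional faces
print(edges)                   # → {(1, 2): 1, (0, 2): -1, (0, 1): 1}
print(boundary(edges))         # → {} (the square of the boundary map is zero)
```

Every boundary cancels in the second application, which is exactly why $\mathrm{im}(\partial_{n+1}) \subseteq \ker(\partial_n)$ and the quotient in Eq. (3) is well defined.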
For $x \in \mathcal{M}$, the tangent space $T_x\mathcal{M}$ of $\mathcal{M}$ at $x$ is the linear vector space spanned by the $m$ linearly independent vectors tangent to the curves passing through $x$. A Riemannian metric $g = (g_x)_{x \in \mathcal{M}}$ on $\mathcal{M}$ is a collection of inner products $g_x : T_x\mathcal{M} \times T_x\mathcal{M} \to \mathbb{R}$ varying smoothly across $\mathcal{M}$; it allows us to define geometric notions such as angles and the length of a curve. A Riemannian manifold is a smooth manifold equipped with a Riemannian metric $g$. The Riemannian metric $g$ gives rise to the metric norm $\|\cdot\|_g$ and induces global distances by integrating the length of a shortest path between two points $x, y \in \mathcal{M}$:

$$d(x,y) = \inf_{\gamma} \int_0^1 \sqrt{g_{\gamma(t)}(\dot{\gamma}(t), \dot{\gamma}(t))}\, dt, \quad (5)$$

where $\gamma \in C^\infty([0,1], \mathcal{M})$ is such that $\gamma(0) = x$ and $\gamma(1) = y$. The unique smooth path $\gamma$ of minimal length between two points $x$ and $y$ is called a geodesic and the corresponding shortest path length is called the geodesic distance. Geodesics can be seen as the generalization of straight lines. For a point $x \in \mathcal{M}$ and a vector $v \in T_x\mathcal{M}$, let $\gamma_v$ be the geodesic starting from $x$ with an initial tangent vector equal to $v$, i.e., $\gamma_v(0) = x$ and $\dot{\gamma}_v(0) = v$; the exponential map is then defined as $\exp_x(v) = \gamma_v(1)$. The existence, uniqueness and differentiability of the exponential map are guaranteed by the classical existence and uniqueness theorem of ordinary differential equations and Gauss's lemma of Riemannian geometry [202].

B.3 Poincaré Ball

We define the Möbius addition for any $x, y$ on the Poincaré ball:

$$x \oplus y = \frac{(1 + 2\langle x,y\rangle + \|y\|^2)\,x + (1 - \|x\|^2)\,y}{1 + 2\langle x,y\rangle + \|x\|^2\|y\|^2} \quad (6)$$

The Möbius addition is a manifold-preserving operator that allows us to add points belonging to the Poincaré ball without leaving the space. Finally, we provide the coordinate chart $\psi : \mathcal{M} \to \mathbb{R}^m$, i.e., the inverse of the parameterization in Eq. 3.8, which is defined as follows:

$$x_i = x_1 \prod_{j=2}^{i-1} \sin(y_j) \cos(y_i), \ \text{for } i = 1, \ldots, m-1, \quad \text{and} \quad x_m = x_1 \prod_{j=2}^{m-1} \sin(y_j) \sin(y_m). \quad (7)$$

Note that we have omitted the learnable parameters. The coordinate chart is not explicitly needed for constructing the representation.
However, the coordinate chart could be useful as an intermediate layer between our representation and a standard deep neural network architecture. Its purpose would be to transfer the representation from the manifold to a Euclidean space, which can then be fed into a standard DNN without having to define its operations on the manifold.

B.4 Theoretical Result

Proof of Theorem 3.1. Let $D, E$ be two persistence diagrams and $\eta : D \to E$ a bijection that achieves the infimum in Eq. 3.5. Consider the subset $D_0 = D \setminus \mathbb{R}_\Delta$, which is essentially the original diagram without the points on the diagonal. Then, we have the following sequence of inequalities:
$$d_B\big(\Phi(D,\theta), \Phi(E,\theta)\big) \tag{8}$$
$$= d_B\left(\exp_{x_0}\Big(\sum_{x\in D} \log_{x_0} \phi(\rho(x))\Big),\ \exp_{x_0}\Big(\sum_{x\in E} \log_{x_0} \phi(\rho(x))\Big)\right) \tag{9}$$
$$\le K_e \Big\|\sum_{x\in D} \log_{x_0}\big(\phi(\rho(x))\big) - \sum_{x\in E} \log_{x_0}\big(\phi(\rho(x))\big)\Big\|_g \tag{10}$$
$$\le K_e \Big\|\sum_{x\in D_0} \log_{x_0}\big(\phi(\rho(x))\big) - \sum_{x\in D_0} \log_{x_0}\big(\phi(\rho(\eta(x)))\big)\Big\|_g \tag{11}$$
$$\le K_e \sum_{x\in D_0} \Big\|\log_{x_0}\big(\phi(\rho(x))\big) - \log_{x_0}\big(\phi(\rho(\eta(x)))\big)\Big\|_g \tag{12}$$
$$\le K_e K_l \sum_{x\in D_0} \big\|\phi(\rho(x)) - \phi(\rho(\eta(x)))\big\|_g \le K_e K_l K_\phi \sum_{x\in D_0} \big\|\rho(x) - \rho(\eta(x))\big\|_g \tag{13}$$
$$\le K_e K_l K_\phi K_\rho \sum_{x\in D_0} \|x - \eta(x)\|_g \le K_e K_l K_\phi K_\rho\, w_1^g(D_0, E), \tag{14}$$
where $w_1^g$ is the Wasserstein distance with the $q$-norm replaced by the norm induced by the metric tensor of the Poincaré ball. Eq. 10 follows from the smoothness of the exponential map (assuming that it has Lipschitz constant equal to $K_e$). Eq. 11 follows by substituting the bijection $\eta$ and because, as per the requirements of Theorem 3.1, the auxiliary transformation $\rho$ is zero on the diagonal $\mathbb{R}_\Delta$ and $x_0 = 0$. Note that the requirements of Theorem 3.1, combined with Eqs. 3.8, 3.10 and 6, imply that the summation over $x \in \mathbb{R}_\Delta$ collapses to zero. Eq. 12 follows from the triangle inequality. Eq. 13 follows from the smoothness of the logarithmic map and the parameterization (assuming Lipschitz constants $K_l$ and $K_\phi$, respectively). Finally, Eq. 14 follows from the assumed, as per Theorem 3.1, smoothness of $\rho$ and the definition of the Wasserstein distance. This concludes the proof.
B.5 Implementation Details

Auxiliary Transformation: Even though several choices for the auxiliary transformation $\rho : \mathbb{R}^2_* \to \mathbb{R}^m$ defined in Eq. 3.7 are possible, we present the one used in our implementation. Our choice for $\rho$ is based on the exponential structure elements introduced by Hofer et al. [68]. Formally, for $i = 1, 2, \ldots, m$, we define the following:
$$\rho_i(x_1, x_2) = \begin{cases} e^{-(x_1 - \mu_{1,i})^2 - (\ln(x_1 - x_2) - \mu_{2,i})^2}, & x_1 \ne x_2 \\ 0, & x_1 = x_2. \end{cases} \tag{15}$$
Then, the auxiliary transformation is $\rho : (x_1, x_2) \to (\rho_i(x_1, x_2))_{i=1}^m$, where $(\mu_{1,i}, \mu_{2,i})_{i=1}^m$ are the means. In contrast to Hofer et al. [68], those means are not learned; we rather fix them to pre-defined values and keep them constant during training. Additionally, we can easily prove that the above-defined $\rho$ satisfies the stability conditions given by Theorem 3.1, i.e., it is zero on the diagonal and Lipschitz continuous. Finally, under the assumption that all means are unique, it is injective on $\mathbb{R}^2_*$ and, therefore, introduces no information loss.

Network Architecture: We implemented all algorithms in TensorFlow 2.2, using the TDA-Toolkit (https://github.com/giotto-ai/giotto-tda) and Scikit-TDA (https://github.com/scikit-tda/scikit-tda) for extracting persistence diagrams, and ran all experiments on the Google Cloud AI Platform. To show the versatility of our approach, we used the same network architecture across all of our simulations, with our proposed representation acting as the input. In more detail, for each individual filtration that we extract from the data (images or graphs), we create a different input layer. The input to this layer is the persistence diagrams of all homology dimensions given by the corresponding filtration. Given the nature of the data (graphs and images), the homology dimensions with non-trivial persistence diagrams are only the first two (i.e., $H_0$ and $H_1$).
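The auxiliary transformation of Eq. (15) above amounts to only a few lines of NumPy. In this sketch the means are illustrative placeholders, not the values used in our experiments:

```python
import numpy as np

def rho(point, mu):
    """Exponential structure elements of Eq. (15) with fixed means mu (m x 2).

    Vanishes on the diagonal x1 == x2, as required by Theorem 3.1; off the
    diagonal it assumes x1 > x2 so that ln(x1 - x2) is defined."""
    x1, x2 = point
    if x1 == x2:
        return np.zeros(len(mu))
    return np.exp(-(x1 - mu[:, 0]) ** 2 - (np.log(x1 - x2) - mu[:, 1]) ** 2)

mu = np.array([[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]])  # illustrative means, m = 3
on_diag = rho((0.7, 0.7), mu)    # diagonal point -> zero vector
off_diag = rho((0.9, 0.2), mu)   # off-diagonal -> strictly positive entries
```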
Our input layer processes the persistence diagrams of each homology class independently and outputs the representations (i.e., Eq. 3.11) for each one of them. Following that, we concatenate and flatten the previous outputs and feed the resulting vector to a batch normalization layer. Then, we use two dense layers of 256 and 128 neurons, respectively, with a Rectified Linear Unit (ReLU) activation function. This is followed by a dropout layer to prevent overfitting and a final dense layer as an output. We consider two different implementations of the above architecture. The first one is based on the hyperbolic neural networks introduced by Ganea et al. [75]. It re-defines key neural network operations on the Poincaré ball and, therefore, allows us to directly utilize the output of our representation in the neural network. The second implementation uses the coordinate chart given by Eq. 7 to project the representation back to a Euclidean space and feeds it to a standard implementation of the aforementioned layers. This method is more convenient in practice, as it maintains compatibility with existing backbone networks trained in Euclidean geometry.

Hyperparameters and Training: To train the neural network, we use the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with an initial learning rate of 0.001 and a batch size equal to 64. We use a random uniform initializer in the interval $[-0.05, 0.05]$ for all learnable variables. For all graph datasets, we trained the network for 100 epochs and halved the learning rate every 25 epochs. For the MNIST and Fashion-MNIST datasets, we used 10 and 20 epochs, respectively, and no learning rate scheduler. We tune the dropout rate manually by starting from low values and monitoring the validation set accuracy. In general, we noticed that the network did not tend to overfit; therefore, we kept the rate at low values (0-0.2) for all experiments.
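For reference, the dense head described above amounts to the following forward pass, written here in pure NumPy and in inference mode (dropout is the identity). The input width of 64 and batch size of 8 are hypothetical placeholders for the flattened, concatenated per-homology-class representations:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Per-feature normalization over the batch dimension.
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def dense(x, w, b):
    return x @ w + b

x = rng.normal(size=(8, 64))                 # a batch of flattened representations
w1, b1 = 0.05 * rng.normal(size=(64, 256)), np.zeros(256)
w2, b2 = 0.05 * rng.normal(size=(256, 128)), np.zeros(128)
w3, b3 = 0.05 * rng.normal(size=(128, 10)), np.zeros(10)

h = batch_norm(x)
h = relu(dense(h, w1, b1))       # dense 256, ReLU
h = relu(dense(h, w2, b2))       # dense 128, ReLU
logits = dense(h, w3, b3)        # output layer (dropout skipped at inference)
```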
C Multi-Objective Reinforcement Learning

In the first part of this Appendix, we give the proof omitted in Section 7.4, as well as the analytical solution of Problem $P_A$ for the special case of 2-dimensional reward signals. For ease of reference, we re-state the underlying theorem.

C.1 Theoretical Result

Theorem 3. Let $l \overset{\text{def}}{=} l(\omega^*(\theta))$ be a generic loss function that depends explicitly on the optimal solution $\omega^*(\theta)$ of Problem $P_A$. Then, the gradient of $l$ with respect to the policy parameters is
$$\nabla_\theta l = \begin{bmatrix} H^T(\theta) & 0 & 0 \end{bmatrix} \begin{bmatrix} G(\theta) & -I_M & e \\ -D(\mu^*(\theta)) & -D(\omega^*(\theta)) & 0 \\ e^T & 0 & 0 \end{bmatrix}^{-T} \begin{bmatrix} \frac{dl}{d\omega^*} \\ 0 \\ 0 \end{bmatrix}, \tag{16}$$
where $[G(\theta)]_{m,n} = 2 g_m^T(\theta) g_n(\theta)$, $[H(\theta)]_{m,\cdot} = -2\sum_n \omega_n^*(\theta)\, \nabla_\theta^T\big(g_n^T(\theta) g_m(\theta)\big)$, $e$ and $0$ are the all-ones and zero vectors of appropriate dimensions, $I_M$ is the identity matrix of dimension $M$, and $D(\cdot)$ creates a diagonal matrix from a vector.

Proof. We start by writing the Lagrangian of Problem $P_A$:
$$L(\omega, \lambda, \mu) = \Big\|\sum_{m=1}^M \omega_m g_m(\theta)\Big\|^2 + \lambda\Big(\sum_{m=1}^M \omega_m - 1\Big) - \sum_{m=1}^M \mu_m \omega_m, \tag{17}$$
where $g_m(\theta) = \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_m(s,a)]$. Let $(\omega^*(\theta), \lambda^*(\theta), \mu^*(\theta))$ be the policy-dependent optimal primal and dual variables. The KKT conditions for stationarity, primal feasibility and complementary slackness are
$$\frac{\partial L}{\partial \omega_m^*(\theta)} = 2\sum_{n=1}^M \omega_n^*(\theta)\, g_n^T(\theta) g_m(\theta) + \lambda^*(\theta) - \mu_m^*(\theta) = 0, \quad m = 1, 2, \ldots, M, \tag{18a}$$
$$\frac{\partial L}{\partial \lambda^*(\theta)} = \sum_{m=1}^M \omega_m^*(\theta) - 1 = 0, \tag{18b}$$
$$\mu_m^*(\theta)\, \omega_m^*(\theta) = 0, \quad m = 1, 2, \ldots, M. \tag{18c}$$
By taking the gradient of the above equations with respect to the policy parameters, we obtain
$$2\sum_{n=1}^M \nabla_\theta \omega_n^*(\theta)\, g_n^T(\theta) g_m(\theta) + 2\sum_{n=1}^M \omega_n^*(\theta)\, \nabla_\theta\big(g_n^T(\theta) g_m(\theta)\big) + \nabla_\theta \lambda^*(\theta) - \nabla_\theta \mu_m^*(\theta) = 0, \quad m = 1, 2, \ldots, M, \tag{19a}$$
$$\sum_{m=1}^M \nabla_\theta \omega_m^*(\theta) = 0, \tag{19b}$$
$$\nabla_\theta \mu_m^*(\theta)\, \omega_m^*(\theta) + \mu_m^*(\theta)\, \nabla_\theta \omega_m^*(\theta) = 0, \quad m = 1, 2, \ldots, M. \tag{19c}$$
Using matrix notation, we can rewrite the above system of equations as follows:
$$\begin{bmatrix} G(\theta) & -I_M & e \\ -D(\mu^*(\theta)) & -D(\omega^*(\theta)) & 0 \\ e^T & 0 & 0 \end{bmatrix} \begin{bmatrix} \nabla_\theta^T \omega^*(\theta) \\ \nabla_\theta^T \mu^*(\theta) \\ \nabla_\theta^T \lambda^*(\theta) \end{bmatrix} = \begin{bmatrix} H(\theta) \\ 0 \\ 0 \end{bmatrix}. \tag{20}$$
By inverting Eq.
(20) and using the chain rule $\nabla_\theta l(\omega^*(\theta)) = \big(\nabla_\theta^T \omega^*(\theta)\big)^T \frac{\partial l}{\partial \omega^*}$, we obtain the claimed result.

We observe that the solution of Eq. (20) requires knowledge of the optimal dual variable $\mu^*(\theta)$, which is not directly available when we use the projected gradient descent solver of Algorithm 6; that solver only gives us the optimal primal variable $\omega^*(\theta)$. However, we can substitute that optimal $\omega^*(\theta)$ into the KKT conditions given by Eqs. (18a)-(18c) and solve the resulting linear system of $M+1$ variables to find the optimal dual variables. The number of variables in that system depends only on the dimension of the reward signal $M$ and not on the dimension of $\theta$; therefore, it adds minimal computational overhead. This step would not be necessary if we chose a primal-dual method for solving Problem $P_A$. However, primal-dual methods may add to the complexity of the implementation, which may not be worthwhile given the simplicity of Problem $P_A$. Exploring such a trade-off is an interesting direction for future work.

C.2 Pareto Stationarity: 2-dimensional case

In general, the problem of finding common ascent directions (Problem $P_A$) can be solved either exactly, using methods from the minimum norm points literature (as discussed in Sec. 7.3), or approximately, using convex optimization methods. Due to the dimensionality of the gradient vector, the former class of methods is not applicable in our case. Therefore, we resorted to the projected gradient descent solver presented in Algorithm 6. This solver is essentially an internal optimizer that we run before every parameter update. Luckily, Problem $P_A$ is convex, and the gradient descent optimizer converges rapidly in practice with little tuning. Additionally, for the special case of two objectives, we can leverage the analytical solution to achieve further performance improvements. We re-state Problem $P_A$ for convenience below:
$$\min_{\omega_1, \ldots, \omega_M} \Big\|\sum_{m=1}^M \omega_m \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_m(s,a)]\Big\|_2^2 \quad \text{s.t.} \quad \sum_{m=1}^M \omega_m = 1, \quad \omega_m \ge 0\ \ \forall m. \tag{$P_A$}$$
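A minimal sketch of such an inner solver is given below: projected gradient descent over the probability simplex, written in terms of the Gram matrix of the gradients, together with the two-objective closed form of Eq. (21). This is our own reconstruction for illustration, not the actual Algorithm 6, and all names are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = (css[k] - 1) / (k + 1)
    return np.maximum(v - tau, 0.0)

def min_norm_weights(G, steps=500, lr=0.05):
    """Projected gradient descent on omega -> omega^T G omega,
    where G[m, n] = g_m^T g_n is the Gram matrix of the gradients."""
    w = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(steps):
        w = project_simplex(w - lr * 2.0 * G @ w)
    return w

def two_objective_weight(g1, g2):
    """Closed-form minimizer for M = 2 (Eq. (21)), clipped to [0, 1]."""
    d = g2 - g1
    return np.clip(d @ g2 / (d @ d), 0.0, 1.0)

g1, g2 = np.array([1.0, 0.0]), np.array([0.2, 0.8])
G = np.array([[g1 @ g1, g1 @ g2], [g2 @ g1, g2 @ g2]])
w_pgd = min_norm_weights(G)              # iterative solution
w_star = two_objective_weight(g1, g2)    # analytical solution; the two agree
```

The factor of 2 in the theorem's definition of $G$ only rescales the objective and does not change the minimizer, so the plain Gram matrix is used here.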
Figure 1: OpenAI Gym environments used in our experimental results (MOGW environment, DST environment reward map, SuperMarioBros2, Ant-v3, HalfCheetah-v3, Hopper-v3, Humanoid-v3).

For $M = 2$, the problem reduces to
$$\min_{\omega \in [0,1]} \big\|\omega\, \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_1(s,a)] + (1 - \omega)\, \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_2(s,a)]\big\|_2^2,$$
which is a one-dimensional quadratic optimization problem in a single variable and has the following analytical solution:
$$\omega^* = \left[\frac{\big(\nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_2(s,a)] - \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_1(s,a)]\big)^T \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_2(s,a)]}{\big\|\nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_2(s,a)] - \nabla_\theta \mathbb{E}_{d^{\pi_\theta}}[r_1(s,a)]\big\|_2^2}\right]_+, \tag{21}$$
where $[\cdot]_+$ denotes the clipping operator, i.e., $[x]_+ = \max(\min(x, 1), 0)$. Geometrically, Eq. (21) implies that the minimal-norm point either lies on the boundary of the convex hull ($\omega = 0$ or $\omega = 1$), in which case the desired ascent direction is parallel to the corresponding gradient, or it is the vector perpendicular to the difference of the gradients. The former is a corner case and occurs when the ascent direction for one of the objectives is a non-decreasing direction for the other one. While this solution is only applicable to $M = 2$, it enables us to replace the inner projected gradient descent optimizer in Algorithm 6 with the analytical expression in Eq. (21). Additionally, this special case covers several real-world problems, since many of them have a reward-cost structure.

C.3 Experimental Details

Environments: Our implementation uses modified versions of 4 OpenAI Gym environments (Fig. 1). The first two are synthetic domains and act as a proof-of-concept for our method, while the other two are complex, real domains. The details are listed below:

Multi-Objective GridWorld (MOGW): A multi-objective variant of the classic gridworld environment. The agent starts from the top-left corner, moves in a $16\times 16$ grid and receives a reward equal to 1 when it reaches the bottom-right corner. Each action results in a small time penalty of $-0.1$.
Any policy that results in a trajectory minimizing the taxicab distance between the agent's starting position and the goal is Pareto optimal. Our implementation uses the gym-minigrid environment (https://github.com/maximecb/gym-minigrid), which we modify to output the aforementioned 2-dimensional reward.

Deep Sea Treasure (DST): A classic multi-objective reinforcement learning environment. The agent is located in a $10\times 11$ grid and controls a submarine searching for treasures. There are multiple treasures, whose value increases with their distance. The agent receives a time penalty of $-0.1$ for each action. The Pareto front of non-dominated policies can be easily extracted [203] and is also shown in Fig. 7.1. As in the MOGW case, we use the gym-minigrid package and extend it with the reward map shown in Fig. 1.

Multi-Objective SuperMario (MOSM): We use a modified, multi-objective variant of the popular video game. There are 5 scalar rewards: x-pos: the difference in Mario's horizontal position between the current and last time point; time: a small time penalty of $-0.1$; deaths: a large negative penalty of $-100$ each time Mario dies; coin: rewards for collecting coins; and enemy: rewards for eliminating an enemy. We use the gym-super-mario-bros package (https://github.com/Kautenja/gym-super-mario-bros) and the multi-objective extension developed in [179].

Multi-Objective MuJoCo (MOMU): We focus on locomotion tasks and modify the environments provided by the MuJoCo physics simulator to output vector rewards. To do so, we create a custom wrapper for the built-in OpenAI Gym interface. There are 2 scalar rewards: x-pos: the difference between the robot's current and last horizontal position; r-pos: a positive reward for moving in circular trajectories.

Implementation Details: We implement our algorithm in PyTorch. Our implementation extends the Proximal Policy Optimization (PPO) algorithm to handle our PPA loss function (Eq. (7.8)).
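The wrapper pattern used for the vector rewards above is simple to sketch. The snippet below shows the general idea with a stand-in environment rather than the actual gym/MuJoCo classes; the reward components and names are hypothetical, and a real implementation would subclass gym.Wrapper:

```python
class VectorRewardWrapper:
    """Intercepts step() and replaces the scalar reward with a reward vector.

    We avoid the gym dependency here to keep the sketch self-contained."""

    TIME_PENALTY = -0.1  # per-step time penalty, as in MOGW/DST

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Hypothetical 2-dimensional reward: [task reward, time penalty].
        return obs, [reward, self.TIME_PENALTY], done, info

class DummyEnv:
    """Stand-in for an OpenAI Gym environment."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, 1.0, False, {}

env = VectorRewardWrapper(DummyEnv())
_, reward, _, _ = env.step(0)   # reward is now a 2-dimensional vector
```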
We use the clipped version of PPO with the clip ratio threshold set to 0.2. We set the GAE parameter to $\lambda = 0.95$ and the discount factor to $\gamma = 0.99$.

Network Architecture: We use a consistent network architecture across all our simulations for both the actor and the critic: two hidden layers of 64 neurons each with a ReLU non-linearity. In the discrete action case (MOGW, DST, MOSM), we append an output layer of dimension equal to the number of actions to the actor network. In the MuJoCo environments, where the actions are continuous, we use that output layer to parametrize a multi-dimensional Gaussian distribution, via its mean and variance, assuming uncorrelated actions. Each non-terminal layer has a ReLU non-linearity. Even though the hyperbolic tangent is known to be preferred for reinforcement learning applications, we experimentally observed that it leads to vanishing gradients and slows down learning. This may be due to the fact that the PPA loss function requires second-order differentiation. In the SuperMario environment, we worked in the pixel domain and used two convolutional layers with 32 and 64 filters, respectively, as feature extractors, which were shared between the actor and the critic.

Hyperparameters and Training: To train the neural network, we use the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with a learning rate of 0.0001. In the MOGW, DST and MOMU domains, we optimize the parameters of the actor and critic separately. In the MOSM domain, given that the actor and critic share the convolutional layers, we sum their losses (using weights 1 and 0.5, respectively) and use a common optimizer. We run all of our simulations on the Google Cloud Platform using 48 vCores and one NVIDIA Tesla T4 GPU. For the MOGW and DST environments, we train the agent for 0.5M frames, and for the MOSM and MOMU, for 1M frames. Every 512 frames we perform one update of the network parameters, iterating over 10 epochs of the collected data and using a mini-batch size of 64.
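For completeness, generalized advantage estimation with the $\gamma$ and $\lambda$ above reduces to a standard backward recursion over the temporal-difference residuals. This is our own sketch, not the thesis code:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    `values` carries one extra bootstrap entry V(s_T) at the end;
    adv[t] = sum_k (gamma * lam)^k * delta[t + k]."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# With zero values, each delta equals the reward and the recursion is explicit.
adv = gae([1.0, 1.0], [0.0, 0.0, 0.0])
```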