Context-Adaptive Expandable-Compact POMDPs for Engineering Complex Systems

by

Parisa Pouya

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ASTRONAUTICAL ENGINEERING)

December 2022

Copyright 2022 Parisa Pouya

Dedication

To
My parents Nader and Rahimeh
My sister Mahsa
My supporters Firouz and Sue
And my best friend and better half Aslan ...

Acknowledgements

I would like to express my sincere gratitude and appreciation to each and every one who so generously contributed to the work presented in this thesis.

First and foremost, I would like to especially thank my advisor, Professor Azad M. Madni, for the continuous support and guidance during my Ph.D. studies and research, and for his patience, enthusiasm, and motivation. He generously offered to help me during my most desperate times when I was looking for a Ph.D. position. The day that he kindly invited me to his office and interviewed me changed my life forever. I feel extremely blessed to have had the opportunity of working as part of his research team in the Astronautical Engineering department and to have him as my advisor and mentor.

Besides my Ph.D. advisor, my profound gratitude goes to Professor Daniel Erwin for his enormous support and guidance throughout my entire Ph.D. journey. Also, I would like to thank Professor James Moore for his enlightening feedback and advice.

I would like to acknowledge my colleagues and friends in the Astronautical Engineering department and the Systems Architecting and Engineering Program. Special thanks to Dr. Ayesha Madni, Dr. Michael Sievers, Kenneth (Ken) Cureton, Ma Radela (Dell) Cuason, Linda Ly, Luis Saballos, Marlyn Lat, Edwin Ordoukhanian, and Shatad Purohit for the fun, hours-long discussions, and friendship over the years.

Last but not least, I would like to express my deepest gratitude to my mother Rahimeh Piradl, my father Nader Pouya, and my sister Mahsa Pouya for their tremendous support and encouragement at every step of my life. Also, special thanks to Professor Firouz Pouya and Sue Pouya for their generosity and support. Finally, I would like to express my greatest appreciation and thanks to Dr. Aslan Etminan for his emotional support, sacrifice, patience, and guidance. Without their unconditional love and support, I would not have gone this far in pursuing my dreams and exploring the beauty of life.

Table of Contents

Dedication
Acknowledgements
Abstract
Chapter 1: Introduction
  1.1 Background and Motivation
  1.2 Overview of Previous Work
    1.2.1 POMDP Modeling and Formulation
    1.2.2 Policy Estimation
    1.2.3 Adaptation and Refinement in POMDPs
  1.3 Research Objectives and Hypothesis
  1.4 Methodological Approach
  1.5 Thesis Overview
Chapter 2: POMDP Modeling and Definition
  2.1 POMDP Theory and Mathematical Formulation
  2.2 Existing POMDP Formulations and Applications
  2.3 Expandable-Compact POMDPs
  2.4 Developing Scalable, Adaptive POMDPs for Safety-Critical Applications of AVs
  2.5 Summary and Conclusion
Chapter 3: Policy Estimation in POMDPs
  3.1 Theory and Mathematical Formulation
  3.2 Policy Estimation Using Offline and Online Techniques and Algorithms
    3.2.1 Offline Algorithms
    3.2.2 Online Algorithms
  3.3 N-Step Look-Ahead: An Online, Adaptive, Context-Based Policy Estimation Algorithm
  3.4 Policy Estimation for Lane-Keeping and Lane-Changing
    3.4.1 Offline versus Online Policies for Safety-Critical Applications of AVs
  3.5 N-Step Look-Ahead for Policy Estimation in Grid-Based POMDPs: An Exemplar Active Search Scenario
  3.6 Summary and Conclusion
Chapter 4: Adaptation and Refinement in POMDPs
  4.1 Overview of Existing Adaptation Techniques
  4.2 Transfer Learning and Adaptation: Literature Review and Related Work
    4.2.1 Policy Adaptation
    4.2.2 Adaptation via Model/Data Augmentation
  4.3 Adaptation via Expansion and Post-Expansion Refinement
    4.3.1 Online Adaptation via Expansion
    4.3.2 Post-Expansion Refinement
  4.4 Adapting and Refining Lane-Keeping and Lane-Changing Models
    4.4.1 Adaptation and Refinement in the Lane-Keeping Model
    4.4.2 Adaptation and Refinement in the Lane-Changing Model
  4.5 Policy Transfer in Expandable-Compact POMDP Models
  4.6 Summary and Conclusion
Chapter 5: Summary and Future Directions
  5.1 Impact on MBSE and the Systems Engineering Society
Bibliography
Appendices
  A.1 Belief-Space Approximation Pseudo-code
  A.2 Adaptation via Expansion Pseudo-code
  A.3 Example of Verifying Newly Added Actions with Respect to Physical Constraints and Requirements
List of Tables

3.1 Total values and online policies computed with N = 4
3.2 Estimated Q-values for randomly selected beliefs from the lane-keeping model
3.3 Estimated look-ahead values for randomly selected beliefs from the lane-keeping model (N = 4)
3.4 False-positive (α_h) and false-negative (β_h) probabilities in the observation model
4.1 Comparison of policies estimated during online adaptation via expansion (update criterion enabled vs. discarded)
4.2 Summary of performance statistics (TTC, robustness, policy estimation time) for the models
4.3 Analysis of belief probabilities and probability of failure for the original lane-keeping model
4.4 Estimated Q-values during transfer for target beliefs with high probability assigned to state s'_5 ("Lane changing initiated, unsafe") from the lane-changing model

List of Figures

1.1 Sense-Plan-Act cycle for continuous planning and decision making in autonomous, complex systems
1.2 Overall POMDP model structure
1.3 An example of a grid-based state-space representation in a search and rescue problem; (A) shows the search and rescue environment and (B) presents the grid-map laid out in the environment. The small blue squares identify distinct states
1.4 Example of a belief tree constructed during the planning/estimation phase
1.5 Offline vs. online algorithms and techniques
1.6 An example of a mapping function defined for transfer learning between two environments with the same goal
2.1 Example of defining compact, high-level states by finding patterns/clusters (B) of combinations of state variables with the same behavioral patterns (A)
2.2 Example of a compact POMDP model where states are defined based on clusters of observations with similar distributions
2.3 POMDP model with 3 states, 2 actions, and 3 observations. The upper right section of the diagram shows probabilistic state estimation based on pre-known states; the lower right section shows a possible missing state resulting from unknown observations
2.4 Example of new hidden state initialization (red cluster) after unknown-unknowns are identified based on distances to previously known states
2.5 Overview of the simulated multi-lane freeway environment in Python VTK. The AV (green rectangle) is initially parked in the parking lane
2.6 Sample traffic distribution for lane changing, shown using Gaussian distributions at left and a heat-map of traffic density at right
2.7 Individual clusters/classes of observations obtained from simulation that represent states for the lane-keeping model
3.1 Overview of continuous decision-making and planning using POMDPs; η is a normalization factor such that Σ_{s∈S} b_{t+1}(s) = 1
3.2 Visual comparison of policy estimation and execution using offline (top) and online (bottom) algorithms
3.3 Pruning the belief tree by heuristics and a clustering technique. Pruned branches are identified by red cross marks
3.4 Average runtime of N-Step Look-Ahead (green) compared to look-ahead search with different filters for various POMDP models. Runtimes larger than 300 seconds are replaced with 300 and the plots are smoothed for better graphical representation (hardware specs: Core i5 CPU @ 1.7 GHz, 6 GB RAM, Ubuntu 16.04)
3.5 Evaluating and comparing decision-making and planning for the lane-keeping and lane-changing models using the POMDP with N-Step Look-Ahead, end-to-end planning, and rule-based decision-making. The x-axis identifies the time-steps in simulation and the y-axis represents the decisions (a_0: maintain status quo, a_1: speed up, a_2: slow down, and a_3: change lanes). The topmost section presents the belief probability associated with the s_0 ("Safe to change lanes") state in S_LC
3.6 Expanded belief of the lane-keeping model (left) and normalized belief-action values estimated by N-Step Look-Ahead (right)
3.7 Flow chart of the customized Q-learning algorithm
3.8 Long-term sum of rewards collected using Q-learning over 1500 episodes for the lane-changing (A) and lane-keeping (B) models
3.9 Q-values associated with 4 randomly selected beliefs for the lane-changing model. The x-axis in these plots shows the number of episodes and the y-axis represents the normalized value
3.10 A 2D map of a region including flow channels is selected, a search area is selected, and the state-space is defined by gridifying the selected area (10 x 10 = 100 cells/states); a POA map is identified that assigns high probability (p = 0.03125) to regions with flow channels and low probability (p = 0.00595) to other locations
3.11 Camera model of cell coverage at h = 4 meters
3.12 Results of evaluating N-Step Look-Ahead runtime using different N values, parallel processing, and reachable observations
3.13 Total number of search actions resulting from following the search policies estimated using the N-Step Look-Ahead algorithm with different N values
3.14 Reconstructed map of the area and the heat-map associated with exploration and exploitation
4.1 Overview of the transfer learning framework for transferring knowledge from a source task and environment to a target task and environment by developing a mapping function to identify efficient value and knowledge transfer for adaptation
4.2 Summary of existing TL techniques and metrics in POMDPs
4.3 Overview of phases 1 and 2 of the proposed adaptation and refinement technique. simF denotes a similarity function, S_sim denotes a subset of states with high similarity to the unknown observation, where each state is assigned a similarity weight in simWeights, and M denotes the Expandable-Compact POMDP model
4.4 Exemplar comparison of model size and policy estimation time (using N-Step Look-Ahead) when no labeling is employed (purple) vs. when labeling is used to reduce the size and complexity of the model (blue)
4.5 Example of belief decay as the cluster of the newly added state H_0 gets populated and the dynamics get updated
4.6 Flow chart of the Adaptation via Expansion technique. o'_t is the new unknown observation obtained at time-step t, S_sim denotes the similar states, and SimDict is the hash table employed for storing the labels and numbers of observations associated with newly added states
4.7 Example of how the POMDP changes during online adaptation via expansion (H_0 and H_1 and their observations are added) and after post-expansion refinements (s_0 and s_3 and their observations are joined and a new action (blue) is added)
4.8 Distribution of observation clusters over Δd and Δv
4.9 Example of how the beliefs of s_1 and s_2 decay as new observations are used for updating the dynamics of H_1
4.10 New observations identified during adaptation: left and right show the original and new observation-spaces, respectively, including clusters associated with new, previously unseen observations
4.11 Range and distribution of estimated reward values for the states associated with the pink and light green clusters (new states)
4.12 TTC measures resulting from policy estimation using N-Step Look-Ahead during online adaptation
4.13 Comparison of the rate of change in TTC vs. the AV's velocity, where the AV's velocity is controlled by the policies estimated during adaptation
4.14 Comparing the performance of the refined and expanded (online adaptation) POMDPs to transferred and rule-based techniques
4.15 Overview of the steps for performing probabilistic, model-based comparison based on expected and estimated failure probabilities from the models
4.16 Estimation of the probability of transitioning to the failure state based on belief-spaces
4.17 Distribution of min-max probabilities of transitioning to failure states for the original and refined lane-keeping models
4.18 Results of the data analysis performed on data collected during adaptation via expansion in the lane-changing model
4.19 Overview of integrating the adapted and refined lane-keeping and lane-changing models with a risky, unsafe, complex environment and use-case scenario simulated in CARLA
4.20 Example of behavioral results obtained from the CARLA-POMDP integration; the AV is labeled as Ego in this figure
4.21 State and action mapping functions for the lane-keeping (top) and lane-changing (bottom) models
4.22 Long-term sum of rewards in transfer vs. no transfer for the lane-keeping (left) and lane-changing (right) models. The value of transfer is shown in terms of collected total reward after transfer, asymptotic performance, and jumpstart
4.23 Estimated Q-values during transfer for target beliefs with high probability assigned to state s'_3 ("About to crash") from the lane-keeping model

Abstract

Complex systems, such as Autonomous Vehicles (AVs), usually operate in uncertain and reactive environments, where they are exposed to a variety of noisy and partially available information from the environment. The dynamic and reactive nature of these environments requires the systems to continuously adapt and respond with a plan or decision while accounting for the uncertainty and partially available information and adhering to constraints posed by the system and environment. This implies that when designing models for capturing the system-environment interactions (i.e., the behavior of the system in its environment), deterministic approaches and traditional, static modeling techniques may not be sufficient to account for the probabilistic behavior of complex systems in the environment.

Probabilistic models and techniques, such as Partially Observable Markov Decision Processes (POMDPs), have been successfully employed for modeling system-environment interactions in the presence of partial observability and uncertainty. POMDPs (a reinforcement learning technique, a subset of machine learning problems) are state-based models in which the non-deterministic transitions between states and the partial observations are addressed using probability distributions (the so-called transition and emission functions and probabilities). These models are typically designed with respect to a scenario and a goal in an environment, where the goal is modeled using a reward function. Due to partial observability, the state of the model is continuously inferred and updated probabilistically using Bayesian techniques. The overall goal in a POMDP model is to find a mapping between the probabilistic states (the so-called belief) and the available plans/actions (the so-called policy, found through policy estimation) that maximizes the collected rewards over a time-horizon, so that the system can optimally react to observations from the environment and achieve the goal while adhering to constraints posed by the system and environment.

Although POMDP models have been shown to be successful for planning and decision-making based on the probabilistic behavior of complex systems, gaps and limitations remain. For instance, POMDP models usually suffer from scalability issues due to employing large state-spaces and because of the probabilistic nature of the policy estimation problem. In addition, these models are usually designed with respect to a subset of limited information available at the outset. This implies that the model can be exposed to new, previously unseen observations, either due to missing information or changes in the environment, which affects the performance of state inference and decision-making (e.g., a state is inferred incorrectly and a risky decision is made). There exists a substantial amount of research work and proposed techniques that address the scalability issues by only optimizing policy estimation techniques to achieve near real-time performance. Examples of such techniques are heuristic algorithms, guided search, information pruning, and using a priori information.
However, these techniques only target the scalability issues associated with policy estimation (not model scalability) and are typically implemented to work for POMDP models with specific structures and formulations. To account for new information in a model, various techniques are employed, such as: 1) modeling all possible changes and 2) transferring knowledge. The first technique models each parameter that may change as a "state variable" of a new, enlarged POMDP model and estimates the policy with regard to all possible changes. However, this technique suffers from scalability issues. On the other hand, knowledge transfer uses the information and experience gained from performing one task in an environment for accomplishing different but similar tasks/goals in a similar environment. Although transfer-based techniques have shown promising results for deterministic problems, there is only limited work on transfer in POMDPs. Moreover, the majority of knowledge transfer techniques require the differences and similarities between environments/tasks to be known a priori, which is a strong assumption and may not be applicable in real-world problem-domains, mainly due to the presence of unknown-unknowns (information that we don't know that we don't know).

In light of the foregoing, this thesis presents a novel modeling technique based on POMDPs that employs heuristics, machine learning, and data analysis techniques to: 1) account for scalability in both model design and policy estimation to achieve near real-time performance, and 2) gradually adapt and refine a designed POMDP model and estimated policy with respect to new, unknown information as it is revealed from the environment. The first chapter of this thesis provides the background, motivation, research goals, and hypothesis. An overview of previous work associated with POMDP modeling, policy estimation, and adaptation is also provided in this chapter. In the second chapter, a detailed discussion and overview of existing POMDP modeling techniques is provided. In addition, the "Expandable-Compact POMDP" modeling technique is presented and discussed. To account for scalability in the POMDP model design, states are defined to represent clusters/patterns of similar events obtained from the environment (instead of using distinct state variables and datapoints as states), where both desired (goal) and failure events are also modeled as states. Using distinct clusters to represent states leads to a compact model design and drastically reduces the size and complexity of the model. Moreover, it allows for employing data analysis techniques on the available clusters to identify similar states and observations, which are later used for developing heuristics for policy estimation, adaptation, and refinement. Two different compact POMDPs are designed for lane-keeping and lane-changing (safety-critical applications of AVs) in a multi-lane freeway environment to perform safe, collision-free decision-making. In the next chapter (chapter 3), an overview of existing policy estimation algorithms and techniques, including the employed heuristics, information pruning methods, and available a priori information, is presented and discussed.
To address the scalability issues associated with policy estimation, an online, adaptive policy estimation algorithm is implemented and presented that employs data- and model-driven heuristics to perform a guided search and prune unnecessary (sub-optimal) search directions, in addition to parallel processing, to achieve near real-time performance. The performance (in terms of computation time and policy optimality) of this algorithm is tested and verified by comparing it to policies obtained using a benchmark policy estimation algorithm and a deep learning model implemented using neural networks. Chapter 4 presents a detailed discussion of available adaptation via transfer learning techniques and provides a novel, 2-phase adaptation and refinement algorithm and technique for Expandable-Compact POMDPs. This technique has two phases: 1) online adaptation via expansion and 2) offline post-expansion refinement. In the initial phase, new, previously unseen information is collected and accounted for by gradually adding new states and observations to the model and using the collected data to estimate and update the underlying dynamics based on a similarity metric. In the offline phase (post-expansion), the performance of the adapted model and the collected data are analyzed using data analysis techniques to identify possible inaccuracies and make further refinements. This technique and the proposed algorithm are employed for adapting the POMDPs designed in chapter 2 to a more complex and riskier environment, where risky behaviors such as cutting off into other lanes, sudden stopping in the lane, and closing the gap during lane-changing are expected from the other vehicle agents in the environment. The performance of this technique and algorithm is compared to a policy transfer technique and the results are discussed. Finally, to evaluate and verify the proposed modeling technique (including the compact and expandable POMDP model design, policy estimation, and adaptation and refinement), the adapted models and the policy estimation algorithm are integrated with the CARLA simulation environment and the results are discussed. The final chapter of this thesis (chapter 5) provides a summary and conclusion and discusses future directions for this research.

Chapter 1
Introduction

As systems continue to advance with technology, their size (e.g., number of sub-systems and components) and complexity (e.g., resulting from numerous internal and external interactions) grow rapidly. With the growing complexity and size of systems, the development of scalable and adaptive modeling techniques and tools has become a necessity. Autonomous Vehicles (AVs) and Unmanned Aerial Vehicles (UAVs) are examples of complex systems that often operate in uncertain, partially observable, dynamic, and reactive environments, where they receive information from the environment using sensors and process that information to determine an optimal response, while adhering to constraints (e.g., computation power and time) posed by the system and environment [1, 2, 3, 4, 5, 6]. Partial observability implies that the system cannot directly sense all salient aspects of its environment, whereas uncertainty means that the impact of system actions in the environment is non-deterministic [7]. These systems typically employ models of the environment and of their own system and health states, based on which planning and decision making is done in the reactive, dynamic environment [3, 8, 9, 10].
Partial observability and uncertainty imply probabilistic complex system behavior [11], which needs to be accounted for when modeling the system-environment interactions (i.e., the behavior of the system in a reactive environment). In general, the accuracy and performance of such system models depend on understanding the behavior of the system from its interactions with the environment. The behavior of complex systems can be identified as the result of combinations of known and unknown variables (i.e., system and/or environment states) and the correlations and dependencies between them, which are obtained from the system interacting with its environment. For this purpose, various modeling techniques can be employed, or entirely new modeling techniques can be developed that support the existing system complexity and address existing challenges [12, 13, 14]. Recent studies on modeling system dynamic behavior can be classified into two high-level categories: 1) inferring the underlying model and parameters for classifying and making predictions about the future; and 2) identifying the conditional dependencies, relationships between variables, correlations, and changes in variables over time [15]. The main objective of the former class of problems is to find patterns and parameters from the behavioral data and make predictions about the future. The latter class of problems focuses on identifying the underlying (system or environment) states and correlations resulting from a system's interaction with its environment. The main goal in these problems is to design an abstract representation (model) of the real system-environment interaction and use it for various purposes, such as understanding and reasoning about the system, inference, or planning and decision making [14]. Examples of such techniques and methods are time-varying Gaussian graphical models [15, 16], dynamic Bayesian Networks (DBNs) [15], and Markovian models, including Hidden Markov Models (HMMs) [17], Markov Decision Processes (MDPs) [18], and Partially Observable Markov Decision Processes (POMDPs) [19].

Typically, MDPs and POMDPs (reinforcement learning techniques) are state-based modeling techniques that are popularly employed for addressing planning and decision-making in reactive and dynamic environments. Since the actions and decisions are embedded within the model construct, the relations between the states of the model depend on the performed action (in contrast with HMMs, where transitions between states depend only on the observation, the state transitions in MDPs and POMDPs are triggered based on actions defined within the model). While MDPs are limited to decision-making in deterministic environments with no partial observability, POMDPs, the extension of MDPs to uncertain and probabilistic domains, are specifically employed for planning in uncertain and partially observable environments [20, 21, 22, 23, 24, 25, 26, 27]. Some examples of planning and decision-making using POMDP models are lane changing, intersection crossing, unprotected left-turns, and obstacle avoidance (in applications of AVs), and path planning, navigation, and search and rescue (in applications of UAVs and quadcopters). These models are usually designed with respect to a specific goal (e.g., collision avoidance and smooth lane-changing) within an environment; the goal is formulated using a reward function, and the overall aim in such models is to find an optimal mapping between states and actions that maximizes the reward over a time-horizon.
Finding an optimal mapping is usually referred to as "policy estimation," which is usually performed by optimizing a "value function" whose value depends on the reward function and the model dynamics [19].

POMDP models (and MDPs in general) are usually designed with respect to limited information (i.e., information representing the behavior of the system in an environment) available at the outset, where only a subset of the observations determining system-environment interactions is addressed within the model (partial models). Moreover, POMDPs suffer from various scalability issues due to their probabilistic nature [28]. The research work presented in this thesis focuses on addressing the scalability and adaptability issues and challenges associated with POMDPs and provides a novel, probabilistic, scalable, and adaptable modeling technique that enables: 1) achieving near real-time performance in real-world problem-domains and 2) accounting for new information as it becomes available from the environment. Specifically, the "Expandable-Compact POMDP" modeling technique is presented in this thesis, which uses unsupervised/supervised learning (e.g., clustering/classification) to account for scalability in POMDP model design; the "N-Step Look-Ahead" online policy estimation algorithm is implemented, which uses heuristics and data analysis techniques to achieve near real-time performance; and an adaptation algorithm and refinement technique are presented that enable gradually accounting for new information as it becomes available from the environment.

The remainder of this chapter is organized as follows. Section 1.1 provides the background and motivation for this research. Section 1.2 presents an overview of previous, POMDP-related work, where limitations, gaps, and challenges are discussed (including POMDP modeling and formulation, policy estimation, and adaptation and refinement). Section 1.3 provides the research goals and objectives and presents the research hypothesis. Section 1.4 presents the methodological approach, and section 1.5 provides the thesis overview.

1.1 Background and Motivation

In general, complex systems, when interacting with their environment, typically operate in a continuous "sense-plan-act" cycle (figure 1.1): the system initially receives an observation from the environment by interacting with it; then the system processes the information and plans accordingly; and finally the plan gets executed in the environment.

Figure 1.1: Sense-Plan-Act cycle for continuous planning and decision making in autonomous, complex systems
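The following few lines give a minimal, assumed Python sketch of this sense-plan-act loop; the sensor, planner, and actuator interfaces are placeholders introduced here for illustration and are not part of the thesis.

```python
import time

def sense():                    # placeholder sensor reading (assumed)
    return {"obstacle_distance": 25.0}

def plan(obs):                  # placeholder planner: pick a response to the observation
    return "slow_down" if obs["obstacle_distance"] < 30.0 else "maintain"

def act(action):                # placeholder actuator command
    print("executing:", action)

# Continuous sense-plan-act cycle: observe, decide, execute, repeat.
for _ in range(3):              # bounded here; a real system loops until shutdown
    observation = sense()
    decision = plan(observation)
    act(decision)
    time.sleep(0.1)             # pacing of the cycle (assumed)
```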
As shown in figure 1.1, planning with respect to observations can be performed at various levels: 1) the strategic level, 2) the tactical level, and 3) the operational level. The strategic level, the highest planning level, focuses on finding high-level, general plans to fulfill the system's objective over a long time horizon. Various approaches, such as map-based techniques and A-star graph search algorithms, can be employed for this purpose [29, 30, 31, 32]. Tactical decision making is performed at the second level in the three-level planning hierarchy and is responsible for providing detailed and continuous guidance and maneuver plans to the system. Various techniques, such as finite state machines (FSMs), Markov Decision Process models (MDPs), and POMDPs, have been employed for this purpose [20, 21, 33]. In general, these methods are used to model events and interdependencies and to select the best tactic or plan with respect to the ongoing situation. For instance, some researchers have developed POMDP models for AV planning with the goal of lane changing in a partially visible road environment [20, 21]. In another example, researchers have used DBNs for traffic state estimation and modeled the problem of continuous planning using MDPs [34]. The operational level, the lowest level in the hierarchy, includes the tasks necessary to translate the maneuvers from tactical plans into motion-related actions that can be executed by the system. To address this problem, the majority of the research focuses on developing end-to-end methods that map data (e.g., images) to control commands [35, 36, 37, 38, 39, 40]. Other researchers have employed mathematical modeling, data-driven dynamics modeling, and optimization for motion planning (e.g., [41]).

The majority of the research work in the area of planning and decision-making formalizes the problem using POMDPs. In the existing related work, POMDP models are employed for a combination of tactical and motion planning, where motion-related equations and models are employed to formulate state transitions, and actions represent control commands (mainly acceleration/deceleration rates). In the AV application domain, researchers have used POMDPs to perform planning and decision-making in various use-case scenarios and environments while accounting for different sources of uncertainty and partial observability. For instance, researchers in [20] and [21] developed POMDP models for planning and decision-making in a lane-changing scenario while accounting for inevitable sensor noise as the source of uncertainty. Other researchers used POMDP models for decision-making at intersections while accounting for the hidden intentions of other drivers as the source of partial observability and uncertainty [22]. In this example, motion models are employed for behavior modeling and the decisions are associated with different acceleration rates on a pre-determined path through the intersection. Similarly, researchers in [23] formalize the problem of autonomous driving decision-making in an uncontrolled intersection using a POMDP model, where unclear motion intentions and perception noise are accounted for as sources of uncertainty. In another example, researchers formalize situation-aware autonomous decision-making on urban roads using POMDPs for tactical planning, where the POMDP decisions determine high-level actions such as accelerate, decelerate, and maintain current speed [24]. Other researchers in [25] employ POMDPs for safe and efficient planning for AVs in a crowd, where unknown pedestrian intentions are considered the source of partial observability and uncertainty. In another research work, a POMDP model is employed for planning and decision-making in AVs at sign-controlled intersections using continuous actions, where sensor noise is accounted for as uncertainty [26]. Moreover, researchers in [27] formalize crosswalk and intersection crossing problems using POMDPs with discrete action-spaces that specify various acceleration/deceleration values for the AV. In this research, sensor occlusions are considered as sources of partial observability and uncertainty.

Planning and decision-making using POMDP models is also employed for UAV and quadcopter operations, where various use-case scenarios, such as path-planning and navigation, target search, and search and rescue, are addressed.
For instance, a POMDP model is employed for aerial map reconstruction and for searching for evidence of ancient water on the Martian surface using the Mars Helicopter [5]. In another example, researchers formalize search and rescue operations in non-urban areas using quadcopters as a POMDP, where shading and lighting effects are considered as sources of partial observability [42, 43].

Although POMDPs have proven to be theoretically and mathematically successful in addressing planning and decision-making in the presence of noise, partial observability, and uncertainty, they suffer from scalability issues due to employing large state-spaces and due to policy estimation in a continuous domain (i.e., the belief-space is continuous and solving a POMDP is a PSPACE-complete problem [28]). This implies that POMDP models can be practically infeasible for real-world problems due to the limited computation power on board and the time constraints posed by the environment (i.e., interactions between the system and environment happen in real-time). On the other hand, as discussed in the examples above, POMDP models are usually designed with respect to a specific use-case scenario and environment based on limited data available at the outset, which leads to unknown-unknowns and new, previously unseen observations that cannot be interpreted based on the existing state-space of the designed model. Such observations can result from missing information or changes in the dynamics of the environment and require re-visiting the modeling assumptions and making adaptations and refinements to account for this information when it becomes available from the environment. Otherwise, the result can be model failure or poor performance.

In light of the foregoing, there is a clear need for scalable, adaptive POMDP modeling techniques that can: 1) adhere to constraints posed by the system and environment to achieve near real-time performance and 2) efficiently adapt to changes in the environment and learn from new, previously unseen information when such information becomes available from the environment [7]. In other words, the existing scalability and adaptability limitations associated with POMDPs are the main motivation for this research. Accounting for scalability implies that various heuristics (model- and/or data-driven) and approximations need to be introduced into POMDP models and policy estimation techniques to reduce model size and complexity and avoid unnecessary computations when searching for optimal responses (i.e., decisions, plans, actions). To account for missing information and changes in the environment, POMDPs need to efficiently adapt and learn from new, previously unseen information when such information becomes available from the environment [7]. One key aspect of achieving fast and efficient adaptation is the presence of appropriate learning structures and memory units. Therefore, standard model-free techniques (i.e., techniques that assume no model of the system or environment, where interactions and optimal responses are learned from the results of numerous random interactions with the environment, e.g., black-box, deep learning techniques) do not perform well, because they are tabula-rasa systems and hold no knowledge in their architectures to allow for fast, targeted learning and adaptation [9B]. On the other hand, model-based techniques, which hold knowledge of the system and environment, allow for rapid and efficient adaptation [44].
1.2 Overview of Previous Work

In general, probabilistic planning and decision-making based on the non-deterministic behavior of systems in uncertain, reactive, and dynamic environments is formalized and modeled using POMDPs. The POMDP formalization is used to capture both non-determinism in actions and probabilistic behaviors in state transitions resulting from uncertainty. Various sources of uncertainty and partial observability can be accounted for using POMDPs. Examples of addressed uncertainties and partial observabilities are perception occlusion, sensor limitations and noise, hidden intentions, and control inaccuracies [22, 23, 24, 25, 26, 27, 1].

A POMDP is a state-based modeling technique in which the non-deterministic system-environment interactions (i.e., the probabilistic behavior of the system in an uncertain, reactive environment) are modeled using probability distributions and functions, the so-called transition and emission functions (and probabilities). Each state s ∈ S in a POMDP represents a distinct and unique situation of the system within its environment. POMDPs account for partial observability by assuming that the states can only be inferred probabilistically (i.e., the notion of a belief b_t ∈ B) [11, 19, 45, 46]. The transitions between states and the emissions from each state are triggered based on actions that are embedded within the POMDP model (a ∈ A). POMDP models are usually designed with respect to a goal to be achieved within their environment, where the goal is mathematically defined and formulated using a reward function that determines the reward (or penalty) value for performing an action in a specific state (R(s, a)). Figure 1.2 presents the overall POMDP model structure.

Figure 1.2: Overall POMDP model structure

Basically, at each time period (e.g., time-step), the system-environment interaction is in some unknown state. The system chooses an action, which results in a state transition (s → s') with an associated transition probability pr(s'|s, a). At the same time, the system receives an observation o ∈ Ω, which depends on the new state with the emission probability pr(o|s', a). Finally, the system receives a reward R(s, a) for performing action a in state s, and this process repeats. The overall goal for the system in this iterative process is to choose actions that maximize the expected sum of discounted future rewards. Discounting the future rewards ensures optimality of the action selection process by making sure that the largest possible rewards are collected as early as possible within this iterative process. The sum of collected rewards over a finite time-horizon can be summarized as shown in equation 1.1.

max E[ Σ_{t=0}^{T} γ^t R(s_t, a_t) ]   (1.1)

The series of actions/decisions to be performed at each state that maximizes the long-term sum of rewards (equation 1.1) identifies the optimal policy, which is denoted as π : B → A.
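To make this loop concrete, the following is a minimal, illustrative Python sketch of a discrete POMDP with a Bayesian belief update and a discounted-return computation. It is not taken from the thesis; the two-state, two-action, two-observation model and all probability values are assumed for illustration only.

```python
import numpy as np

# Toy POMDP with |S| = 2 states, |A| = 2 actions, |O| = 2 observations (all values assumed).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[a, s, s'] = pr(s' | s, a)
              [[0.6, 0.4], [0.3, 0.7]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],     # Z[a, s', o] = pr(o | s', a)
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, -1.0], [0.0, 2.0]])     # R[s, a] = reward for action a in state s
gamma = 0.95                                 # discount factor

def belief_update(b, a, o):
    """Bayesian belief update: b'(s') ∝ pr(o|s',a) * Σ_s pr(s'|s,a) b(s)."""
    b_pred = T[a].T @ b          # predicted state distribution after action a
    b_new = Z[a][:, o] * b_pred  # weight by likelihood of the received observation
    return b_new / b_new.sum()   # normalize (the η factor)

def discounted_return(rewards):
    """Equation 1.1 evaluated on a fixed trajectory of immediate rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

b = np.array([0.5, 0.5])                 # initial belief over the two states
b = belief_update(b, a=0, o=1)           # one sense-plan-act step
print(b, discounted_return([1.0, 0.0, 2.0]))
```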
In the available literature, the POMDP-related work can be classified into three main categories:

1. POMDP modeling and formulation
2. Policy estimation
3. Model and/or policy adaptation

Here, POMDP modeling focuses on defining the state and observation-spaces, estimating the transition and emission (i.e., observation) probabilities, and formulating a reward function with respect to the goal in the environment; policy estimation is associated with finding an optimal mapping between beliefs (the belief-space) and the available actions (the action-space) to maximize the collected rewards over a time-horizon, which is usually performed using offline or online techniques and algorithms; and adaptation addresses adaptability in POMDP models and estimated policies (with a strong focus on policy adaptation only, in specific problem-domains such as spoken dialogue systems (e.g., [47])). The following sub-sections provide a detailed overview of the available research work, challenges, and gaps for model formulation, policy estimation, and adaptation in POMDPs.

1.2.1 POMDP Modeling and Formulation

Formalizing a planning and decision-making problem as a POMDP implies defining states from system-environment interactions; identifying an observation-space based on which the states can be inferred (probabilistically); determining an action-space, where actions identify the various responses to observations obtained from the reactive environment; defining transition and emission functions based on the state, observation, and action spaces and estimating the transition and emission probabilities; and, finally, formulating a reward function that mathematically represents the goal of the system in the environment [1]. Defining states and identifying a state-space is the most important aspect of POMDP modeling and formulation, as the transition, emission, and reward functions are all estimated and formulated with respect to the state-space. Depending on the problem and problem-domain (e.g., path planning and navigation, tactical planning, or motion planning), various state-space formulations can be taken into account. State-space formulations in the available literature can be classified into the following categories: 1) grid-based state representation, 2) joint-state representation, and 3) manual state-space identification based on heuristics and experts' judgment.

Grid-based state-space representation is popularly employed in path planning and navigation, where the goal is to find an optimal (e.g., shortest and safest) path between two or more locations and perform actions to navigate towards the destination [5, 43]. In this formulation, an N x N grid-map is laid out in the environment (e.g., a pre-defined map of an area) and each grid-cell (e.g., each small square) is treated as a distinct state in the environment [5, 43]. State transitions are associated with visiting one of the neighboring grid-cells located next to the current grid-cell, and the optimal policy identifies which grid-cell should be visited next. The observation function and observation-space in the grid-based formulation can be defined to identify whether a target is located in the grid-cell being visited (e.g., as used in search and rescue or target search problems [5, 43]) or to provide partial information about the location of the agent (system) on the grid-map. Figure 1.3 presents an example of a grid-based state-space in a search and rescue problem; a small sketch of such a state-space follows below.

Figure 1.3: An example of a grid-based state-space representation in a search and rescue problem; (A) shows the search and rescue environment and (B) presents the grid-map laid out in the environment. The small blue squares identify distinct states
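As an illustration of the grid-based formulation described above, here is a small, assumed Python sketch that enumerates a 10 x 10 grid as 100 states and defines the neighbor-visiting actions and a target-detection observation; it is an illustration, not code from the thesis.

```python
# Grid-based state-space sketch (assumed example): each cell of a 10 x 10 map is one state.
GRID = 10
states = [(r, c) for r in range(GRID) for c in range(GRID)]   # 100 distinct states
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def neighbors(state):
    """Feasible transitions: visiting one of the adjacent grid-cells."""
    r, c = state
    result = {}
    for name, (dr, dc) in actions.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < GRID and 0 <= nc < GRID:   # stay inside the map
            result[name] = (nr, nc)
    return result

def observe(state, target_cell):
    """Binary observation: is the target in the cell currently being visited?"""
    return state == target_cell

print(len(states), neighbors((0, 0)), observe((3, 4), target_cell=(3, 4)))
```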
On the other hand, the joint-state state-space formulation is popularly employed for tactical and/or motion planning under the assumption that a path to follow already exists [22, 23, 24, 25, 26, 27]. In contrast with the path-planning and navigation problem, where the POMDP is employed for finding an optimal path, the joint-state representation assumes a path exists and attempts to provide tactical and motion plans and decisions (e.g., an acceleration or deceleration rate, or high-level tactical decisions such as speed up or slow down). Using the joint-state representation, each state is usually defined as a combination of the state variables and parameters of the system and the environment (including other agents within the environment, such as pedestrians and vehicles). Equation 1.2 provides a formal representation of a state formulated using the joint-state representation:

s = (X_0, X_1, X_2, ..., X_k)^T   (1.2)

where s ∈ S denotes a specific state within the state-space; X_0 = (x_0^0, x_0^1, ..., x_0^n) represents the state of the system within the environment, which may include various state variables and parameters depending on the problem (e.g., pose, location, velocity, and intention); and, similarly, X_1, X_2, ..., X_k represent the states of the other agents and their state variables and parameters within the environment. Motion models and equations are employed both for determining the probabilistic state transitions (i.e., how the state variables change at each time-step given the performed action) and for estimating future observations. Typically, the reward function is formulated to account for transition costs, safety, and failure.

There are numerous examples of the joint-state representation in the POMDP-related literature. For instance, researchers in [22] use the joint-state representation to formulate a POMDP for AV planning and decision-making in complex intersections, where each state is represented as s = (X_0, X_1, X_2, ..., X_k)^T to allow for modeling of interactive behavior in the motion model, which requires all of the scene's vehicles to be present in the state-space. In this formulation, the researchers identify the state of the AV as X_0 = (loc, vel) and the states of the other agents as X_i = (loc_i, vel_i, rout_i) for i ∈ {1, ..., k}, where the route (rout_i) is hidden and 2-D motion equations are used for identifying transitions, estimating routes, and predicting future observations. In another example, researchers formalize autonomous driving decision-making in uncontrolled intersections as a POMDP, where each state within the state-space represents the state variables of the AV and the other vehicle agents. Specifically, the state of the AV is defined as [x, y, θ, v, a_avg, yaw_avg], where [x, y, θ] is the vehicle pose, v is the velocity, and a_avg and yaw_avg represent the average acceleration and yaw angle of the AV. The state(s) of the other vehicle agents include the same state variables and parameters in addition to lateral and longitudinal intentions, which are hidden. Similar to the previous example, 2-D motion equations (considering pose and yaw angles) are employed for transitions, intention estimation, and observation prediction/estimation. The reward function is formulated with respect to the safety of decisions, the time to cross the intersection, and traffic laws [23].
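The following short Python sketch (an assumed illustration, not from the thesis) shows one way a joint state of the form in equation 1.2 can be encoded, and why the state-space grows quickly as agents are added; the field names and counts are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EgoState:          # X_0: state of the system (the AV) itself
    x: float
    y: float
    theta: float
    v: float

@dataclass
class AgentState:        # X_i: another vehicle/pedestrian, with a hidden intention
    x: float
    y: float
    v: float
    route: str           # partially observable in the examples above

@dataclass
class JointState:        # s = (X_0, X_1, ..., X_k)^T
    ego: EgoState
    agents: List[AgentState]

# If each agent's discretized description can take m values and there are k agents,
# the number of joint states scales roughly as m**(k + 1), which is why joint-state
# POMDPs quickly become intractable as more agents enter the scene.
m, k = 50, 6
print(f"rough joint-state count: {m**(k + 1):.2e}")
```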
Other researchers in [25] formalize autonomous driving in a crowd as a POMDP and define the joint-state representation by accounting for the position of the AV (x, y), its orientation (θ), and its instantaneous speed (v), together with the position(s) of pedestrians (x_i, y_i), their goals (g_i, i.e., which direction they are headed), and their speeds (v_i). They employ simple motion models and equations to account for transitions between states, and the observation function is defined using a sensor model.

In addition to the joint-state and grid-based state-space representations, some researchers have manually identified the states within the state-space by accounting for combinations of various situations obtained from system-environment interactions. This type of state-space formulation is employed for tactical planning. For instance, researchers in [20] model the AV lane-changing problem using POMDPs, where the state-space includes all possible combinations of 3 binary state parameters (8 states in total): LcPos (lane-changing is possible), LcInProg (lane-changing is in progress), and LcBen (lane-changing is beneficial). The action-space includes actions such as initiate lane-changing, abort lane-changing, and drive. A signal processing algorithm is employed for determining the binary signals associated with the state variables, and the reward function is formulated manually based on experts' judgment, with the reward/penalty values fine-tuned within a simulation.

Although the joint-state (also known as continuous) and grid-based state-space representations are popularly employed in POMDP modeling and formulation, they usually lead to a large state-space, resulting in scalability issues during policy estimation. For instance, even a small 10 x 10 grid-map leads to a state-space with 100 states. Since the continuous or joint-state representation needs to account for the state variables of all other agents present in the environment, the size of the state-space can grow exponentially as the number of agents increases, making the model non-scalable for real-world applications of POMDPs. The majority of the research focuses on addressing the scalability issues only for policy estimation and does not account for model scalability. On the other hand, although manual state-space identification based on experts' judgment drastically reduces the size of the state-space, it may not be feasible for all problem domains with different complexity levels. To enable POMDP modeling for real-world problem-domains, systems, and environments, scalability should be accounted for in model design and formulation. For instance, a compact representation of a state-space allows for employing a low-dimensional state-space that drastically reduces the computation time and complexity of policy estimation [22]. The "Expandable-Compact POMDP" modeling technique presented in this thesis addresses the model scalability problem by defining a compact state-space, where data analysis and machine learning techniques (unsupervised learning, such as clustering) are employed to find patterns and clusters of similar events and situations, and the distinct clusters/patterns are used to represent states within the POMDP model. A small sketch of this idea is given below.
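As a rough illustration of the cluster-as-state idea just described (an assumed sketch using scikit-learn's KMeans and synthetic data, not the thesis's actual clustering pipeline), observations with similar behavior are grouped and each cluster label is treated as one compact state:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed observation features, e.g., relative distance and relative velocity to a lead vehicle.
rng = np.random.default_rng(0)
observations = np.vstack([
    rng.normal([30.0, 0.0], [3.0, 0.5], size=(200, 2)),   # cruising-like situations
    rng.normal([10.0, -2.0], [2.0, 0.5], size=(200, 2)),  # closing-gap situations
    rng.normal([5.0, -4.0], [1.0, 0.5], size=(200, 2)),   # near-failure situations
])

# Each cluster becomes one compact, high-level state instead of one state per datapoint.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(observations)
compact_states = sorted(set(kmeans.labels_))
print("compact states:", compact_states)

# At runtime, a new observation is mapped to its most likely state by cluster membership.
new_obs = np.array([[9.5, -1.8]])
print("inferred state:", int(kmeans.predict(new_obs)[0]))
```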
1.2.2 Policy Estimation

Many planning and decision-making problems can be formalized as POMDPs, but very few can be solved exactly because of model size and computational complexity. Solving a POMDP means finding the optimal action to be executed at every possible belief within the belief-space. However, due to the probabilistic nature of the belief-space, it is practically impossible to enumerate an action for each and every belief. In other words, finite-horizon POMDPs are PSPACE-complete [48, 28] and infinite-horizon POMDPs are undecidable [49, 50, 28]. In the past few years, many approximation algorithms and techniques have been developed; they can be classified into two distinct groups: 1) offline algorithms and 2) online algorithms.

Offline algorithms assume that the environment and model are fixed and calculate the optimal policy prior to model deployment by evaluating all possible belief-action pairs. While approximate offline algorithms can achieve very good performance, they often require a significant amount of time and computation power for solving large POMDP problems, where there are too many possible situations to enumerate and plan for. In addition, since offline algorithms do not account for changes in the environment, small changes in the environment's dynamics require re-computing the full policy from scratch [22]. In general, these algorithms estimate/calculate an optimal policy by iteratively optimizing a value function. The general formulation of a value function is provided in equation 1.3, where the optimal value of an unknown state s at time-step t depends on the optimal values of the states at the previous time-step (Bellman's principle of optimality) [51].

V_t(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} pr(s'|s, a) V_{t-1}(s') ]   (1.3)

This equation cannot be directly applied to solving POMDPs, since it requires the values from previous time-steps to be available for all possible beliefs within the belief-space, which is practically impossible. The optimal value function of a finite-horizon POMDP can be represented by a set of "hyperplanes," so-called α-vectors, where each hyperplane defines a linear value function over the belief-space associated with an action a ∈ A [52]. The combination of all these α-vectors over the belief-space (after eliminating sub-optimal hyperplanes) determines the optimal value function. Researchers in [53] and [54] have used this technique to develop exact value-function algorithms.

On the other hand, online algorithms try to circumvent the complexity of computing a policy by planning online (when the model is deployed in the environment and the system interacts with it), estimating the best policy locally for the most recent belief obtained from the system-environment interactions [1]. Online algorithms are sometimes also called agent-centered search algorithms [55]. The main idea behind online algorithms is to construct a "belief tree" from the most recent belief obtained from the system-environment interactions, where the most recent belief is at the root node of the tree and the nodes in the lower levels represent the possible future beliefs estimated from the root. The goal is to traverse the tree in some order (e.g., left to right, bottom to top) and find the branch (a possible series of optimal actions/decisions) that leads to the highest discounted reward. Figure 1.4 presents an example of a belief tree, where the root node is denoted as b_0. A sketch of this kind of depth-limited look-ahead over a belief tree is shown below.
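To make the belief-tree idea concrete, here is a generic, depth-limited look-ahead sketch in Python. It is an assumed illustration of online belief-tree search in general, not the thesis's N-Step Look-Ahead algorithm, and it reuses the toy T, Z, R, gamma, and belief_update defined in the earlier POMDP sketch.

```python
# Requires numpy and the T, Z, R, gamma, belief_update definitions from the earlier sketch.

def q_value(b, a, depth):
    """Expected discounted value of taking action a at belief b, then acting optimally."""
    value = float(b @ R[:, a])                    # expected immediate reward
    if depth > 1:
        b_pred = T[a].T @ b                       # predicted next-state distribution
        for o in range(Z.shape[2]):               # branch on each possible observation
            pr_o = float(Z[a][:, o] @ b_pred)     # probability of receiving o
            if pr_o > 1e-9:                       # skip (prune) unreachable observations
                value += gamma * pr_o * belief_value(belief_update(b, a, o), depth - 1)
    return value

def belief_value(b, depth):
    """Best value achievable from belief b with a depth-limited look-ahead."""
    return max(q_value(b, a, depth) for a in range(T.shape[0]))

def lookahead_policy(b, depth=3):
    """Pick the root action whose sub-tree promises the highest value (the online decision)."""
    return max(range(T.shape[0]), key=lambda a: q_value(b, a, depth))

print(lookahead_policy(np.array([0.5, 0.5]), depth=3))
```

The branching factor of this tree is |A| x |O| per level, which is exactly the exponential growth discussed next.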
These scalability issues arise because the number of possible future beliefs grows exponentially with the number of actions and observations as the search for the optimal path extends farther into the future (the tree grows deeper) [1]. On the other hand, looking as far as possible into the future (growing a deeper belief-tree) increases the accuracy of the estimated value and policy. This implies that there is a trade-off between computation time and accuracy in online algorithms that needs to be addressed. Chapter 3 in this thesis provides a detailed discussion of various online and offline algorithms and approximations employed for reducing policy estimation time and complexity. The major difference between online and offline algorithms is that while offline algorithms compute an exponentially large contingency plan considering all possible situations, online algorithms only consider the current situation and a small horizon of contingency plans.

Figure 1.4: Example of a belief tree constructed during the planning/estimation phase

Some online algorithms are capable of handling changes in the environment without requiring re-computation, which allows online algorithms to be applicable in many contexts where offline approaches are not sufficient or applicable. One example of a change in the environment is a change in the tasks that need to be accomplished, as defined within the reward function. One drawback of online planning, in addition to the existing trade-offs, is that the majority of the powerful online policy estimation algorithms usually rely on some information being available a-priori to efficiently guide the search towards optimal solutions. Offline computations are usually employed for providing a-priori information to online algorithms [1]. The table presented in figure 1.5 summarizes the differences between online and offline algorithms and techniques, where drawbacks and advantages are highlighted using red and green colors, respectively.

Figure 1.5: Offline vs. online algorithms and techniques

This thesis presents an online, adaptive policy estimation technique, implemented as the "N-Step Look-Ahead" algorithm, that recursively constructs a belief-tree and traverses its branches to calculate the value associated with each action for the most recent belief [1]. To account for time-accuracy trade-offs and achieve near real-time performance, this algorithm uses model- and data-driven heuristics and data analysis techniques to reduce the size of the tree and eliminate sub-optimal solutions. The N-Step Look-Ahead algorithm is an adaptive algorithm, since policies can be calculated online as the model/environment goes through changes (e.g., the model adapts to changes in the environment). To calculate optimal policies, this algorithm does not rely on information or offline calculations being available a-priori.

1.2.3 Adaptation and Refinement in POMDPs

As mentioned before, the majority of POMDP models are designed and formulated for a specific task/goal within a pre-defined environment where only a subset of the existing uncertainties and challenges is addressed. These models are usually formulated by assuming that all information required to address planning and decision-making in an environment (and use-case scenario) is available at the outset. In other words, most POMDP problems and policy estimation algorithms require the model to be completely known a-priori and to remain unchanged during runtime.
Con- sequently, such models are forced to model all unknowns (e.g., routes for all available vehicles within the perimeter of A V in autonomous driving using POMDPs), resulting in excessively large 17 models that pose computational problems [56]. However, this is not a valid assumption, especially in real-world environments, due to the dynamic and ever-changing nature of the environment. This implies that in real-world applications of planning and decision-making using POMDPs, there will be new information (e.g., unknown-unknowns) resulting from changing environment dynamics and goals that does not exist in the initial data available at the outset and there may be situations that contradict with the initial modeling assumptions [1]. To properly respond to information (i.e., observations) obtained from the environment, the designed POMDP models and their policies need to get updated to adapt to changes and account for new information. Adapting POMDPs may require modifying the initial modeling assumptions and structure and updating the underlying dynamics, which can lead to estimating a new policy. Available methods for handling such changes can be classified as following: 1. Re-planning 2. Modeling all possible changes prior to undergoing changes 3. Transferring knowledge and experience [57] Re-planning focuses on policy estimation only and assumes the information required for updating the policy is already available in model. Thus, a new policy is estimated from scratch while dis- carding all useful information and experience gained from deploying the existing model within the environment. The second technique models each parameter that may change as a “state variable” of a new, enlarged POMDP model [58]. However, such models can quickly become extremely large and unwieldy, beyond the capability of the best policy estimation techniques and computers today [57]. Finally, transfer learning (TL) relies on reusing the available experience and builds upon the existing model or policy in order to adapt to a new environment or task. Transfer learning has been popularly employed in deep learning and reinforcement learning problems. For instance, for the purpose of object detection using convolutional neural networks, some layers of a neural network model that is trained on a different dataset can be retrained (while freezing all other pa- rameters and layers) on another dataset to detect objects that don’t exist in the initial dataset, which 18 is much faster and more efficient than training a new neural network model with random weights from scratch [59, 60, 61]. Despite significant advances in POMDP modeling and TL techniques, gaps and limitations remain. The first limitation is that, in the available literature, TL techniques are mainly focused on deterministic problems that are formalized as MDPs not POMDPs [56]. In fact, there exists only a few research works associated with adapting in POMDPs, which mainly focus on policy adaptation rather than model and policy adaptation. Model and policy adaptation is only addressed in spoken dialogue systems (e.g., [47]). The second limitation is associated with assumptions of transfer learning. The main assumption in transfer learning is that the new environment dynam- ics, tasks, information, and differences between the new and old environment are known a-priori, where this information is used for defining ”mapping functions” to properly map tasks, goals, and information from the old environment to the new one. 
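As a simple illustration of such a mapping function (the state and action names, dictionaries, and Q-table seeding below are hypothetical and not taken from the cited works), an inter-task mapping can be written as a pair of lookup tables that translate source-task states and actions into their target-task counterparts, so that previously learned values can seed learning in the new environment:

# Hypothetical inter-task mapping for transfer learning between two
# driving tasks; names and values are illustrative only.
state_map = {            # old-environment state -> new-environment state
    "slow_traffic": "congested",
    "fast_traffic": "free_flow",
    "steady": "nominal",
    "collision": "failure",
}
action_map = {           # old action -> new action
    "maintain": "maintain",
    "speed_up": "accelerate",
    "slow_down": "brake",
}

def transfer_q_table(q_old):
    """Seed a value table for the new task from the old one via the mappings."""
    q_new = {}
    for (s_old, a_old), value in q_old.items():
        if s_old in state_map and a_old in action_map:
            q_new[(state_map[s_old], action_map[a_old])] = value
    return q_new

# Example: values learned in the old task initialize the new task's table.
q_old = {("steady", "maintain"): 9.5, ("slow_traffic", "speed_up"): 4.2}
print(transfer_q_table(q_old))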
Figure 1.6 provides an example of a mapping function for a decision-making and planning problem in reinforcement learning. However, this is not a valid assumption, given that there may be unknown-unknowns, such as hidden states, new, previously unseen observations, and hidden environment dynamics that should be taken into account.

Figure 1.6: An example of a mapping function defined for transfer learning between 2 environments with the same goal

To address the existing limitations and gaps in POMDP adaptation and refinement, this thesis discusses how probabilistic models formalized as POMDPs can be defined using only limited information such that they can undergo changes and gradually account for new information while accounting for scalability to achieve near real-time performance (Expandable-Compact POMDPs). It then provides a novel, hybrid model-based, data-driven technique and algorithm for adapting and refining these models without relying on any information regarding the new environment/task being available a-priori. This technique includes two phases: 1) online adaptation via expansion and 2) offline post-expansion refinement. In the initial phase, which is performed online (while the model interacts with its environment), new, previously unseen information is collected and accounted for using data- and model-driven heuristics and analytics by gradually adding new states and observations to the model and updating the underlying dynamics by measuring the similarity between new and existing data (e.g., state and observation spaces). Later, in the offline phase (post-expansion), the performance of the adapted model and the collected data are employed to identify possible inaccuracies and inefficiencies and to make further refinements (e.g., adding new actions, removing states, or joining states). Chapter 4 discusses this technique and algorithm in detail.

1.3 Research Objectives and Hypothesis

The main goal of this research is to develop methods to enable adaptability (model adaptability and policy transfer) in POMDP models and achieve accurate, efficient decision-making with near real-time performance. In light of the foregoing, the objective of this research can be summarized as follows: Establishing feasibility of adaptation and transfer learning with POMDP models for engineering complex systems to achieve accurate near real-time decision-making in dynamic and uncertain environments.
To achieve the specified research objectives, the following activities are performed in this research to enable the research objectives: 20 • Integrating appropriate model and data-driven heuristics, machine learning, and data anal- ysis techniques with existing POMDP modeling, policy estimation, and transfer learning techniques • Introducing various approximations to develop a POMDP modeling technique that: accounts for new information as it becomes available via adaptation and transfer learning, and per- forms accurate decision-making with near real-time performance in real-world problem do- mains Based on the identified goals and objectives, the research hypothesis can be summarized as follows: Certain approximations can be introduced to POMDP modeling using heuristics, machine learning, and data analysis techniques to realize POMDP modeling for decision-making and planning in real-world problems by: • Enabling gradual model adaptation and refinement by scaling up models to become accu- rate representation of the true system-environment interactions without adding unnecessary complexity • Containing state-space explosion and reducing computational load to achieve near real-time performance For this purpose, a new modeling technique, ”Expandable-Compact POMDP”, is provided in this thesis, in which the scalability issues are addressed both in model design and formulation (espe- cially in determining and defining the state-space) and policy estimation. Specifically, heuristics, data analysis and unsupervised learning (e.g., clustering) are employed in model design and for- mulation to summarize the data available at the outset by determining clusters and patterns of similar events, which are then employed to represent a compact state-space and reduce the size and complexity of models. Moreover, data analysis techniques (distance-based clustering) and heuristics (e.g., reachable observations) are employed in developing a policy estimation technique for expandable-Compact POMDPs to reduce policy estimation time and cost (computation power) by avoiding unnecessary and redundant computations to achieve near real-time performance. The 21 policy estimation technique is implemented as ”N-Step Look-Ahead” algorithm, which is an on- line, recursive algorithm that estimates the expected value of performing an action (for all actions within the action-space) given the most recent belief by recursively building and traversing a belief- tree. To further reduce the computation time, parallel processing is integrated with this algorithm, so that the belief-action value for all available actions can be computed parallelly, reducing the value estimation time approximately to T total =jAj. Adaptability, refinement, and accounting for new (missing) information is considered by gradually expanding the state-space and underlying dynam- ics in a controlled way to account for possible missing states and observations as new, previously unseen information becomes available from the environment. Gradual adaptation and refinement is implemented in 2 phases: Phase 1- Online adaptation, where possible new states are gradually added to the model and their dynamics are estimated and continuously updated by measuring the ”similarity” between the new and existing information (clusters of observations and distribution of data within each cluster/pattern) using data analysis techniques (e.g., weighted Euclidean dis- tance). This technique is implemented using the ”adaptation via expansion” algorithm. 
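To illustrate how unsupervised learning can turn the data available at the outset into a compact state-space, the sketch below clusters hypothetical observation features (time-to-collision and relative speed) with K-means and treats each resulting cluster as one compact state. The features, data, and the choice of K-means are assumptions made for this example (the models developed in Chapter 2 use a decision tree tuned in simulation for the same purpose).

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical observation features gathered at the outset: each row is
# (time-to-collision [s], speed relative to surrounding traffic [m/s]).
rng = np.random.default_rng(0)
observations = np.vstack([
    rng.normal([8.0, 0.0], [1.0, 0.5], size=(100, 2)),   # nominal driving
    rng.normal([2.0, 3.0], [0.5, 0.5], size=(100, 2)),   # closing in fast
    rng.normal([6.0, -3.0], [1.0, 0.5], size=(100, 2)),  # slower than traffic
])

# Each cluster of similar events is treated as one compact, high-level state.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(observations)

def state_of(obs):
    """Map a raw observation to the index of its compact state."""
    return int(kmeans.predict(np.atleast_2d(obs))[0])

print(kmeans.cluster_centers_)      # one centroid per compact state
print(state_of([2.2, 2.8]))         # state index for a new observation

The adaptation-via-expansion algorithm introduced above then operates on this same clustered representation, comparing newly observed events against the existing centroids.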
A labeling function is also implemented and employed within this algorithm to avoid over-expansion and ini- tialization of redundant states to account for scalability of model by maintaining enough size and complexity. Phase 2- Offline, post-expansion refinement, where various visual and statistical data analysis techniques are employed on collected data (during expansion) and performance of the expanded model to identify further refinements, such as adding new actions and joining states. 1.4 Methodological Approach In this thesis, a combination of methods, including literature review, data collection and analysis, heuristics, modeling and simulation, and experiments are employed. Initially, a comprehensive literature review and background study is conducted on various POMDP modeling techniques employed for planning and decision-making in uncertain environments, including A V and UA V 22 planning and decision-making and spoken dialogue systems (with a focus on transfer learning tech- niques). The main focus of the literature review is associated with understanding and identifying various techniques and approaches employed for defining state-space of models, observation mod- els, and observation-spaces based on data, heuristics, and intuitions. Moreover, different sources of uncertainty, noise, and partial observability are also studied and reviewed. For initializing incomplete, partial models, data collection and unsupervised learning tech- niques are employed to define states from partially available observations, identify which obser- vation types can be interpreted based on the defined state-spaces, and initialize other parameters (e.g., probability distributions/functions) in model. For this purpose, various experimentations and use-case scenarios are simulated within a simulation environment (e.g., simulated using Python) and simulated data is collected from experimenting within each use-case scenario. Moreover, sim- ulation is also used to evaluate and verify the accuracy and performance of developed models in various use-case scenarios. To account for scalability in policy estimation, initially a simple on- line algorithm is implemented and gradually integrated with different heuristics and data analysis techniques, for which the performance and accuracy is measured and evaluated in a simulation en- vironment. For this purpose, various POMDP models are defined and the performance of the policy estimation algorithm in estimating policies are evaluated and compared in numerous use-case sce- narios with different complexities. Finally, statistical data analysis techniques are employed to analyze the performance of the algorithm and accuracy of estimated policies. For comparison pur- poses, a benchmark policy estimation technique (Q-learning algorithm) is also implemented and used for estimating policies. Later, a few numbers of beliefs are sampled from the belief-spaces of the models and online policies (estimated using N-Step Look-Ahead) are compared to offline policies. To account for adaptability and refinement in developed models, initially a simple heuristic, ”most expected outcome” is developed and used to expand the model, where the existing tran- sition and emission probabilities associated with available actions are employed to identify the most expected outcome associated with each action. This heuristic is integrated with one of the 23 developed models and the expanded model is tested and evaluated in a simulation environment. 
For comparison purposes, an end-to-end (directly mapping observations to actions) deep learning decision-making model using neural networks is also implemented, whose performance and ac- curacy is tested and compared to the performance of expanded POMDP models. To generalize the expandability and adaptability technique, various data analysis techniques are employed on exemplar POMDP models to identify how a ”similarity function” can be defined for the adaptation process. This similarity function is used to measure the similarity of the new, previously unseen information with the existing clusters of observations within the exemplar observation-spaces, so that possible missing new state(s) can be added to the model and dynamics can be estimated. In ad- dition, model-driven heuristics are employed during adaptation to efficiently update the estimated dynamics as more data associated with added states are obtained from the environment. Simula- tion and experimentation using simulated use-case scenarios (with different scenario parameters) are employed to test and verify the adaptation technique. Finally, statistical and visual (e.g., perfor- mance plots and diagrams) data analysis techniques and heuristics are employed to make further refinements in the adapted POMDP models. The adapted and refined model(s) with the N-Step Look-Ahead algorithm are later integrated with a simulation environment and various types of performance data depending on the goal within the simulated environment is collected and a ”per- formance score” is calculated using a performance metric formulated for the use-case scenario. For comparison purposes, a benchmark transfer learning technique using Q-learning algorithm is implemented by assuming that the information and model for the new and old environment are available a-priori. This technique is employed for transferring policies between the old and new model (the adapted and refined model) and environment and the transferred policies are also tested and evaluated with simulated data obtained from the simulated environment. The performance of the transferred policies is then evaluated using the performance metric and a performance score is calculated. Finally, the same approach is applied to a different POMDP model with the purpose of adaptation to a complex use-case scenario. Finally, the adapted and refined POMDP models are integrated with CARLA within a highly complex scenario to be verified and tested. 24 1.5 Thesis Overview The second chapter of this thesis focuses on POMDP modeling and formulation and provides the definition and formulation of state, action, observation spaces, and discusses how transition, ob- servation, and reward function using the Expandable-Compact POMDP modeling technique. In this chapter, an overview and formal definition of POMDP models are provided and examples are given. Later the Expandable-Compact POMDP models are formally defined, and notations are specified. Two exemplar POMDP models using the Expandable-Compact POMDP model- ing technique are defined for autonomous driving decision-making and planning in a multi-lane free-way environment simulated in Python. Specifically, one POMDP model is defined for safe and collision-free lane keeping and another POMDP model is developed for safe, smooth, and collision-free lane changing in the multi-lane free-way environment. The last subsection in chap- ter 2 summarizes the discussions within this chapter. 
The third chapter in this thesis explains policy estimation in POMDPs with a strong focus on online algorithms. An overview of various approximate offline and online solutions is provided in this chapter in addition to heuristics ad techniques employed for addressing scalability issues (state-space explosion) in policy estimation.The existing drawbacks, limitations, advantages, and trade-offs for the existing algorithms and techniques are also presented in this chapter. A detailed explanation of implementing the N-Step Look-Ahead algorithm is provided with a pseudo-code for this algorithm, where the time and computation complexity of the algorithm are also discussed. Later in this chapter, the N-Step Look-Ahead algorithm is employed for policy estimation in the lane-keeping and lane-changing POMDPs and the performance is evaluated in a simulated en- vironment. An end-to-end neural network model is also implemented and integrated with the simulation, whose decisions are compared to the estimated policies in terms of failure rate (i.e., how many times the decisions from the neural network and POMDP models lead to transitioning to failure state(s)). This chapter also includes a detailed discussion on comparing online policies to offline policies estimated using a benchmark offline policy estimation technique, Q-learning, 25 which is customized for POMDPs. To evaluate the performance of the N-Step Look-Ahead al- gorithm on path planning and navigation problems, this chapter presents the results obtained for using this algorithm within an active search use-case scenario, where the goal is to have the Mars helicopter perform a guided search on a specific area on Mars to construct a high-resolution aerial map of the area and search for possible indications of ancient water on Mars surface. For this example, the computation power of the onboard computer and physical characteristics of the he- licopter (e.g., maximum flight time and altitude) are taken into account and N-Step Look-Ahead algorithm is integrated with parallel processing to further reduce the policy estimation time. In addition to evaluating the computation time, the total number of steps required for completing the search is calculated with different look-ahead steps (i.e., N) and an appropriate look-ahead value is chosen to perform the search in a simulated environment. Finally, the summary and conclusion of this chapter are provided in the last sub-section of this chapter. Chapter 4 in this thesis focuses on adaptation and refinement techniques in POMDPs. Specifi- cally, transfer learning techniques are discussed, and various examples and algorithms are reviewed in addition to a detailed discussion on transfer learning assumptions, limitations, and advantages. Adaptation via expansion and post-expansion refinement technique and algorithm is discussed in details and the pseudo-code for the ”online adaptation via expansion” algorithm and labeling func- tion are provided. The employed heuristics and data analysis techniques are also presented in this chapter. To present a step-by-step application of the adaptation and refinement technique, this technique is applied to the lane-keeping POMDP model so that the model can adapt to a complex use-case scenario (late-reveal scenario) and environment, where risky behaviors such as sudden stopping on the lane or cutting off to other lanes exists for other vehicle agents within the envi- ronment. 
The performance of the adapted and refined POMDP model is then evaluated in 137 scenarios (with different parameters, where the parameters identify the severity of risky behaviors) and performance data, such as time-to-collision, failure rate, policy estimation time, and robust- ness of model is calculated for performing analysis and calculating a performance score for the model. To compare with the state-of-the-art techniques, a benchmark policy transfer technique is 26 implemented using Q-learning and policies are transferred and evaluated again in 137 scenarios. The results of this comparison and details of implementing the policy transfer technique are also presented in this chapter. A probabilistic, model-based performance evaluation and comparison technique is also presented in this chapter and this technique is employed for comparing different models in terms of estimated failure probability (the overall failure probability and min-max failure probability ranges are calculated and compared). Later, the same adaptation and refinement pro- cedure is applied to the lane-changing POMDP model to adapt to a more complex lane-changing use-case scenario, where other vehicle agents can speed up to close the gap when A V has initiated lane-changing. Finally, the adapted and refined POMDP models (lane-keeping and lane-changing) and the N-Step Look-Ahead algorithm (with parallel processing) are integrated with CARLA. De- tails of the CARLA integration process are also discussed in this chapter. The last subsection in this chapter provides a summary and conclusion. The final chapter in this thesis summarizes the presented research work and discusses future directions. Moreover, potential advances and contributions to Model-Based Systems Engineering (MBSE) are also discussed in this chapter. 27 Chapter 2 POMDP Modeling and Definition 2.1 POMDP Theory and Mathematical Formulation Partially Observable Markov Decision Process (POMDP) is a state-based modeling technique for planning and decision-making in uncertain, non-deterministic environments. POMDPs are gen- eralization of Markov Decision Processes (MDPs) to a probabilistic domain. These models are generally used when the true state of the environment or system-environment interactions cannot be fully observed from the observations and can be only inferred from partially available, noisy observations. Basically, POMDPs are similar to Hidden Markov Models (HMMs) as both assume that the true state of the system (or environment) is unknown, but it can be inferred based on obser- vations. In addition, POMDPs and HMMs assume that the inference about the current state does not depend on the entire history of the system (Markovian assumption) [17]. The key difference between HMMs and POMDPs is that there is an additional aspect of decision making in POMDPs and they are generally used to model systems in which the system interacts with its environment using a series of control actions, where observations depend on both actions and the current state. However, in HMMs, the state transition depends only on the current state and observation, and ob- servations are generated according to a distribution that depends only on the corresponding state. 
Formally, a POMDP is defined using a tuple ⟨S, A, Ω, T, O, R⟩ where:
• S is a finite set of all possible states (s ∈ S)
• A is a finite set of all possible actions (a ∈ A)
• Ω is a finite set of all possible observations (o ∈ Ω)
• T : S × A × S → [0, 1] is the transition function that provides the probabilities associated with performing an action in a state and transitioning to other state(s)
• O : S × A × Ω → [0, 1] is the emission/observation function that provides the probabilities associated with performing an action in a state and receiving an observation
• R : S × A → ℝ is the reward function that provides the reward (or penalty) for performing an action in a state [19]

To address the uncertainty and partial observability associated with observations, a probability distribution over all possible states, the so-called belief b_t ∈ B, is used to summarize past information and infer (probabilistically) the current state of the system-environment interactions from noisy and partially available observations. If a-priori information exists, such as prior probabilities associated with states, this probability distribution can be employed for initializing the belief at time-step t = 0, b_{t=0}. Otherwise, the initial belief, b_{t=0}, can be initialized as a uniform probability distribution over all possible states as shown in equation 2.1.

\forall s \in S : \; b_{t=0}(s) = \frac{1}{|S|} \quad (2.1)

As time progresses and new observations become available from the environment, b_t gets updated based on the current belief probabilities, the performed action, a, and the obtained observation, o, using Bayes' rule as shown in equation 2.2:

b_{t+1}(s_i) = p(s_i \mid a, o, b_t) = \frac{p(o \mid s_i, a) \sum_{s \in S} p(s_i \mid s, a)\, b_t(s)}{\sum_{s' \in S} p(o \mid s', a) \sum_{s \in S} p(s' \mid s, a)\, b_t(s)} \quad (2.2)

where p(o | s_i, a), the emission probability, is the probability associated with observing o after performing action a in state s_i, and p(s_i | s, a), the transition probability, is the probability of performing action a at state s and transitioning to s_i. The denominator in equation 2.2 is a normalization factor so that the belief probability distribution always sums up to 1.

2.2 Existing POMDP Formulations and Applications

As mentioned before, a POMDP is a probabilistic, state-based modeling technique that enables continuous planning and decision-making in reactive, dynamic environments in the presence of various sources of uncertainty. Uncertainties can be related to missing or noisy data due to limited sensor capabilities (e.g., occlusion or shading effects in cameras), hidden information, such as the intentions of other agents/systems in the environment, and unanticipated outcomes of performed actions due to vehicle physical constraints and weather conditions. There exist various applications of POMDPs in the available literature, such as: path planning and navigation in Autonomous Vehicles (AVs) and Unmanned Aerial Vehicles (UAVs), which is employed for various use-case scenarios, such as search-and-rescue and active visual search (in UAVs); and motion (control-level) planning and maneuver (tactical-level) planning in AVs, where examples of use-case scenarios are intersection crossing, pedestrian and obstacle avoidance, lane-keeping, and lane-changing. Depending on the application and problem domain (i.e., navigation, motion, and maneuver (tactical) planning), various formulations can be employed for designing POMDP models.
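Before surveying existing formulations, a worked numeric illustration of equations 2.1 and 2.2 from the previous section is given below for a toy two-state, one-action model; the array layout (T[a][s, s'] = p(s'|s, a), O[a][s', o] = p(o|s', a)) and the probability values are assumptions made purely for the example.

import numpy as np

# Two-state toy model used only to illustrate equations 2.1 and 2.2.
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3],
                  [0.1, 0.9]])}

b = np.full(2, 1 / 2)            # equation 2.1: uniform initial belief

def bayes_update(b, a, o):
    """Equation 2.2: condition the predicted belief on observation o."""
    predicted = T[a].T @ b       # sum_s p(s'|s,a) b(s)
    unnormalized = O[a][:, o] * predicted
    return unnormalized / unnormalized.sum()

print(bayes_update(b, a=0, o=1))   # -> [0.2895, 0.7105], a posterior over the two states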
In general, grid-based state-space representation is employed for path planning and navigation problems and join-state state-space (also known as continuous state-space) repre- sentation is employed for motion and maneuver planning. Some examples of POMDP applications and formulations are provided in the following paragraphs. As one example of planning and decision-making in A Vs, researchers in [22] address motion planning for autonomous driving in complex use-case scenarios and environments such as, merging on a T-junction and unprotected left-turn in an intersection. The subset of uncertainties accounted for in this example are associated with noisy sensor data and hidden intentions of human-driven vehicles in the environment. The state-space is formulated based on behaviors of A V and other 30 vehicles in the environment using a joint-state state-space representation, where a state, s t , is rep- resented as s t =(s 0 ;s 1 ;:::;s K ), where s 0 shows the state of the A V and s k ;k2 1;:::;K provides the states of surrounding vehicles. The state of the A V s 0 provides information regarding the A V pose and velocity, where the state of other vehicle agents, s k , provides information about the pose, ve- locity and estimated route for that vehicle (route is hidden). The actions are defined using different acceleration rates for either speeding up or slowing down and the transition function is defined us- ing 2-D motion equations. The reward function is defined to account for a combination of factors, such as reaching the terminal state, collision, deviation from a reference velocity, and comfort. The observation-space in this POMDP model is consists of the A V’s route (which is known) and the routs of the other vehicles in the surrounding area. The observation model uses a Naive Bayes clas- sifier to estimate and predict the routs for other vehicles. To verify the designed POMDP model, the model is evaluated in a simulation environment. In another example, researchers address the problem of motion planning in A Vs in uncontrolled intersections, while accounting for the estimated intentions of other vehicles in the area [23]. In order to consider uncertain intentions, a continuous HMM is employed to predict both high-level motion intentions, such as: turning right or left and low-level interaction intentions, such as: yield- ing status for related vehicles. In this research, the researchers focus on perception inaccuracies and unclear motion intentions as sources of uncertainties. The state-space of the POMDP model is formulated using joint-state state-space representation and includes the vehicle pose x;y;q , velocity v, the average yaw rate yaw ave , and acceleration a ave . Thus, the joint-state s2 S is de- noted as s= s host ;s 1 ;s 2 ;:::;s N , where s host is the state of the A V , and s i ;i2 1;2;3;:::;N is the state of human-driven vehicles. The action-space for this problem is defined using a discrete set of high-level actions, A= acc;dec;cont , which commands the A V for accelerating, decelerating, and maintaining current velocity. Individual observations are denoted as z= z host ;z 1 ;z 2 ;:::;z N , where z host and z i are the host vehicle and human-driven vehicle’s observations, respectively. The transition function in this problem is modeled using a combination of probabilistic and 2D vehicle motion functions. The observation model/ function is built to simulate the measurement process. 
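The joint-state transition used in these examples can be sketched as a simple 2-D point-mass update applied to every vehicle in the joint state; the field names, the constant-heading assumption, and the zero acceleration assigned to the other vehicles below are illustrative simplifications rather than the exact motion models of [22] or [23].

import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VehicleState:
    x: float        # position [m]
    y: float        # position [m]
    theta: float    # heading [rad]
    v: float        # speed [m/s]

def step(vehicle, accel, dt=0.1):
    """Constant-heading 2-D point-mass update used as a toy transition model."""
    return replace(vehicle,
                   x=vehicle.x + vehicle.v * math.cos(vehicle.theta) * dt,
                   y=vehicle.y + vehicle.v * math.sin(vehicle.theta) * dt,
                   v=max(0.0, vehicle.v + accel * dt))

def step_joint_state(joint_state, ego_accel, dt=0.1):
    """Advance the ego vehicle with the chosen acceleration and keep the
    other vehicles at their current speeds (a deliberate simplification)."""
    ego, others = joint_state[0], joint_state[1:]
    return (step(ego, ego_accel, dt),) + tuple(step(s, 0.0, dt) for s in others)

joint = (VehicleState(0.0, 0.0, 0.0, 10.0), VehicleState(30.0, 0.0, 0.0, 8.0))
print(step_joint_state(joint, ego_accel=1.0))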
The measurements of human-driven vehicles are modeled with a conditional-independence assumption. The host vehicle's observation function is denoted as shown in equation 2.3.

p(z_{host} \mid s'_{host}) \sim \mathcal{N}(z_{host} \mid x'_{host}, \Sigma_{z_{host}}) \quad (2.3)

Since the researchers also account for V2V communications between vehicles, the observation error has almost no effect on the planning result. The reward (i.e., objective) function is a weighted sum over various factors, including safety, time efficiency, and traffic laws. After formulating the POMDP model for this problem, the researchers compare the performance of their model to reactive models in a simulation environment.

In a different example, similar to the previous one, the researchers account for unobservable motion intentions and perception noise as sources of uncertainty in the environment and address motion planning in urban roads [24]. In this research, an urban environment is modeled, where motion intentions are designed with four hypotheses: Stopping, Hesitating, Normal, and Aggressive. The state-space for the POMDP model accounts for vehicle pose and velocity. The joint state s ∈ S is given as s = ⟨s_e, s_1, s_2, ..., s_K⟩, where s is composed of the AV's state s_e and the other vehicles' states s_i, ∀i ∈ {1, 2, ..., K}, and K is the number of obstacle vehicles involved. The action-space is defined as A = {acc, dcc, cont}. The observation-space includes the pose and velocity associated with the different vehicles in the environment. The transition function describes the stochastic system dynamics driven by both the action applied to the ego vehicle (i.e., the AV) and the obstacle vehicles' motion intentions, and employs a Bayesian Network for this purpose. Since the main objective of the ego vehicle in the scenario is to arrive at the target destination as quickly as possible while avoiding collisions with obstacle vehicles, the reward function accounts for a reward associated with achieving the goal, a penalty for crashing into other vehicles, a small penalty (cost) for performing an action, and also a reward associated with the speed of the ego vehicle. Finally, the researchers demonstrate the performance of their POMDP model in a simulated environment and compare the performance to reactive models.

Researchers in [20] address the problem of tactical planning for lane changing in AVs while accounting for sensor noise as the source of uncertainty. For the purpose of this problem, the researchers assess the existing situations for lane changing by considering the dynamic objects in front of and behind the AV in the current and adjacent lanes. The situation is evaluated by a signal processing network, where the outputs of this network, whether a lane change is possible or not and whether a lane change is beneficial or not, are sent to the POMDP model (as observations) for tactical (maneuver) planning. In contrast with the previous examples, the state-space is defined manually and includes 8 distinct states, where each state represents a combination of three independent binary variables: LcPos, LcBen, and LcInProg. LcPos describes whether a lane change is possible, LcBen shows whether a lane change is beneficial, and LcInProg indicates whether a lane change is in progress. The action-space includes high-level, maneuver-related actions, such as: regular driving (straight) ahead, initiating a lane change, and aborting a lane change.
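The eight-state state-space of [20] follows directly from enumerating the combinations of the three binary variables, as the short sketch below shows (the tuple ordering and action names are assumptions made for the example; the associated reward matrix is discussed in the next paragraph).

from itertools import product

# The three binary situation variables used in [20].
variables = ("LcPos", "LcInProg", "LcBen")

# All 2^3 = 8 combinations form the manually defined state-space.
states = list(product([False, True], repeat=len(variables)))
actions = ["drive", "initiate_lane_change", "abort_lane_change"]

for index, s in enumerate(states):
    label = ", ".join(f"{name}={flag}" for name, flag in zip(variables, s))
    print(index, label)
print(f"{len(states)} states x {len(actions)} actions")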
The reward function is defined as a |S| × |A| matrix that includes the reward/penalty values associated with performing different actions in different states. All elements of the reward matrix are set to zero except for the state-action pairs that lead to achieving the goal (safe, smooth, and beneficial lane-changing) or to unsafe, risky situations, such as failure or collision (e.g., r(a = InitiateLC, s = (·, ¬InProg, ·)) = +100, r(a = InitiateLC, s = (·, InProg, ·)) = −10000, and r(a = Drive, s = (LcPos, InProg, LcBen)) = +50). The transition function is also defined as a matrix, and p(s'|s, a) is initialized such that the state transitions roughly fit the transitions observed in real-world driving scenarios, while the status-quo transitions are initialized using expert knowledge. Finally, the researchers use receiver operating characteristic (ROC) curves to demonstrate the performance of their model.

Other researchers in [26] address motion planning for AVs in an intersection traversal problem, where they account for sensor noise as the source of uncertainty in the environment. The researchers assume that the intersection traversal problem is an MDP; however, the ego vehicle cannot directly observe everything happening outside its visibility range due to the geometry of the intersection. The goal of the AV is to traverse the intersection and reach a pre-defined destination beyond the intersection to complete the following tasks: Turn Right, Go Straight, and Turn Left. The main assumption in this problem is that the paths for completing the tasks are available a-priori. The researchers propose two methods to solve the problem. The first method models the problem as a POMDP and uses past observation and action pairs to output continuous actions at each time-step. The second technique models the problem as an MDP with hierarchical options [62], which only takes the latest observation as input and generates discrete high-level options as well as low-level actions simultaneously. The continuous state-space contains the AV's velocity and road geometry information, including the distances between the AV and the lower intersection boundary, the mid-point, and the destination beyond the intersection. The reward function is defined as a summation of various positive values (rewards) and negative values (penalties) associated with different situations in the intersection, such as: a reward that reflects the progress towards the destination, a penalty for getting closer to other traffic participants and obstacles, a constant penalty for a crash, a constant penalty for situations where the task cannot be accomplished within a pre-defined number of steps (delay in accomplishing the task), and finally a positive reward for reaching the goal.

Researchers in [27] demonstrate a generic POMDP approach for decision making applied to autonomous driving scenarios (crosswalk and intersection) with sensor occlusions. The state-space includes the poses of the ego vehicle (i.e., the AV) and the other vehicles to be avoided. The action-space for these scenarios is described based on various maneuvers, such as hard braking, moderate braking, maintaining constant speed, and accelerating. The observation model for these two problems is defined similarly to the state-space. The researchers assume that the ego vehicle can perfectly observe its own pose while it receives noisy measurements of agent poses in the visible areas.
To formulate a reward function, the researchers assign a positive value for reaching a final position and location and a penalty term for collision, where the value of the collision penalty is tuned within a simulation environment. Finally, the researchers evaluate the performance of the model in a simulation environment and compare the resulting decisions to a pre-defined series of decisions. 34 POMDP modeling and formalization has also been popularly employed for aerial robotic mis- sions, using quadcopters, drones, and UA Vs to address problems such as path planning and nav- igation. Examples of POMDP applications in UA Vs and quadcopters are target detection and recognition, target tracking, and search and rescue. As an example, researchers in [63] address the problem of target detection and recognition by UA Vs. The high-level decisions of UA Vs (e.g., fly to a given zone, land, take image, etc.) depends on various stochastic events, such as whether target is detected in a given zone, that may arise when executing the decision rule. In this problem, the goal to be accomplished by the UA V consists of detecting and identifying a car that has a par- ticular model among several cars scattered in an unknown environment and landing close to this car. The total number of states in the state-space of the POMDP model for this problem depends on various discrete variables, such as number of different zones in the environment, N z , differ- ent height levels that UA V can fly (N h ), and number of different vehicles/cars in the environment, N models . The researchers assume that there are only 3 different models of vehicles (e.g., model A, model B, and model C), and the observations for the POMDP model are defined as: no car de- tected, car detected but not identified, identified as model A, identified as model B, and identified as model C. The action-space for this POMDP model includes a set of discrete actions associated with changing zone, changing height, changing view angle of on-board camera, and landing. The transition function is a deterministic model, where the next state of the UA V is directly calculated based on its current state and action in the environment. Although the transitions are assumed to be deterministic, this problem is still non-deterministic, since observations of cars’ models that are obtained from an object detection model are probabilistic. The observation model (object detec- tion model) is learned using data-driven techniques applied to imagery data available at the outset. Finally, the reward function is formulated based on various costs assigned to performing actions in the environment, penalty for landing next to a wrong target, and positive reward for landing next to the correct target. In another example [42], the researchers address the problem of search and rescue in UA Vs using a POMDP model. The goal is to survey a non-urban area and collect evidence about the 35 location of a missing target (e.g. person). To maintain the information on the probability of the target location (i.e., missing person’s location), the UA V maintains a grid-based probabilistic map (i.e., belief-map) composed of cells that represent the discretization of the search space. Each cell in this map contains the probability that the target is present in that cell (grid-based state-space). The observations describe whether a target is detected from images or not (binary classification) and accounts for false positives and false negatives to address uncertainty in observations. 
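The cell-wise belief map described above can be sketched as a Bayes filter over a grid, where a binary detector with assumed false-positive and false-negative rates updates the probability that the target occupies the observed cell; the rates, grid size, and observed cells below are hypothetical.

import numpy as np

P_DETECT = 0.85        # 1 - false-negative rate (assumed)
P_FALSE_ALARM = 0.10   # false-positive rate (assumed)

def update_belief_map(belief, cell, detected):
    """Bayes update of the grid belief map after observing a single cell."""
    if detected:
        likelihood = np.full(belief.shape, P_FALSE_ALARM)   # target elsewhere
        likelihood[cell] = P_DETECT                         # target in this cell
    else:
        likelihood = np.full(belief.shape, 1.0 - P_FALSE_ALARM)
        likelihood[cell] = 1.0 - P_DETECT
    posterior = likelihood * belief
    return posterior / posterior.sum()

belief = np.full((10, 10), 1.0 / 100)                            # uniform prior
belief = update_belief_map(belief, cell=(3, 4), detected=False)  # negative reading
belief = update_belief_map(belief, cell=(3, 5), detected=True)   # positive reading
print(belief[3, 5], belief.sum())    # probability mass concentrates at cell (3, 5)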
High- level actions of the UA V in this example are associated with moving in different XYZ directions (diagonal actions are not employed). In general, the POMDP modeling for planning and decision-making in the available literature mainly focuses on addressing the problem in presence of uncertainties, where the type and source of uncertainty is specified as a-priori. For this purpose, researchers develop POMDP models for a problem domain while addressing only a subset of specific sources of uncertainty in the environ- ment, which is known and identified prior to model development. Moreover, the existing POMDP models are specifically designed to work under certain assumptions defined for the problem domain (e.g., assumptions about environment and constraints) and they neglect to identify how model can be generalized to a group of problems, rather than a specific scenario. For instance, the researchers do not address what changes/ refinements are required in the model based on the current design and assumptions to make the model applicable to similar problems with slightly different assumptions and environments. Most importantly, the existing research in the POMDP modeling and develop- ment area addresses the uncertainty and partial observability problem by extending deterministic models into probabilistic domain and mainly rely on probabilistic and optimization techniques to solve the problem mathematically. In other words, the main effort is associated with addressing the problem from a mathematical point of view rather than the model design and development per- spective. Thus, addressing new and previously unseen information/observations, which is a major source of uncertainty in real-world complex environments and systems is neglected. In addition, majority of the existing work in the A V problem-domain and addressed use-cases focus on motion planning using joint-state state-space representation and 2D dynamic motion models. The main 36 limitation associated with such POMDP model definition and formulation is associated with the size and complexity of the model. Accounting for distinct state variables of agents within the en- vironment leads to a large state-space and results in scalability issues. Moreover, formulating a reward function based on continuous state-space and dynamic motion models is a sophisticated task and requires tremendous amount of fine-tuning within a simulation environment. This prob- lem also exists in formalizing path planning and navigation (for target search and tracking) in UA Vs and quadcopters as POMDPs. For instance, while the state-space in [63] only accounts for N z pre-defined zones, N h pre-defined height levels, and only 3 different car models (A, B, and C), the size of the state-space is N z N h (jmodelsj+ 1), which results in a large state-space leading to scalability issues. 2.3 Expandable-Compact POMDPs To enable POMDP modeling for real-world problem domains and applications, the POMDP mod- els need to account for model scalability and adaptability to adhere to constraints posed by the system and environment and achieve near real-time performance while gradually account for new information (maintaining enough size and complexity) in the model as the new information is re- vealed from the environment [1]. For this purpose, the formal definition of standard POMDPs is refined and extended to develop a new modeling technique, Expandable-Compact POMDPs. 
As discussed in the previous section, majority of the POMDP-related work employs a con- tinuous state-space and focuses on motion planning (operational level), which typically leads to large state-spaces and a complex model. To address scalability in POMDPs, Expandable-Compact POMDP modeling technique focuses on maneuver planning (tactical level, instead of operational level) and high-level tactical decisions rather than operational level decisions (control commands) [1]. In addition, to reduce the model size and complexity, the Expandable-Compact POMDP modeling technique employs data analysis and machine learning techniques (e.g., clustering or classification) to find clusters or patterns of similar events in the data available at the outset and 37 uses distinct clusters/patterns to represent individual states (clustering/classification is employed for defining an approximate high-level state-space). Depending on the data type and state variables, various methods, such as decision-trees, distance-based clustering, K-means, and Gaussian Mix- ture Models (GMMs) can be employed for finding clusters or patterns [64]. The term ”compact” implies that each individual state represents a collection of various combinations of state variables with similar behavioral patterns or distributions (rather than individual combinations of state vari- ables) [1, 3, 4, 5, 46, 20]. Figure 2.1 represents an example of defining compact, high-level states based on combination of state variables within a continuous state-space. Figure 2.1: Example of defining compact, high-level states by finding patterns/clusters (B) of combinations of state variables with same behavioral patterns (A) In the compact state-space representation using clusters and patterns, the goal and failure events are also modeled as states within the state-space. Therefore, states can be categorized as: goal, transient, and failure. Based on this categorization and since the goal and failure events/situations are also modeled as states within the state-space, the reward function can be defined by directly assigning positive/negative values to the states as shown in equation 2.4. In this formulation, the transition and emission functions can be defined as (3D) matrices. The emission probability, p(o2 Cl i js k ;a l ), for each state s k and action s l combination can be defined as shown in equation 38 2.5 where o2 Cl i shows an observation from the i th class/group (i th cluster/class shows the i th state). R(s) s2S = 8 > > > > > > < > > > > > > : R(s goal )= r 1 r 1 > 0 R(s f ailure )= r 2 r 2 0 R(s transient )= r 3 r 2 < r 3 < r 1 (2.4) p(o2 Cl i js k ;a l )= S I(o2 Cl i js k ;a l ) S jSj j=1 S I(o2 Cl j js k ;a l ) (2.5) The transition matrix and probabilities can be initialized similar to the emission matrix and fine- tuned with respect to a series of expected observations. Figure 2.2 demonstrates an overview of the compact POMDP representation. One assumption in modeling using standard POMDPs is that the Figure 2.2: Example of a Compact POMDP model where states are defined based on clusters of observations with similar distributions states, observations, transition, emission, and reward matrices (or functions) are pre-defined and fixed during the model execution and deployment in the environment. This implies the assumption 39 that the initial setup and design of the model accounts for all possible situations and events in the environment. 
However, complex systems operate in highly dynamic environments with various sources of uncertainties and unknown situations. In other words, there may be situations (i.e., observations) that are missing from the initial dataset, which are not addressed/ accounted for in the designed model [1, 4]. Such information, when becomes available from the environment, cannot be correctly interpreted/explained by the available state-space and observation function leading to inaccurate and risky decisions and model failure [65]. This implies that there exists a subset of possibly new states that are currently missing from the model [1]. As shown in figure Figure 2.3: POMDP model with 3 states, 2 actions, and 3 observations. Upper right section of diagram shows probabilistic state estimation based on pre-known states; Lower right section shows a possible missing state resulting from unknown observations 2.3, known partially available observations result in probabilistic state determination (interpreted based on known states and accounted for in belief and emission probabilities), while unknown previously unseen information leads to identification of a possible new hidden state in model. Unknown information and previously unseen observations can result from changing environment dynamics and objectives or missing data from the initial dataset. To account for new information, the compact POMDP model can be expanded to enable accounting for new states and observations in the state and observation space respectively, and expanding the transition, emission, and reward matrices. In the light of foregoing, the Expandable-Compact POMDPs can be defined as a tuple < S + ;A;W + ;T + ;O + ;R + > where: 40 • S + : S[ H identifies the extended finite set of compact states in the state-space including the newly added hidden states H i 2 H • W + :W[Q shows the extended finite set of observations after new observations (o 0 2Q and o 0 = 2W) are identified • A + : A[ ˚ A presents the extended action-space (possible new actions may be added to account for new transitions) • T + : S + A + S + ! 0;1 shows the extended transition function that includes the proba- bilities of transitioning to and from both known and newly added hidden states • R + : S + ! determines the extended reward function including the reward/penalty values associated with added hidden state(s) • O + : S + A + W + ! 0;1 identifies the extended observation function that contains the emission (observation) probabilities of observing both o2W and o 0 2Q from all possible states (known and newly added) It is important to note that the model is defined by the < S;A;W;T;O;R> tuple at the beginning. As unknown information and previously un-observed situations (observations) become available from the environment, these empty sets gradually get accumulated by the newly discovered obser- vations and hidden states associated with new observations. When a new hidden state and observation cluster is recognized and initialized in the model, the transition, emission and reward values for this state and available actions need to be initialized in the model. Since there exists no information available a-priori associated with the new states and observations, these probabilities can be estimated/initialized using data and model-driven heuristic approaches. For instance, ”the most expected outcome” associated with performing each action can be employed to identify transition and emission probabilities associated with new states [1]. 
In other words, after performing a certain action, depending on the effect of the action on the pre-determined state variables and their interdependencies, the most likely observations or states can be assigned higher probabilities and the less likely ones can be assigned lower probabilities. For example, if the states are associated with different locations or directions (e.g., s_0: north, s_1: south, s_2: west, and s_3: east) and the actions are associated with moving in different directions (e.g., a_0: move to the left), the transition probability associated with performing a_0 in a newly initialized hidden state (e.g., H_0) and transitioning to s_2 (west) would be high, because performing action a_0 changes the location such that the new location is to the left of the previous one (with a high probability). However, transitioning to s_3 would be assigned a low probability. On the other hand, transition probabilities from the known states to the hidden state(s) are initially assigned small probabilities (e.g., 0.01), because transitioning from a known state to a newly added hidden state is less likely than transitioning to a known state.

Since each individual state in the Expandable-Compact POMDP formulation represents a cluster/pattern of observations/events, data analysis techniques (e.g., Euclidean distance, Mahalanobis distance, correlation) can be employed to compare states and their distributions to each other and identify their similarities and differences. For instance, the distance between the centroid of a new cluster of observations (i.e., a possible new state) and the centroids of the available clusters (i.e., states) can be measured to identify the most similar state(s), S_sim ⊂ S (figure 2.4). Later, the underlying dynamics for the new state can be estimated based on the measured similarity with S_sim by taking a weighted sum over the dynamics of all s′ ∈ S_sim, as shown in equation 2.6.

Figure 2.4: Example of new hidden state initialization (red cluster) after unknown-unknowns are identified based on distances to previously known states

p(s \mid H_0, a) = \sum_{s' \in S_{sim}} W_{s'} \, p(s \mid s', a)
p(o \mid H_0, a) = \sum_{s' \in S_{sim}} W_{s'} \, p(o \mid s', a)
R(H_0) = \sum_{s' \in S_{sim}} W_{s'} \, R(s')
\quad (2.6)

where W_{s'} denotes the similarity weight associated with state s′ ∈ S_sim, with \sum_{s' \in S_{sim}} W_{s'} = 1; p(s | H_0, a) denotes the probability of performing action a ∈ A in state H_0 and transitioning to state s ∈ S; similarly, p(o | H_0, a) is the probability of observing o ∈ Ω in state H_0 after performing action a ∈ A; and finally, R(H_0) denotes the immediate reward/penalty value associated with H_0.

2.4 Developing Scalable, Adaptive POMDPs for Safety-Critical Applications of AVs

To develop scalable, adaptive POMDP models for maneuver planning in AVs, lane-keeping and lane-changing in a multi-lane freeway environment (two traffic lanes and a parking lane) are formalized using Expandable-Compact POMDPs [1, 4, 46]. These two models are the base models used for evaluating the proposed modeling, policy estimation, and adaptation and refinement techniques presented in this thesis. The multi-lane freeway environment is simulated using VTK in Python 2.7. Each traffic lane in this simulation has a unique traffic distribution specified using parameters such as max speed (e.g., L_1 = 11 mph, L_2 = 13 mph), randomized distances between cars, and lane population (number of vehicles in each lane).
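Returning to the similarity-based initialization in equation 2.6 above, the sketch below measures weighted Euclidean distances between the new cluster's centroid and the centroids of the existing states, converts the k most similar states into normalized weights W_{s'}, and mixes their rows of T, O, and R; the inverse-distance weighting, the value of k, and the array layouts are illustrative assumptions.

import numpy as np

def similarity_weights(new_centroid, centroids, feature_weights, k=2):
    """Normalized similarity weights W_s' over the k most similar states,
    based on a weighted Euclidean distance between cluster centroids."""
    d = np.sqrt((((centroids - new_centroid) ** 2) * feature_weights).sum(axis=1))
    nearest = np.argsort(d)[:k]              # S_sim: indices of the most similar states
    sim = 1.0 / (d[nearest] + 1e-9)          # closer => more similar
    return nearest, sim / sim.sum()          # weights sum to 1

def init_hidden_state(T, O, R, nearest, w):
    """Equation 2.6: initialize the new state's outgoing dynamics as a weighted
    sum of the dynamics of its most similar existing states.
    T: (|A|, |S|, |S|) with T[a, s, s'] = p(s'|s, a)
    O: (|A|, |S|, |Omega|) with O[a, s, o] = p(o|s, a)
    R: (|S|,) state-based rewards."""
    T_H0 = (w[None, :, None] * T[:, nearest, :]).sum(axis=1)  # p(s | H0, a), per action
    O_H0 = (w[None, :, None] * O[:, nearest, :]).sum(axis=1)  # p(o | H0, a), per action
    R_H0 = float(w @ R[nearest])                              # R(H0)
    return T_H0, O_H0, R_H0

A new row T_H0[a] would then be appended to each per-action transition matrix, while transitions from the known states into H_0 are still seeded with small probabilities (e.g., 0.01), as described above.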
The traffic distribution in each lane changes randomly based on the defined parameters for each lane, but none of the vehicle agents in the environment demonstrates risky behaviors such as cutting off into other lanes (no lane changing by other vehicle agents happens in this simulation). Figure 2.5 provides a snapshot of the simulated environment in Python.

Figure 2.5: Overview of the simulated multi-lane freeway environment in Python VTK. The AV (green rectangle) is initially parked in the parking lane.

In this simulation, the high-level traffic patterns (density and flow of traffic) associated with each individual lane are modeled (and explained) using a multivariate Gaussian distribution, as shown in Equation 2.7:

$f(d, V) \sim \mathcal{N}(\mu, \Sigma), \quad \mu = (\mu_d, \mu_V), \quad \Sigma = \begin{pmatrix} \sigma_d^2 & \rho\sigma_d\sigma_V \\ \rho\sigma_d\sigma_V & \sigma_V^2 \end{pmatrix}$    (2.7)

The overall objective for the AV in this environment is to maneuver smoothly and avoid lanes with dense, slow traffic in order to achieve a pre-defined maximum speed while avoiding collisions. Based on the defined objective, environment, and available POMDP models (for lane-changing and lane-keeping), the goal is to execute one of the POMDP models based on the identified strategy (i.e., the high-level maneuver plan: continue driving in the current lane or change lanes to the adjacent lane) for optimal maneuver planning. The high-level plans are obtained by comparing the traffic distributions of the lanes. For instance, if the mean distance and velocity of the traffic in the adjacent lane are larger than the mean distance and velocity in the current lane, the high-level plan becomes looking for possible gaps in the adjacent lane, and the lane-changing POMDP is invoked for maneuver planning and decision-making. Figure 2.6 (left) provides an example of normal distributions obtained from normalized traffic data in lanes 1 (the AV's current lane) and 2 in the simulation [1, 4, 46]. As shown in the figure, traffic in the second lane has a higher speed and lower density (distances between vehicles and number of vehicles) compared to lane 1, as its Gaussian distribution lies in the top-right corner of the plot. Moreover, while the speed in the first lane varies over a broad range (distribution stretched out along the y-axis), vehicles in the second lane have similar speeds. Traffic density, measured based on the gaps between cars, is also represented using a heat-map at right, which shows that the second lane has smoother traffic (low traffic density).

Figure 2.6: Sample traffic distributions for lane changing, shown using Gaussian distributions (correlation term $\rho\sigma_d\sigma_V$ not shown) at left, and a heat-map of traffic density at right.

The sources of uncertainty addressed in the POMDP formalization are associated with noisy data and the hidden intentions of drivers. The observation-space is defined using parameters such as time-to-collision (TTC), the AV's size (len), time-to-catch-up (TCU) depending on the AV's maximum acceleration, and the AV's vicinity. TTC (in seconds) is calculated by dividing the distance between the AV and cars in the vicinity in front (behind) by the AV's speed relative to the same cars (Equation 2.8):

$TTC_{AV,k} = \dfrac{\lVert (x_k, y_k) - (x_{AV}, y_{AV}) \rVert}{V_{AV} - V_k}$    (2.8)

Based on the calculated TTC, cars can be identified as "far-away" if not in the vicinity, "nominal/safe" if TTC $\ge$ 5 s, "close" if 0.5 < TTC < 5, and "failure/collision" if TTC $\le$ 0.5.
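The same-lane discretization just described can be sketched as follows; the thresholds mirror the text above, while the function names, signatures, and the handling of non-closing vehicles (infinite TTC) are illustrative assumptions rather than the thesis implementation.

import math

def time_to_collision(av_xy, av_speed, car_xy, car_speed):
    """TTC of Eq. 2.8: relative distance divided by relative speed (seconds).

    Assumes the AV is closing in on the other car; otherwise TTC is infinite.
    """
    dist = math.hypot(car_xy[0] - av_xy[0], car_xy[1] - av_xy[1])
    rel_speed = av_speed - car_speed
    return float("inf") if rel_speed <= 0 else dist / rel_speed

def classify_same_lane(ttc, in_vicinity):
    """Discretize a same-lane observation from its time-to-collision."""
    if not in_vicinity:
        return "far-away"
    if ttc >= 5.0:
        return "nominal/safe"
    if ttc > 0.5:
        return "close"
    return "failure/collision"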
Similarly, the cars in the adjacent lane(s) are identified as "far-away" if not in the vicinity, "rear-adjacent-close" if the distance between the AV and the car(s) behind is strictly less than $d_1$, where $d_1 = w_1 \cdot len + w_2 \cdot (\text{relative speed}) \cdot TCU$, "front-adjacent-close" if the AV's speed is greater than the average speed of the car(s) in front or the relative distance is less than $d_2$, where $d_2 = w_3 \cdot len$, and "nominal/safe" otherwise. The weights ($w_i$) can be adjusted to change the aggressiveness of the AV during lane-changing. A decision-tree whose parameters are tuned in the simulation is employed for finding clusters/classes of similar events from observations. Figure 2.7 provides the identified clusters for the lane-keeping model.

Figure 2.7: Individual clusters/classes of observations obtained from the simulation represent states for the lane-keeping model.

In this simulation, the combinations of TTCs, relative velocities, and relative distances that are not addressed in the observation-space are identified as potential missing values from the observation-space. The action-space is defined as: Maintain status quo (do nothing), Speed up (+1 m/s²), Slow down (-1 m/s²), Change lanes (left or right), and Stop. The steering angle for changing lanes is obtained based on the target location in the adjacent lane, the AV's current location, and its aggressiveness. All of the actions are employed in the lane-changing expandable POMDP, while only the first three actions are used for lane-keeping. The developed Expandable-Compact POMDP models can be summarized as follows:

State-space ($S_{DL}$) and action-space ($A_{DL}$) of the lane-keeping model:

• $s_0$: Driving slower than traffic (Transient: $R(s_0) = +1$)
• $s_1$: Driving faster than traffic (Transient: $R(s_1) = +1$)
• $s_2$: Steady (nominal/safe) (Goal: $R(s_2) = +10$)
• $s_3$: Failure/collision (Failure: $R(s_3) = -20$)
• $a_0$: Maintain status quo
• $a_1$: Speed up (+1 m/s²)
• $a_2$: Slow down (-1 m/s²)

State-space ($S_{LC}$) and action-space ($A_{LC}$) of the lane-changing model:

• $s_0$: Safe to change lanes (Goal: $R(s_0) = +10$)
• $s_1$: Not beneficial (low speed) (Transient: $R(s_1) = +1$)
• $s_2$: Not beneficial (high speed) (Transient: $R(s_2) = +1$)
• $s_3$: Unsafe to change lanes (Transient: $R(s_3) = +1$)
• $s_4$: Failure/collision (Failure: $R(s_4) = -20$)
• $a_0$: Maintain status quo
• $a_1$: Speed up (+1 m/s²)
• $a_2$: Slow down (-1 m/s²)
• $a_3$: Initiate lane changing
• $a_4$: Stop (deterministic action)

The most expected outcome heuristic is initially employed during the early experiments in this thesis for expanding the state and observation spaces and estimating the underlying dynamics. Later (in Chapter 4), similarity-based expansion and data-driven refinements are employed to adapt and refine both the lane-keeping and lane-changing models to more complex and risky environments.
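To connect the two models to the high-level plan selection described earlier in this section, the sketch below fits a per-lane Gaussian over (distance, velocity) samples (Equation 2.7) and compares lane means to decide which POMDP to invoke; the function names, the simple mean comparison, and the data layout are assumptions made for illustration.

import numpy as np

def fit_lane_distribution(samples):
    """Fit the multivariate Gaussian of Eq. 2.7 to (distance, velocity) samples.

    samples: array of shape (n, 2) with columns [gap distance, velocity].
    Returns (mu, Sigma).
    """
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    return mu, sigma

def select_high_level_plan(current_lane_samples, adjacent_lane_samples):
    """Pick which POMDP to invoke: 'lane-changing' if the adjacent lane has a
    larger mean gap distance and mean velocity, otherwise 'lane-keeping'."""
    mu_cur, _ = fit_lane_distribution(current_lane_samples)
    mu_adj, _ = fit_lane_distribution(adjacent_lane_samples)
    if mu_adj[0] > mu_cur[0] and mu_adj[1] > mu_cur[1]:
        return "lane-changing"
    return "lane-keeping"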
2.5 Summary and Conclusion

This chapter focuses on the POMDP model definition and formulation problem and provides an overview of the related work. Specifically, various applications of path planning, navigation, motion planning (operational level), and maneuver planning (tactical level) in various use-case scenarios (e.g., intersection crossing, autonomous driving in crowds, lane changing, search and rescue, and target search) are specified. In addition, the joint-state (continuous state-space) and grid-based state-space representations for formalizing decision-making and planning problems as POMDPs are reviewed, and the addressed sources of uncertainty are identified. The existing limitations and gaps associated with motion planning using joint-state state-space representations and path planning using grid-based state-space representations are mainly due to the size and complexity of the defined model (i.e., scalability issues result from employing a large state-space). Moreover, the majority (if not all) of the related work focuses on defining and formulating models for a specific use-case scenario and environment by assuming that all information required for performing robust and accurate decision-making and planning is available a priori at the outset [1, 4]. By relying on this assumption, the available work designs/defines models with a fixed set of state, observation, and action spaces and does not discuss model adaptability or refinement.

To enable POMDP modeling for real-world problem domains and systems, scalability and adaptability need to be addressed in POMDP formulation and model definition. For this purpose, the "Expandable-Compact" POMDP modeling technique is presented. The term "Compact" implies that the state-space uses a compact representation of states (i.e., employing data analysis, classification, and clustering to identify distinct groups/patterns of observations with similar behavioral patterns and using the distinct clusters to represent states), which drastically reduces the size and complexity of the model (i.e., transition and emission matrices instead of 2D dynamic motion equations, and a reward vector with relative reward values assigned to states depending on their categories (goal, failure, or transient) instead of a sophisticated reward formulation and refinement). The term "Expandable", on the other hand, implies that the state, observation, and action spaces and the underlying dynamics can be gradually expanded using data- or model-driven heuristics to account for new (possibly missing) states, observations, and dynamics.

Section 2.4 of this chapter focuses on developing scalable, adaptive POMDP models for the safety-critical application of AVs and defines two models for safe and smooth lane-keeping and lane-changing in a multi-lane freeway environment using the Expandable-Compact POMDP formulation. To define the state-spaces for these models, data (i.e., partial observations) is collected from the simulated environment (implemented using Python VTK) and a decision-tree (with parameters fine-tuned in simulation) is implemented to find distinct classes (groups) of observations. Expandability (adaptability) in these models is enabled by estimating the underlying dynamics for possible new states using a model-driven heuristic approach, the most expected outcome, where the dynamics (transition and emission probabilities) for new states and the available actions are initialized by analyzing the effect of different actions on state transitions. These two models are the base models defined in this research for experimenting with policy estimation, adaptation, and refinement.

Chapter 3
Policy Estimation in POMDPs

The overall goal in POMDP models is to find an optimal mapping between the probabilistically inferred states (i.e., beliefs $b_t \in B$) and the action-space, such that the sum of collected rewards over a finite time-horizon is maximized [1, 4, 28, 18, 51]. The series of actions/decisions to be performed at each belief that maximizes the long-term sum of expected rewards is known as the "optimal policy" and is denoted $\pi^* : B \rightarrow A$.
For this purpose, there exist various offline and online algorithms that use approximations and heuristics to optimize the process of searching for optimal policies. This chapter focuses on the policy estimation problem, reviews the available techniques and algorithms (including their limitations and advantages), and presents an online, adaptable policy estimation technique and algorithm. Finally, the policy estimation algorithm is employed for estimating policies for the lane-keeping and lane-changing models, where its performance and accuracy are evaluated and compared to other decision-making techniques within a simulation environment.

3.1 Theory and Mathematical Formulation

In general, continuous decision-making and planning using POMDPs (after the model is defined, formulated, and deployed in the environment) is associated with iteratively and continuously performing the following steps:

1. Inferring belief $b_t$ based on the performed action $a_t$ and the obtained observation $o_t$ at time-step $t$
2. Estimating/finding the next optimal action/decision for the inferred belief
3. Executing the action in the environment

Figure 3.1 presents an overview of this iterative process.

Figure 3.1: Overview of continuous decision-making and planning using POMDPs; $\eta$ is a normalization factor such that $\sum_{s \in S} b_{t+1}(s) = 1$.

As discussed in the previous section, inferring the belief from partial observations can be accomplished using Bayes' rule (Equation 2.2). The next step is to choose/find an optimal action/decision based on the inferred belief. This action is determined by the agent's policy $\pi$, specifying the probability that the agent will execute any action in any given belief; i.e., $\pi$ defines the agent's (system's) strategy for all possible situations it could encounter [50]. The optimality criterion (for obtaining optimal policies) is to maximize the expected sum of rewards (also known as the return or discounted return) over a finite time-horizon. Formally, the return obtained by following a specific policy $\pi$ from a certain belief $b_t$ is defined as shown in Equation 3.1:

$V^{\pi}(b_t) = \sum_{a \in A} \pi(b_t, a) \Big[ R(b_t, a) + \gamma \sum_{o \in \Omega} pr(o \mid b_t, a)\, V^{\pi}(b_{t+1} \mid b_t, a, o) \Big]$    (3.1)

where $0 < \gamma \le 1$ is the discount factor that ensures large rewards are collected as early as possible, $\pi(b_t, a)$ is the probability that action $a$ will be executed in belief $b_t$, and $R(b_t, a)$ is the immediate reward associated with performing action $a$ in belief $b_t$, formally defined as shown in Equation 3.2:

$R(b_t, a) = \sum_{s \in S} b_t(s)\, R(s, a)$    (3.2)

The obtained value is maximized (i.e., the optimal value for a belief) by following the optimal policy $\pi^*$ from that belief. The optimal value function $V^*$ of the optimal policy $\pi^*$ is the fixed point of Bellman's equation [66] and is defined as shown in Equation 3.3:

$V^*(b) = \max_{a \in A} \Big[ R(b, a) + \gamma \sum_{o \in \Omega} pr(o \mid b, a)\, V^*(\tau(b, a, o)) \Big]$    (3.3)

where $\tau(b, a, o)$ identifies the updated belief obtained from performing action $a$ in belief $b$ and receiving observation $o$. Formally, the optimal policy $\pi^*$ can be defined as shown in Equation 3.4:

$\pi^* = \operatorname{argmax}_{\pi \in \Pi}\; \mathbb{E}\Big[ \sum_{t=0}^{T} \gamma^t \sum_{s \in S} b_t(s) \sum_{a \in A} R(s, a)\, \pi(b_t, a) \;\Big|\; b_0 \Big]$    (3.4)

Another useful quantity is the value associated with executing a given action $a$ in a belief $b$, which is denoted the Q-value and is formally defined in Equation 3.5:

$Q^*(b, a) = R(b, a) + \gamma \sum_{o \in \Omega} pr(o \mid b, a)\, V^*(\tau(b, a, o))$    (3.5)

$Q^*(b, a)$ determines the value of executing action $a$ under the assumption that the optimal policy is followed thereafter [50].
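The quantities above translate directly into a few lines of array code. The following is a minimal numpy sketch (the array layouts T[s, a, s'], O[s, a, o], and R[s, a] are assumptions made for illustration) of the belief update $\tau(b, a, o)$, the belief reward $R(b, a)$ of Equation 3.2, and the observation probability $pr(o \mid b, a)$ that appears in Equations 3.1 and 3.3.

import numpy as np

def update_belief(b, a, o, T, O):
    """Bayesian belief update tau(b, a, o) for a discrete POMDP.

    b: belief over states, shape (S,); T: (S, A, S) transitions; O: (S, A, W) emissions.
    """
    predicted = b @ T[:, a, :]                   # sum_s b(s) T(s, a, s')
    unnormalized = O[:, a, o] * predicted        # O(s', a, o) * prediction
    return unnormalized / unnormalized.sum()     # eta normalizes the new belief

def belief_reward(b, a, R):
    """R(b, a) = sum_s b(s) R(s, a), per Eq. 3.2."""
    return float(b @ R[:, a])

def observation_prob(b, a, o, T, O):
    """pr(o | b, a), the weighting term used in Eqs. 3.1 and 3.3."""
    return float((b @ T[:, a, :]) @ O[:, a, o])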
The following section provides an overview and literature review of the various offline and online techniques and algorithms employed for estimating the optimal value $V^*$ and policy $\pi^*$, in which the above equations are used to guide the search via heuristics (in online techniques) or are approximated to compute offline policies.

3.2 Policy Estimation Using Offline and Online Techniques and Algorithms

Many planning and control problems can be modeled as POMDPs, but very few can be solved exactly because of their computational complexity: finite-horizon POMDPs are PSPACE-complete [48, 67] and infinite-horizon POMDPs are undecidable [49, 50]. For this reason, many approximation algorithms and techniques have been developed; these can be classified into two distinct groups: (1) offline algorithms and (2) online algorithms.

Offline algorithms calculate the optimal policy, that is, the best action to execute for all possible situations, prior to execution by evaluating all possible belief-action pairs. While approximate offline algorithms can achieve very good performance, they often require a significant amount of time for solving large POMDP problems, where there are too many possible situations to enumerate and plan for. In addition, offline algorithms do not account for changes in the environment; small changes in the environment's dynamics require recomputing the full policy from scratch [22]. Online algorithms, on the other hand, try to circumvent the complexity of computing a policy by planning online when the model is deployed in the environment (i.e., while the agent/system interacts with its environment). Online algorithms find optimal policies by estimating the best policy (locally) based only on the most recent information (i.e., belief) [1]. Online algorithms are sometimes also called agent-centered search algorithms [55].

The major difference between online and offline algorithms is that offline algorithms compute an exponentially large contingency plan considering all possible situations, whereas online algorithms only consider the current situation and a small horizon of contingency plans. Some online algorithms are capable of handling changes in the environment without requiring more computation, which makes them applicable in many contexts where offline approaches are not sufficient or applicable. One example of such changes is a change in the tasks that need to be accomplished in the environment, as defined within the reward function. One drawback of online planning is that it generally needs to meet the (near) real-time constraints posed by the environment, which greatly reduces the available planning time compared to offline approaches. The real-time constraints in online, continuous planning using POMDPs also result in accuracy/computation-time trade-offs in online solvers [1].

Figure 3.2: Visual comparison of policy estimation and execution using offline (top) and online (bottom) algorithms.

3.2.1 Offline Algorithms

POMDPs can be solved optimally for a specific finite time-horizon by employing iterative techniques and algorithms such as the value iteration algorithm, which uses dynamic programming to calculate accurate values associated with each belief by continuously updating the estimated value of a belief based on values obtained in previous iterations (i.e., Bellman's principle of optimality [66]).
In other words, the value function at horizon $t$ is constructed from the value function at horizon $t-1$ and can be calculated using Equation 3.6:

$V_t(b) = \max_{a \in A} \Big[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o \in \Omega} p(o \mid b, a)\, V_{t-1}(b' \mid b, a, o) \Big]$    (3.6)

The optimal policy associated with each belief can then be obtained from the value function using Equation 3.7:

$\pi_t(b) = \operatorname{argmax}_{a \in A} \Big[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o \in \Omega} p(o \mid b, a)\, V_{t-1}(b' \mid b, a, o) \Big]$    (3.7)

where Equation 3.7 assigns exactly one action to a specific belief and therefore should be calculated for all possible beliefs. The optimal value function of a finite-horizon POMDP can be represented by a set of hyperplanes, so-called $\alpha$-vectors, where each hyperplane defines a linear value function over the belief space associated with an action $a \in A$ [52]. Researchers in [53] and [54] have used this technique to develop exact value function algorithms.

Due to the high complexity of exact solution approaches (e.g., the number of $\alpha$-vectors required for representing the value function grows exponentially at each iteration of the value iteration algorithm), the development of approximate offline approaches that can be applied to larger problem domains has been the focus of researchers in this field. Moreover, approximate offline algorithms are also employed within online algorithms to compute lower and upper bounds on the optimal value function and orient the search in promising directions by applying "branch-and-bound" pruning techniques. Examples of offline solver algorithms are discussed below.

Blind Policy: A blind policy is a policy in which the same plan/action is always executed, regardless of the underlying situation (belief). The value function of any blind policy defines a lower bound on the optimal value ($V^*$), since it corresponds to the value of one specific policy that the agent (e.g., system) could execute in the environment. The value function (Equation 3.3) in offline techniques is specified by a set of $\alpha$-vectors, where each vector defines the long-term sum of expected rewards for following its corresponding blind policy. The $\alpha$-vectors can be computed as shown in Equation 3.8:

$\alpha^a_{t+1}(s) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, \alpha^a_t(s')$    (3.8)

where $\alpha^a_0 = \min_{s \in S} R(s, a)/(1-\gamma)$. After calculating the $\alpha$-vectors, Equation 3.9 is used for calculating the lower bound on the value of a belief state:

$V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s)$    (3.9)

where $\Gamma_t = \{\alpha_0, \alpha_1, \ldots, \alpha_m\}$ is the set of hyperplanes ($\alpha$-vectors), each of which defines a linear value function over the belief space associated with some action $a \in A$. The lower bound obtained using this technique can be computed very quickly; however, it is not tight enough to provide sufficient information.

Point-Based Algorithms: Point-based methods are appropriate for obtaining tighter, more informative lower bounds that can be used by branch-and-bound techniques. Point-based algorithms approximate the value function by updating it for some selected beliefs. These methods sample beliefs by simulating various random interactions between the agent (i.e., system) and the POMDP environment and update the value function for those sampled beliefs [50, 68, 69]. In other words, point-based solver methods circumvent the complexity associated with offline, exact algorithms by sampling a small belief set.
There exist various algorithms developed based on point-based approaches, such as Point-Based Value Iteration (PBVI) [70], Perseus [71], Heuristic Search Value Iteration (HSVI) [68], GapMin [72], the Point-Based Error Minimization Algorithm (PEMA) [73], and Forward Search Value Iteration (FSVI) [74].

MDP and QMDP: The MDP approximation approach approximates the value function $V^*$ of a POMDP model based on the value function of the underlying MDP model [75]. The resulting value function creates an "upper bound" on the POMDP value function and can be calculated using Bellman's equation as shown in Equation 3.10:

$V^{MDP}_{t+1}(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{MDP}_t(s') \Big]$    (3.10)

Then, the value associated with a belief, $\hat{V}(b)$, can be obtained as $\hat{V}(b) = \sum_{s \in S} V^{MDP}(s)\, b(s)$.

The QMDP approximation approach assumes that all partial observability disappears after a single step. It assumes the MDP solution is computed to generate $V^{MDP}_t$, based on which Q-values can be obtained using Equation 3.11:

$Q^{MDP}_{t+1}(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{MDP}_t(s')$    (3.11)

This technique defines an $\alpha$-vector for each action and provides an upper bound on the optimal value function $V^*$. This upper bound is tighter than the upper bound obtained from the MDP approximation. Similarly, Equation 3.9 can be used to obtain the value associated with each belief [50].

Fast Informed Bound (FIB): As mentioned before, the upper bounds obtained using the MDP and QMDP approximation approaches do not account for partial observability. The Fast Informed Bound (FIB) is another approximation technique for computing upper bounds on the value function, and it is able to take the partial observability of the environment into account [76]. The $\alpha$-vector update process in this approach is provided in Equation 3.12:

$\alpha^a_{t+1}(s) = R(s, a) + \gamma \sum_{o \in \Omega} \max_{\alpha \in \Gamma_t} \sum_{s' \in S} O(s', a, o)\, T(s, a, s')\, \alpha_t(s')$    (3.12)

where $\alpha^a_0$ is initialized based on the $\alpha$-vectors computed using the QMDP approach. The upper bound provided by this approach is tighter (more informative) than the upper bounds calculated using the MDP and QMDP approaches.

3.2.2 Online Algorithms

Online algorithms are, in general, better alternatives for policy estimation in large, more complex POMDP problems, where offline algorithms can take minutes to hours to estimate optimal policies. Online algorithms try to find a good local policy for the most recent belief, which is obtained based on the belief at the previous time-step and the most recent observation received from the environment. The advantage of online approaches is that they only need to consider beliefs that can be obtained/computed (i.e., reached) from the current (most recent) belief [50]. One major difference between offline and online algorithms is that, since online planning has to be performed at every time-step (whenever there is a new observation from the environment with which to update the belief), it is sufficient to calculate only the maximal value for the most recent belief, not the full optimal $\alpha$-vector (i.e., a greedy approach). In addition, the overall time for policy construction/estimation and execution is in general lower for online approaches [50]. In general, online algorithms consist of two phases: planning (estimation) and execution (as shown in Figure 3.2).
During the planning phase, the online algorithm receives the current (most recent) belief about the agent (system)-environment interaction and builds a tree (the so-called belief tree) of reachable beliefs (obtained from the current belief) by looking several possible sequences of actions and observations ahead. The current belief becomes the root node of the constructed tree, and all belief nodes calculated at lower levels of the tree become the children of nodes located at higher levels (Figure 1.4). After constructing the belief tree, the value of the current belief is estimated by propagating the value estimates in the lower branches up to their ancestors, all the way to the root of the tree [1]. Once the planning phase ends, the execution phase proceeds by executing the best action (i.e., the action with the largest estimated value) computed for the most recent belief [50, 1].

One of the challenges associated with online approaches is that as the tree grows deeper (i.e., looking farther ahead into the future), the number of possible nodes and branches in the lower levels of the tree grows exponentially ($(|\Omega| \cdot |A|)^{tree\ height}$ for each belief node); thus the search time associated with a full tree increases exponentially as the search investigates deeper levels of the tree to obtain more accurate values for the different actions [1]. This implies that there exists a trade-off between planning time and the accuracy of the calculated value/estimated policy. To address this issue, most online algorithms focus on reducing the number of reachable beliefs that are explored in the tree. These approaches mainly differ in the choice of the next node to expand and the employed expansion mechanism. Existing approaches can be divided into three major categories:

• Branch-and-Bound Pruning
• Monte Carlo Sampling
• Heuristic Search

Branch-and-Bound Pruning: This technique prevents the expansion of unnecessary lower nodes in the constructed belief tree that can lead to sub-optimal solutions by pruning those nodes from the tree. To achieve this, a lower bound and an upper bound are maintained on the value $Q^*(b, a)$ of each action $a$, for every belief node $b$ in the belief tree. These bounds are computed by first evaluating the lower and upper bounds for the leaf (fringe) nodes and then propagating them to the parent nodes according to the following equations:

$L_T(b) = \begin{cases} L(b) & \text{if } b \in F(Tree) \\ \max_{a \in A} L_T(b, a) & \text{otherwise} \end{cases}$    (3.13)

$L_T(b, a) = R(b, a) + \gamma \sum_{o \in \Omega} p(o \mid b, a)\, L_T(b' \mid b, a, o)$    (3.14)

$U_T(b) = \begin{cases} U(b) & \text{if } b \in F(Tree) \\ \max_{a \in A} U_T(b, a) & \text{otherwise} \end{cases}$    (3.15)

$U_T(b, a) = R(b, a) + \gamma \sum_{o \in \Omega} p(o \mid b, a)\, U_T(b' \mid b, a, o)$    (3.16)

where $F(Tree)$ is the set of fringe nodes in the tree, $U_T(b)$ and $L_T(b)$ represent the upper and lower bounds on $V^*(b)$ for belief $b$ in the tree, respectively, $U_T(b, a)$ and $L_T(b, a)$ represent the bounds on $Q(b, a)$, and $U(b)$ and $L(b)$ are the bounds for the fringe nodes, which are typically computed offline. Offline techniques such as MDP, QMDP, and FIB can be employed for calculating the bounds of the fringe nodes in the tree. Based on the calculated bounds, if a given action $a_i \in A$ in a belief $b$ has an upper bound $U_T(b, a_i)$ that is lower than the lower bound $L_T(b, a_j)$ associated with another action $a_j$, then $a_j$ is guaranteed to have a value $Q^*(b, a_j) \ge Q^*(b, a_i)$. This implies that $a_i$ is sub-optimal in belief $b$ and should be pruned from the tree.
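As a concrete illustration of how offline bounds feed this pruning rule, the sketch below computes QMDP $\alpha$-vectors as an upper bound for fringe nodes and applies the bound comparison to discard dominated actions; the array layouts, the fixed iteration count, and the function names are assumptions made for illustration, not the thesis implementation.

import numpy as np

def qmdp_alpha_vectors(T, R, gamma, iters=100):
    """QMDP upper bound (Eqs. 3.10-3.11): one alpha-vector per action.

    T: (S, A, S) transitions, R: (S, A) rewards. Returns Q of shape (S, A),
    where alpha_a(s) = Q[s, a].
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):                        # MDP value iteration
        Q = R + gamma * np.einsum("sap,p->sa", T, V)
        V = Q.max(axis=1)
    return Q

def upper_bound(b, Q):
    """U(b, a) at a fringe node for every action: sum_s b(s) * alpha_a(s)."""
    return b @ Q                                  # shape (A,)

def prune_dominated_actions(U_ba, L_ba):
    """Keep action a_i only if its upper bound is not below the best lower bound
    (U_ba, L_ba: numpy arrays of per-action upper/lower bounds)."""
    best_lower = L_ba.max()
    return [a for a, u in enumerate(U_ba) if u >= best_lower]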
Real-Time Belief Space Search (RTBSS) is an algorithm that uses this branch-and-bound pruning technique [77, 78].

Monte Carlo Sampling: As mentioned before, expanding the belief tree based on all possible observations is infeasible except at shallow depths, since the number of nodes at deeper levels grows exponentially. An alternative approach is to sample a subset of observations at each expansion and only consider the belief nodes that can be reached based on the sampled observations [79]. This helps reduce the branching factor of the search and allows searching deeper levels of the tree [50]. This strategy is employed by Monte Carlo sampling-based algorithms. The McAllester and Singh algorithm, Rollout, Determinized Sparse Partially Observable Trees (DESPOT), POMCP, and Sparse Sampling (SS) are examples of algorithms that use Monte Carlo sampling to reduce the size of the belief tree and speed up the search.

The McAllester and Singh algorithm employs a depth-limited search (fixed horizon D) where, instead of exploring all possible observations for each individual action, only C observations are sampled from a generative model [80]. The Rollout algorithm, in contrast, starts the search from an initial policy that is computed offline. At each time-step, the algorithm estimates the future expected value associated with each action, assuming that the initial policy is followed in subsequent steps, and finally executes the action with the highest estimated value. The trajectories in the belief tree are generated by first taking the action to be evaluated and then following the initial policy in the subsequent beliefs, with observations sampled from a generative model [81]. The DESPOT algorithm constructs a belief tree that contains all the action branches, but only the observation branches encountered under the sampled scenarios are included in the tree. A Determinized Sparse Partially Observable Tree (DESPOT) is constructively defined by applying a deterministic simulative model to all possible action sequences under K sampled scenarios, where a scenario is an abstract simulation trajectory with some starting state $s_0$ [82]. The POMCP algorithm combines optimistic action exploration and random observation sampling to perform Monte Carlo search in a belief tree [79]. Similar to DESPOT, the Sparse Sampling (SS) algorithm introduces sparse approximations to the belief tree and samples a constant number C of observation branches for each action [83].

Heuristic Search: An alternative to branch-and-bound pruning and Monte Carlo sampling is the family of heuristic search algorithms. These algorithms try to focus the search on the most relevant reachable beliefs by using heuristics to select the best belief node(s) to expand. The most relevant reachable beliefs are those belief nodes that allow the search algorithm to make good decisions as quickly as possible [50]. Satia and Lave [84], BI-POMDP [85], and AEMS [86] are examples of algorithms that employ heuristic search; they mainly differ in the specific heuristic used to choose the next belief node to expand in the belief tree.

3.3 N-Step Look-Ahead: An Online, Adaptive, Context-Based Policy Estimation Algorithm

As discussed above, the challenge associated with offline algorithms is that, although they can accurately calculate optimal policies, they usually take hours to do so, and they do not account for changes in the environment.
On the other hand, most of the online algorithms explained above rely on some calculations (e.g., lower/upper bounds) being done offline, and they assume that the state-space, observation-space, and model probability distributions/functions (e.g., transition and emission probabilities) are fixed. This implies that these algorithms are not able to respond to observations that are not part of the observation-space (i.e., unknown observations). To address these challenges and account for scalability in policy estimation while achieving near real-time performance, the online, adaptive N-Step Look-Ahead algorithm is implemented and presented in this thesis.

The N-Step Look-Ahead algorithm is an online algorithm that estimates the best online policy for a belief within a sense-plan-act cycle given the real-time constraints posed by the environment. This algorithm, like other online algorithms, constructs a belief tree of future beliefs from the current belief, recursively explores (left to right, bottom to top) the resulting beliefs to a depth N, calculates the expected belief-action values for the various paths in the tree, propagates the values from the fringe nodes to their ancestors, and finally returns the estimated long-term reward for the current belief and action. To address the exponential growth of nodes in the tree, $O((|A| \cdot |\Omega|)^N)$ (i.e., the performance-time trade-off), this algorithm employs heuristic search with distance-based clustering, an unsupervised learning technique, to prune sub-optimal branches (i.e., branches with low estimated values and/or low probabilities) and limit the number of belief nodes to be expanded and explored. The main assumptions associated with this algorithm are as follows: (1) given that no a-priori information (e.g., offline calculations or sampled scenarios) is available, heuristics can be defined by finding (probabilistic) dependencies between the model dynamics and the estimated values; (2) distance-based approximations can be employed because the state-space is compact. Although the main purpose of implementing this algorithm is policy estimation in Expandable-Compact POMDP models, the last section in this chapter provides experimental results for policy estimation using this algorithm in a grid-based POMDP model (employed for path planning and navigation). In that experiment, the approximations that require a compact state-space are discarded and parallel processing is employed to reduce the policy estimation time.

To implement this algorithm, various experiments using statistical data analysis were conducted to identify efficient pruning and heuristic techniques. For instance, a full belief tree is initially constructed and explored to obtain the average execution time, value, and probability for each belief-action node.
The key findings from the conducted experiments and data analysis (e.g., average time and value for beliefs with similar probability distributions, the effect of a high estimated value with low probability, and the average value associated with beliefs that assign a high probability to a failure state) are summarized as follows:

• The probabilities of the different branches in the belief tree depend on the emission probabilities
• The probability of achieving a reward depends on the probability of transitioning to a certain belief after an action is performed
• Branches populated with beliefs that assign high probabilities to failure states lead to low estimated values
• Rewards/values with low probabilities do not impact the overall estimated value
• Differences in estimated values for beliefs with similar probability distributions are negligible (e.g., $|V(b = [0.1, 0.25, 0.55, 0.1]) - V(b = [0.09, 0.23, 0.57, 0.11])| \le \epsilon$)

Based on these key findings, certain approximations are introduced to address the performance-time trade-offs in online policy estimation, summarized as follows:

• Expand belief nodes based on reachable observations only (heuristic)
• Expand non-terminal belief nodes only (heuristic)
• Expand a belief node only if the value for similar belief(s) has not been computed previously (clustering and memory)

In this algorithm, the heuristic search expands each belief node only based on a set of reachable observations, $\Omega^{+\prime} \subset \Omega^+$, associated with the belief-action pair. Reachable observations, a subset of $\Omega^+$, are defined as observations with high probabilities for a belief-action pair. The probability of an observation $o \in \Omega^+$ given belief $b_t$ at time $t$ and action $a \in A$ can be calculated as shown in Equation 3.17:

$p(o \mid b_t, a) = \sum_{s \in S^+} b_t(s)\, p(o \mid s, a)$    (3.17)

By considering only the reachable observations for a belief-action pair, the belief expands to the most probable belief nodes only, while the nodes with low probabilities are removed. Exploring only reachable observations improves the runtime of the algorithm but reduces the accuracy of the estimated long-term reward, resulting in a trade-off between the accuracy and the runtime of the algorithm. One way of identifying reachable observations, $\Omega^{+\prime} \subset \Omega^+$, in a problem is by considering the density (or sparsity) of the emission matrix. A large number of small emission probabilities (e.g., less than $\epsilon = 0.01$) implies that only a small number of observations can be received in a state after performing an action. However, if the POMDP model is relatively large (e.g., in the number of states) with a dense emission matrix, a larger set of reachable observations is required for better accuracy. In addition to guiding the search using reachable observations, the heuristic search also prunes the terminal (e.g., failure) nodes in the look-ahead search. In other words, the heuristic search expands a belief if and only if the current immediate reward associated with the belief node, $\sum_{s \in S^+} b_t(s) R(s)$, is within a pre-defined value range (alternatively, $b_t(s_{failure}) \le pr_{failure}$), which implies that the algorithm only expands non-terminal belief nodes and prunes branches resulting from terminal nodes (i.e., eliminating sub-optimal solutions).

In addition, an online distance-based clustering technique is implemented and integrated with the N-Step Look-Ahead algorithm to identify and cluster belief nodes based on previously seen and explored beliefs, using the distance between a belief node and the centroids (means) of the available clusters.
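The three approximations above can be illustrated with a few helper functions; this is a minimal sketch under assumed array layouts (O[s, a, o]) and illustrative thresholds, not the thesis code.

import numpy as np

def reachable_observations(b, a, O, threshold=0.05):
    """Omega+' for a belief-action pair: observations with p(o | b, a) above a threshold."""
    p_obs = b @ O[:, a, :]                       # Eq. 3.17, shape (W,)
    return np.flatnonzero(p_obs > threshold), p_obs

def is_terminal(b, failure_states, pr_failure=0.8):
    """Prune a node whose belief mass on failure states exceeds pr_failure
    (the immediate-reward-range test in the text is an equivalent alternative)."""
    return b[failure_states].sum() > pr_failure

def find_or_add_cluster(b, centroids, radius=0.1):
    """Assign b to an existing belief cluster (Euclidean distance to a centroid
    within radius) or create a new cluster; returns (cluster index, centroids)."""
    if centroids:
        dists = [np.linalg.norm(b - c) for c in centroids]
        idx = int(np.argmin(dists))
        if dists[idx] <= radius:
            return idx, centroids
    centroids.append(b.copy())
    return len(centroids) - 1, centroids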
Basically, the online clustering (and the memory employed to store the available clusters and their estimated values) avoids the initialization and expansion of redundant belief nodes in order to achieve near real-time performance.

As shown in the pseudo-code provided in Algorithm 1, the look-ahead function, which is invoked at every time-step within a sense-plan-act cycle, receives a belief $b_t$, the number of steps N, the discount factor $\gamma$, and two initially empty stack variables, bStack and vStack, for storing the belief clusters and the calculated values, respectively. As the algorithm evaluates various beliefs in the tree, clusters of previously seen beliefs are created and their centroids are stored in bStack. Meanwhile, the estimated long-term rewards for each centroid at the different levels $l \le N$ of the tree are calculated and stored in vStack. During the tree expansion and exploration process, estimated belief nodes in the tree are evaluated with respect to the existing clusters using the Euclidean distance. If the Euclidean distance between any cluster centroid and the belief is within a pre-defined radius (e.g., 0.1), the belief is added to that cluster and the centroid is updated. The algorithm then searches for an estimated long-term reward (value) given the current level in the belief tree. If the long-term reward exists in vStack, the stored value is used directly for calculating the long-term reward for $b_t$, and that belief node is not expanded (i.e., no further calculations are needed and approximate values are employed instead of exact values). Otherwise, the belief node is expanded so that the long-term reward associated with that belief can be calculated and stored in vStack. On the other hand, if the belief cannot be assigned to any of the existing clusters, a new cluster is added to the belief stack and the belief node is expanded. The long-term rewards of the expanded nodes and the belief are then added to vStack and bStack, respectively.

Algorithm 1: Pseudocode for the N-Step Look-Ahead algorithm

function NSTEPLOOKAHEAD(b_t, V_T, N, bStack, vStack, gamma)
    depth <- N
    if N = 1 then
        if b_t in bStack and V(b_t) in vStack then
            V_T <- V_T + V(b_t)
        if V(b_t) not in vStack then
            prevV_T <- V_T
            for all a in A do
                calculate Omega+'(b_t, a)
                for all o in Omega+'(b_t, a) do
                    b_{t+1} <- UpdateBelief(b_t, a, o, T+, O+)
                    V_T <- V_T + gamma^depth * (sum_{s in S+} b_t(s) pr(o|s,a)) * (sum_{s in S+} R+(s) b_{t+1}(s))
            if b_t not in bStack then insert b_t into bStack
            insert V_T - prevV_T into vStack
        return V_T
    if N > 1 then
        if b_t in bStack and V(b_t) in vStack then
            V_T <- V_T + V(b_t)
        if V(b_t) not in vStack then
            prevV_T <- V_T
            for all a in A do
                calculate Omega+'(b_t, a)
                for all o in Omega+'(b_t, a) do
                    b_{t+1} <- UpdateBelief(b_t, a, o, T+, O+)
                    if b_{t+1} is non-terminal then
                        V_T <- V_T + gamma * (sum_{s in S+} b_t(s) pr(o|s,a)) * (sum_{s in S+} R+(s) b_{t+1}(s)) + NSTEPLOOKAHEAD(b_{t+1}, V_T, N-1, bStack, vStack, gamma)
                    else
                        V_T <- V_T + gamma^{depth-N} * (sum_{s in S+} R+(s) b_{t+1}(s)) * (sum_{s in S+} b_t(s) pr(o|s,a))
            if b_t not in bStack then insert b_t into bStack
            insert V_T - prevV_T into vStack
        return V_T

The N-Step Look-Ahead algorithm, in a loop over all actions, initially expands the current belief, $b_t$, by considering an action $a \in A$ and all reachable observations $o \in \Omega^{+\prime}$ associated with that action and belief, which results in estimated beliefs at the next level of the tree.
Before considering other actions for $b_t$, the look-ahead search is invoked for all the belief nodes at a specific level in the tree, and the recursion continues along a branch until the depth reaches N. At depth N, the values for all considered paths (either calculated or retrieved from vStack) are available, so the long-term reward for that branch (and subsequent branches) can be calculated. Finally, when a branch is completely traversed, the same procedure is applied to evaluate the other actions, until the long-term rewards are available for all possible sub-trees and branches. The best online policy obtained for a belief using this algorithm is the action that, if performed, considering all possible observations, updates the next belief such that the known states with high rewards receive higher belief probabilities and the failure states with penalties receive lower probabilities. Based on this logic, the algorithm finds optimal policies for beliefs in which the belief distribution has high variance (i.e., only some of the states are assigned high belief probabilities) and near-optimal policies (e.g., the safest possible action that avoids failure) for normally distributed beliefs, where limited information about the current state can be inferred from the belief distribution [46]. Similar to the Bellman equation, the algorithm accounts for early rewards using a discount factor, $0 < \gamma \le 1$. Furthermore, in calculating the reward for an action performed in a given belief, the algorithm considers both the amount of the reward and the probability, $p(o \mid b, a)$, of achieving it. This accounts for transitions that would contribute a large reward to the long-term value estimate even though the probability of achieving that reward is so low that it may never be obtained after the action is performed. Finally, the overall long-term sum of rewards for $b_t$ and action $a$ can be calculated as shown in Equation 3.18, in which $l+1$ and $\Omega^{+\prime}$ represent the next level in the tree ($l+1 \le N$) and the reachable set of observations, respectively:

$V_N(b_t, a) = \sum_{o \in \Omega^{+\prime}} p(o \mid b_t, a)\, \gamma \sum_{a' \in A} V^{l+1}_N(b^{a,o}_{t+1}, a')$    (3.18)

During the search, the total number of expanded belief nodes decreases as the algorithm explores the leftmost branches and gradually moves to the rightmost ones, since more belief clusters are created and more long-term rewards (estimated values) are calculated and stored. Figure 3.3 shows how the heuristic search combined with distance-based clustering prunes a full tree (left) and reduces the number of beliefs to be traversed.

Figure 3.3: Pruning the belief tree using heuristics and the clustering technique. Pruned branches are identified by red cross marks.

Since the algorithm functions within a sense-plan-act cycle, a policy can be obtained at any time given the limited time available for policy calculation. Thus, N should be large enough to explore reasonably deep into the belief tree, and also small enough that the execution time of the algorithm does not exceed a predefined time limit identified based on the sampling rate, total delay time, and data pre-processing time. Depending on the available time, model size, and time complexity of the algorithm, N can be identified empirically or experimentally. Figure 3.4 provides a comparison of the algorithm's average runtime for solving different POMDP models using different N values, given that different methods are applied for guiding the look-ahead search.
As shown in this figure, the average runtime of the algorithm is significantly reduced (from exponential to almost linear) after the heuristic search and distance-based clustering techniques are applied. By analyzing the runtime of the algorithm using different N values for various POMDP models, one can develop a lookup table for selecting the best N for a given POMDP model. The main advantage of this algorithm is its ability to employ adaptive, context-based heuristics and to generate online clusters that address the performance/execution-time trade-offs in the look-ahead search without requiring any additional information from offline computations, which makes it readily applicable to different POMDP models. Moreover, since the algorithm is designed with a flexible number of look-ahead steps, it can adapt to the various real-time constraints posed by the environment (e.g., different sampling rates).

Figure 3.4: Average runtime of N-Step Look-Ahead (green) compared to look-ahead search with different filters for various POMDP models. Runtimes larger than 300 seconds are replaced with 300 and the plots are smoothed for better graphical representation. (Hardware specs: Core i5 CPU @ 1.7 GHz, 6 GB RAM, Ubuntu 16.04.)

The N-Step Look-Ahead algorithm calculates belief-action values for all $a \in A$ within a loop. To further reduce the computation time (the value estimation time for a given belief), parallel processing is implemented on top of this algorithm, so that the values associated with the belief and the existing actions can be computed in parallel. To enable parallel processing, the "multiprocessing" package (including the Pool and map functionality) in Python is employed. Depending on the computation power and the number of cores within a CPU, the policy estimation time using parallel processing can be reduced to approximately $T_{Total}/|A|$. The pseudo-code provided in Algorithm 2 shows how parallel processing is employed in N-Step Look-Ahead.

Algorithm 2: Pseudocode for parallel processing in Python

function MULTIRUNWRAPPER(args)
    return NStepLookAhead(*args)

function PARALLELPROCESSING(b_t, bStack, vStack, A, gamma)
    pool <- multiprocessing.Pool(#processes)
    initialize N
    inputs <- {(b_t, N, bStack, vStack, a_k, gamma) for all k in {1, ..., |A|}}
    outputs <- pool.map(MULTIRUNWRAPPER, inputs)
    policy <- argmax(outputs)
    return policy

3.4 Policy Estimation for Lane-Keeping and Lane-Changing

To evaluate the performance of the N-Step Look-Ahead algorithm in finding optimal, safe policies, the algorithm was employed for estimating policies for the developed lane-keeping and lane-changing POMDP models while accounting for policies associated with possible new states and underlying dynamics. The experiment and integration process are implemented on a Core i5 Lenovo laptop (CPU @ 1.7 GHz, 6 GB RAM) running the Ubuntu 16.04 operating system. The sampling rate of the simulation in this experiment is 0.1 s (10 observations per second), and N = 4 look-ahead steps are employed for finding online policies, where N is selected experimentally with respect to the size and complexity of the models, the sampling rate of the simulation, and the available computation power.
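A self-contained, runnable counterpart to the parallel dispatch of Algorithm 2 might look like the following; the stub value function and its signature are placeholders standing in for the actual N-Step Look-Ahead evaluation.

import multiprocessing as mp
import numpy as np

def n_step_look_ahead(args):
    """Placeholder for the per-action look-ahead evaluation; returns V_N(b_t, a).

    In the real algorithm this would expand the belief tree to depth N for the
    given action; a dummy value keeps the sketch self-contained and runnable.
    """
    b_t, N, a, gamma = args
    # Dummy stand-in: discounted belief "peakedness" minus a small per-action cost.
    return float(gamma ** N * np.max(b_t) - 0.01 * a)

def parallel_policy(b_t, actions, N=4, gamma=0.95, processes=None):
    """Evaluate all actions in parallel and return the one with the largest value."""
    inputs = [(b_t, N, a, gamma) for a in actions]
    with mp.Pool(processes=processes) as pool:
        values = pool.map(n_step_look_ahead, inputs)
    return actions[int(np.argmax(values))]

if __name__ == "__main__":
    belief = np.array([0.1, 0.25, 0.55, 0.1])
    print(parallel_policy(belief, actions=[0, 1, 2]))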
For comparison purposes, and in order to verify the optimality and safety of the estimated policies in the safety-critical application of AVs, the decisions resulting from the POMDP models and the N-Step Look-Ahead algorithm are compared to a rule-based decision-making technique implemented using decision-trees [33, 87] (parameters fine-tuned in simulation) and to an end-to-end (i.e., directly mapping observations to decisions) planning and decision-making technique implemented using deep learning [88] (a neural network model with 8 fully connected hidden layers and dropout = 0.1 to avoid over-fitting; accuracy of approximately 0.96). The results of this experiment, including decision-making using the POMDP models and N-Step Look-Ahead, end-to-end planning, and rule-based decision-making, are presented in Figure 3.5.

Figure 3.5: Evaluating and comparing decision-making and planning for the lane-keeping and lane-changing models using the POMDP models and N-Step Look-Ahead, end-to-end planning, and rule-based decision-making. The x-axis identifies the time-steps in the simulation and the y-axis represents the decisions ($a_0$: maintain status quo, $a_1$: speed up, $a_2$: slow down, and $a_3$: change lanes). The topmost section presents the belief probability associated with the $s_0$: Safe to change lanes state $\in S_{LC}$.

The high peaks in the plot are associated with the $a_3$: change lanes action, and the changes in the belief of state $s_0$: Safe to change lanes are also provided at the top of the figure. As shown in this figure, the lane-changing model initiates the "change lanes" action when the associated state is assigned a high probability (approximately 0.92 and 0.83). This implies that the POMDP ensures that the high-level plan of changing lanes persists for a period (i.e., the observations persist), which means that the targeted gap in the adjacent lane remains open long enough for the AV to safely change lanes. As shown in Figure 3.5, the number of high peaks produced by the neural network model (green) is relatively high compared to the number of high peaks obtained from the POMDP and N-Step Look-Ahead algorithm. This is mainly due to unknown, previously unseen observations and the limited robustness of the neural network model to them. In other words, the neural network model initiates lane changing when a single observation obtained from the simulation indicates that there is a gap in the adjacent lane. Moreover, this decision is also made mistakenly when the neural network model encounters observations that are not addressed within the training dataset (i.e., blind spots [89]), leading to failures and collisions. To further evaluate the performance of the neural network model compared to online decision-making using the POMDP models and N-Step Look-Ahead, the mean probability of transitioning to a failure state is calculated for these models from experiments with 50 different lane-changing scenarios. The calculated failure probability for the POMDP model is 0.04, which is strictly less than the failure probability of the neural network model, 0.19 (despite the latter's good prediction accuracy). The pink line in this figure presents the decisions obtained from employing the rule-based technique in the simulation. As shown in the figure, the rule-based technique cannot respond to observations/situations that are not addressed in the decision-tree, resulting in "no decision", shown by red arrows in the figure.
In contrast with the neural network model and the rule-based technique, policy estimation using the N-Step Look-Ahead algorithm and the Expandable-Compact POMDP models (lane-keeping and lane-changing) leads to safe maneuvers when encountering unknown observations. Figure 3.6 provides the belief probabilities of the expanded lane-keeping model and the resulting policy when a new state $H_0$ is added to the model and the dynamics are expanded using the most expected outcome heuristic to account for unknown, new observations.

Figure 3.6: Expanded belief of the lane-keeping model (left) and normalized belief-action values estimated by N-Step Look-Ahead (right).

As shown in this figure, the N-Step Look-Ahead algorithm chooses actions such that the belief of the safe state $s_2$ (green bar) increases over time until a known observation is obtained and decision-making recovers. In other words, the plans identified based on the expanded model and the N-Step Look-Ahead algorithm guide the AV towards a steady and safe state by maintaining the status quo until a known observation is obtained.

3.4.1 Offline versus Online Policies for the Safety-Critical Application of AVs

To further evaluate the performance of the N-Step Look-Ahead algorithm in obtaining optimal policies, the online policies are compared to offline policies computed using a benchmark policy estimation technique implemented with the Q-learning algorithm. Q-learning is an algorithm for learning to act optimally in Markovian domains; it works by successively improving the evaluations of specific actions at specific states [90]. In general, Q-learning is a model-free algorithm that computes optimal policies from trial and error in the environment. Q-learning estimates the "utility" values of executing an action from a state by continuously updating the Q-values [91]. The Q-value associated with performing an action $a$ at a specific state $s$ is updated during each "episode" (i.e., a specific path, a series of states and actions, within a scenario) of Q-learning when action $a$ is selected to be performed at state $s$, as shown in Equation 3.19:

$Q(s, a)_{t+1} = Q(s, a)_t + \alpha \Big[ r + \gamma \max_{a' \in A} Q(s', a')_t - Q(s, a)_t \Big]$    (3.19)

where $Q(s, a)_t$ represents the Q-value associated with state $s$ and action $a$, $Q(s', a')_t$ is the Q-value associated with performing $a'$ at $s'$, where $s'$ is the next state that results from transitioning from $s$ after performing $a$, $r$ denotes the immediate reward of performing $a$ at $s$ ($R(s, a)$), and $0 < \alpha \le 1$ is the learning rate.

$b_t(s) = \begin{cases} \dfrac{(1-\beta_h)\, b_{t-1}(s)}{(1-\beta_h)\, b_{t-1}(s) + \alpha_h\,(1 - b_{t-1}(s))} & obs = 1 \\[1.5ex] \dfrac{\beta_h\, b_{t-1}(s)}{\beta_h\, b_{t-1}(s) + (1-\alpha_h)(1 - b_{t-1}(s))} & obs = 0 \end{cases}$    (3.23)

where $b_t(s)$ is the belief probability of state $s$ at time-step $t$, $obs = 1$ indicates that a location of interest is detected, and $obs = 0$ indicates no detection of a location of interest. The reward function accounts for coverage, time, exploitation of regions with a high probability of containing the target, and the cost associated with data transmission (i.e., action $a_7$). To address the search time and the coverage associated with cells with higher belief probabilities (i.e., prioritizing regions with high target probability; in this mission, locations/cells associated with flow channels), the reward function assigns a high reward value to cells with a high initial belief probability (obtained from the belief-map at time-step $t = 0$) and a small reward to the other cells in the search grid map, as shown in Equation 3.24.
$R(\text{Coverage}) = \begin{cases} b_t(s) \cdot R_1 & s \in C_t \cap Reg_{high} \quad (\text{e.g., } R_1 = +10) \\ b_t(s) \cdot R_2 & s \in Reg_{all \setminus \{C_t \cap Reg_{high}\}} \quad (\text{e.g., } R_2 = +1) \end{cases}$    (3.24)

where $C_t$ denotes the cells covered by the camera at time-step $t$, $Reg_{high}$ denotes the regions with high probability, and $Reg_{all \setminus \{C_t \cap Reg_{high}\}}$ denotes the other regions. In addition, to make the helicopter exploit regions after a successful target detection based on the images provided by the camera, a penalty term $P_1$ is specified to discourage executing actions that change the region of the helicopter. The overall penalty value, however, depends on the number of successive target detections (i.e., the belief probability of the state associated with the location of the target on the grid-map) and on the location of the helicopter relative to the cell where a target is believed to exist. This penalty value is calculated as shown in Equation 3.25:

$P_1(dist) = b_t(s = s_{targ})^2 \cdot \dfrac{\operatorname{sum}(\lVert (x_{heli}, y_{heli}) - (x_{targ}, y_{targ}) \rVert) + z_{heli}}{\eta}$    (3.25)

where $b_t(s = s_{targ})$ is the belief probability of the state from which the target is detected and $\eta$ is a normalization factor. Finally, the cost associated with data transmission ($a_7$) is defined in Equation 3.26:

$Cost(\text{transmission}) = b_t(s = s_{targ}) \cdot R_3 + (1 - b_t(s = s_{targ})) \cdot P_2$    (3.26)

where the values $R_3 > 0$ and $P_2 < 0$ can be selected to set the minimum number of successive target detections from a specific location before initiating action $a_7$.

After formulating the POMDP model and modeling the camera and environment, the search activity begins in the simulated environment (Python). The search policy (action) is estimated using the N-Step Look-Ahead algorithm at every simulation time-step. In this algorithm, at every simulation time-step, the expected long-term sum of rewards over a finite time-horizon (specified by N) is calculated for the most recent belief (the updated belief-map) and all possible actions and observations, and finally the action with the highest reward value is selected as the search policy at that time-step. The overall search time is affected by the runtime of the policy estimation algorithm, which depends on the sizes of the state-space, action-space, and observation-space and on the specified look-ahead steps (e.g., N). To account for limited computation power and reduce the overall policy estimation time, the N-Step Look-Ahead algorithm prunes the branches (estimated future beliefs) that are not reachable from the current state, which implies that only beliefs associated with reachable observations (observations with high probabilities) and reachable states/cells (locations of the helicopter in the search area) are evaluated during policy estimation. To evaluate the overall search time in the simulated search, different look-ahead steps, policy estimation with and without accounting for reachable observations, and parallel processing on a Lenovo laptop (Ubuntu 16.04, Core i5 CPU @ 1.7 GHz, 6 GB RAM) are employed and the overall runtime of the algorithm is estimated. The results obtained from this evaluation are shown in Figure 3.12.

Figure 3.12: Results of evaluating the N-Step Look-Ahead runtime using different N values, parallel processing, and reachable observations.

In addition to the policy estimation time, the total search time also depends on the total number of actions/decisions required for surveying the area while performing active search, where the total number of actions/decisions depends on the number of look-ahead steps employed during policy estimation.
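The per-cell belief-map update of Equation 3.23 amounts to a Bayes update of a binary detection model; the sketch below applies it over a grid, assuming $\alpha_h$ is the false-alarm probability and $1-\beta_h$ the detection probability at the current altitude (these interpretations and the data layout are assumptions made for illustration).

import numpy as np

def update_belief_map(belief_map, covered_cells, detections, alpha_h, beta_h):
    """Bayes update of Eq. 3.23 for every grid cell covered by the camera.

    belief_map:    2-D array of per-cell target probabilities b_{t-1}(s)
    covered_cells: list of (row, col) indices currently in the camera footprint
    detections:    dict mapping (row, col) -> 1 (pattern of interest seen) or 0
    alpha_h:       assumed false-alarm probability at altitude h
    beta_h:        assumed missed-detection probability at altitude h
    """
    b = belief_map.copy()
    for cell in covered_cells:
        prior = b[cell]
        if detections.get(cell, 0) == 1:
            num = (1.0 - beta_h) * prior
            den = num + alpha_h * (1.0 - prior)
        else:
            num = beta_h * prior
            den = num + (1.0 - alpha_h) * (1.0 - prior)
        b[cell] = num / den
    return b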
To identify the most appropriate N value for the N-Step Look-Ahead algorithm, various look-ahead values are evaluated in a series of experiments; the results are provided in figure 3.13.

Figure 3.13: Total number of search actions resulting from following the search policies estimated using the N-Step Look-Ahead algorithm with different N values

Based on these results, N = 3 is selected to perform online decision-making while adhering to the constraints posed by the system (available flight time and computation power).

Given the reward function and the decisions from the N-Step Look-Ahead algorithm, the helicopter takes off from the initial location and climbs to h = 3 meters, where the decision instructs the helicopter to increase its altitude to maximize cell coverage. During this search, the Mars helicopter explores the search area while flying 5 meters above the Mars surface (maximizing coverage) until a pattern of interest is identified, at which point the helicopter decreases its altitude to exploit that location so that high-quality images are captured. Figure 3.14 (left) provides the map of the area reconstructed from the images captured by the Mars helicopter, where 3 locations (identified by red squares) are associated with regions where patterns of interest were observed (initially identified on the POA map). The right figure shows the heat-map associated with this search, where areas with darker colors (higher density) indicate locations from which multiple images are available. As shown in the heat-map, the 3 regions with patterns of interest have the highest density, resulting from exploiting these areas during the search.

Figure 3.14: Reconstructed map of the area and the heat-map associated with exploration and exploitation

As shown by the results obtained from these experiments, the N-Step Look-Ahead algorithm can be readily employed for path planning and navigation in grid-based POMDP models and can adapt to different environment assumptions and system constraints (e.g., time limits) by employing a flexible number of look-ahead steps (N) and reachable observations, where N and reachability can be identified experimentally. Moreover, to estimate search policies, this algorithm does not rely on prior offline computations, which also makes it easily applicable and adaptable to different environments.

3.6 Summary and Conclusion

This chapter focuses on the problem of policy estimation in POMDP models and provides a detailed discussion and overview of policy estimation theory and existing algorithms. Policy estimation for POMDPs is a challenging problem, as exact policy computation for POMDPs is PSPACE-complete. As discussed in the first and second sections of this chapter, existing policy estimation techniques fall into two distinct groups: 1) offline algorithms and 2) online algorithms. Offline algorithms provide a policy before model execution and deployment in the environment by evaluating all possible belief-action pairs and estimating belief-action values iteratively (e.g., value iteration). Although offline techniques can achieve high accuracy, they are not applicable to many POMDP problems, as they typically take hours or days to estimate policies. Moreover, since the policy is obtained before model execution, it does not account for changes in the environment (e.g., changes in the problem objective or environment dynamics).
The results of offline algorithms are often employed by online techniques to define lower/upper bounds and guide the search towards optimal solutions. Online techniques, on the other hand, circumvent the challenges associated with offline policy estimation by estimating policies locally, once the model is deployed in the environment, and only for the most recent belief obtained from system-environment interactions. These techniques estimate the policy by constructing a belief-tree rooted at the most recent belief and exploring the possible paths generated from that belief and the available actions and observations. Finally, the action with the maximum estimated value is selected as the policy for the current belief (i.e., a greedy approach). The accuracy of the estimated belief-action value depends on the depth of the constructed belief-tree: as the tree grows deeper (i.e., looking farther ahead into the future), the accuracy of the estimated value increases. At the same time, the number of branches and nodes in the lower levels of the tree grows exponentially, leading to exponential search time. This implies a performance-time trade-off in online algorithms that must be addressed to make these algorithms scalable and to adhere to the time and computational constraints posed by the environment and the system, respectively, in order to achieve near real-time performance. Various techniques use approximations to address this trade-off in belief-tree construction and exploration. They can be categorized as follows: 1) branch-and-bound pruning, 2) Monte Carlo sampling, and 3) heuristic search. All of these techniques use approximations (mainly obtained from prior offline calculations) to guide the search towards valuable branches and avoid sub-optimal solutions. The main advantage of online algorithms over offline techniques is their low computation time and their ability to adapt (to some changes in the environment). However, the majority of these algorithms are configured for motion planning, path planning, and navigation using POMDPs, and the best solvers rely on samples and offline calculations available a-priori.

To address these challenges and account for policy estimation when the model goes through changes, an online, adaptive policy estimation algorithm, N-Step Look-Ahead, is implemented and presented in this thesis (section 3). Similar to existing online algorithms, this algorithm estimates the value associated with the most recent belief (obtained from system-environment interactions during model execution) by constructing a belief-tree and recursively traversing different paths to calculate belief-action values. To address the performance-time trade-off, various experiments are conducted and data analysis techniques are employed to introduce approximations that guide the search. Based on the key findings from this analysis, model- and data-driven heuristics (i.e., expanding non-terminal belief nodes based only on reachable observations) and distance-based clustering (using Euclidean distance) are employed within the algorithm to prune sub-optimal solutions (i.e., branches with low probabilities and low rewards) and avoid redundant calculations, reducing the exponential search time to almost linear time complexity and achieving near real-time performance (e.g., fractions of a second).
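The following is a minimal sketch (not the thesis implementation) of an N-step look-ahead recursion that expands only "reachable" observations, i.e., branches whose predicted observation probability exceeds a threshold, as described above. It assumes the POMDP dynamics are stored as per-action NumPy matrices and omits the distance-based clustering of beliefs; all names and parameter values are illustrative.

import numpy as np

def belief_update(b, a, o, T, O):
    """b: belief vector; T[a][s, s']: transition probs; O[a][s', o]: emission probs.
    Returns (updated belief or None, probability of observing o)."""
    predicted = T[a].T @ b                    # P(s' | b, a)
    unnorm = O[a][:, o] * predicted           # unnormalized P(s' | b, a, o)
    p_obs = float(unnorm.sum())               # P(o | b, a)
    if p_obs <= 0.0:
        return None, 0.0
    return unnorm / p_obs, p_obs

def lookahead(b, depth, T, O, R, actions, observations,
              gamma=0.95, obs_threshold=0.05):
    """Return (best value, best action) for belief b over `depth` steps,
    expanding only observations with P(o | b, a) above the threshold."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = -np.inf, None
    for a in actions:
        value = float(b @ R[a])               # expected immediate reward
        for o in observations:
            b_next, p_obs = belief_update(b, a, o, T, O)
            if p_obs < obs_threshold:         # prune unreachable branches
                continue
            future, _ = lookahead(b_next, depth - 1, T, O, R,
                                  actions, observations, gamma, obs_threshold)
            value += gamma * p_obs * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

Because the per-action loop is independent across actions, the outer loop is also the natural place to apply the parallel processing discussed next.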
Moreover, parallel processing is integrated with this algorithm so that the belief-action values for all available actions can be calculated in parallel, reducing the total estimation time to approximately T_Total/|A|, depending on the available computation power and CPU.

Section 4 of this chapter provides the results of computing online policies for the lane-keeping and lane-changing Expandable-Compact POMDP models. The estimated policies for these models are evaluated in simulation (implemented in Python VTK) with a simulation time-step of 0.1 s (10 samples per second), and their performance is compared to an end-to-end planning technique (implemented as a fully connected neural network model that directly maps observations to decisions) and a rule-based model (implemented as a decision-tree whose parameters are fine-tuned in simulation). The results obtained from this comparison indicate superior performance (efficient, robust, and safe) when policies for the POMDP models are estimated using the N-Step Look-Ahead algorithm. To verify the online policies, the estimated look-ahead values (obtained from the N-Step Look-Ahead algorithm) are compared to offline policies estimated using a customized Q-learning algorithm. The results of this comparison indicate that the N-Step Look-Ahead algorithm can estimate optimal policies by looking only a few steps ahead into the future. Furthermore, to demonstrate the applicability of the N-Step Look-Ahead algorithm, it is employed for path planning and navigation in an active search problem, where the problem is formalized as a grid-based POMDP model. The results obtained from this experimentation indicate that the N-Step Look-Ahead algorithm can be readily employed for path planning and navigation in grid-based POMDP models and can adapt to different environment assumptions and system constraints (e.g., limited time and computation power) by employing a flexible number of look-ahead steps (N) and reachable observations, where N and reachability can be identified experimentally. Moreover, it is shown that this algorithm does not rely on prior offline computations, which also makes it easily applicable and adaptable to different environments.

Chapter 4
Adaptation and Refinement in POMDPs

This chapter focuses on adaptation and refinement in POMDP models and policies with respect to new tasks, new environments, the same environment with new dynamics, and/or new information obtained from the environment. As discussed before, the majority of POMDP-related problems assume that all the information required for formalizing a decision-making and planning problem as a POMDP is available at the outset; thus, POMDP models are defined and formulated with specific, fixed state, observation, and action spaces and pre-defined model dynamics [1, 4]. In other words, the majority of POMDP models do not account for changes and new information revealed by system-environment interactions after the model is deployed in the environment, leading to partial models that are capable of handling only a subset of situations in the environment. In addition, "refinement" of POMDP models in the available literature is mainly associated with fine-tuning model dynamics (e.g., reward/penalty values and transition and emission functions/probabilities) in a simulation environment so that the behavior of the defined model and its underlying dynamics closely fits the expected behavior of the system in the environment (e.g., [20]).
There exists limited research on abstract POMDP refinement using counterexample guidance, where the main focus is on simplifying the verification of POMDP models [11]. To enable POMDP modeling for planning and decision-making problems in real-world environments and systems, efficient adaptation algorithms and refinement techniques are required so that the model can efficiently and gradually account for new information and changes observed in the environment, while remaining small enough in size and complexity to achieve near real-time performance [7]. In other words, new techniques and algorithms are needed to scale models up to become true and accurate representations of real system-environment interactions, instead of scaling problems down with strong assumptions.

4.1 Overview of Existing Adaptation Techniques

One key aspect of achieving fast and efficient adaptation is the presence of appropriate learning structures and memory units. Standard model-free techniques (e.g., deep reinforcement learning) therefore do not perform well, because they are tabula-rasa systems and hold no knowledge in their architectures that would allow for fast, targeted learning and adaptation [44]. On the other hand, model-based techniques, such as POMDPs, which hold knowledge of the system and environment interactions, allow for rapid and efficient adaptation [44].

Adapting a POMDP may require modifying the initial modeling assumptions and definitions (e.g., adding new states or actions, updating observation functions, accounting for new state variables, and revising the reward function) and updating the underlying dynamics, which in turn requires estimating a new policy that addresses the changes in the model (and environment). Available methods and approaches for (policy) adaptation can be categorized as follows:
1. Re-planning
2. Accounting for all possible changes in the model
3. Transferring knowledge and experience

The first technique (re-planning) assumes that all information required for computing a new policy is available in the model and uses this information to estimate a new policy from scratch to address the changes. Re-planning is the most basic method of adaptation, and it is not efficient, since all the experience and knowledge gained from deploying the prior model in the environment is discarded when calculating the new policy. The second approach hedges over all possible changes [58, 56]. It models all possible changes as a POMDP problem, where each parameter that may change is modeled as a "state variable" of an enlarged POMDP. The enlarged POMDP can be solved using existing POMDP planners, and the generated policy is optimal over all possible changes in the POMDP model. However, the enlarged problem can quickly grow beyond the capability of even the best POMDP solvers available today [57]. Finally, the third approach (also known as transfer learning (TL)) relies on reusing the available experience and builds upon the existing model or policy in order to adapt to a new environment or task [47]. The main idea behind transfer learning is that the experience and knowledge gained from performing a task in an environment (i.e., the source task and environment) can be employed for performing a different but similar task (or set of tasks) in a different but similar environment.
Transfer learning has been employed successfully in reinforcement learning and deep learning problems, where the features learned from an available dataset are reused to learn new information and features present in a secondary dataset (e.g., the deeper layers and neurons of a deep learning model are frozen, and only the weights of a few layers and neurons on top of the network are updated to capture new information, without modifying the weights of the deeper layers) [59, 60, 61].

Despite the major advances in transfer learning techniques and algorithms, gaps and limitations remain. The first limitation is that the majority of the research focuses on developing transfer learning techniques and algorithms for deterministic planning and decision-making problems that are typically formalized as MDPs (not POMDPs). The second limitation is that the available transfer learning techniques and algorithms for POMDPs only address policy adaptation and do not account for model adaptation. Last but not least, the main assumption in transfer learning is that some knowledge of the differences/similarities between the source and target tasks/environments is available a-priori, and this information is employed for defining a "mapping function" that identifies how much of the gained experience and knowledge can be transferred to perform efficient adaptation. The techniques and algorithms that do not assume this a-priori knowledge focus on online policy adaptation by modifying the constructed belief-tree as new information is obtained from the environment. There exists limited work (mainly associated with adapting spoken dialogue systems formalized as POMDPs) that relaxes the a-priori information availability to some extent and uses "similarity functions/metrics" to identify how new information can be accounted for in the model. Figure 4.1 presents an overview of the transfer learning framework [96].

Figure 4.1: Overview of transfer learning framework to transfer knowledge from source task and environment to target task and environment by developing a mapping function to identify efficient value and knowledge transfer for adaptation

4.2 Transfer Learning and Adaptation Literature Review and Related Work

In transfer learning, knowledge and experience gained from performing one or more source task(s) in an environment is used to learn one or more target task(s) faster than if transfer were not used. The main insight behind transfer learning is that generalization may occur not only within task(s) but also across task(s) [47]. Available transfer learning methods mainly differ in how they allow the source and target tasks (or environments) to differ. Some of the questions that distinguish transfer methods and techniques include: 1) What are the goals of the transfer method? 2) What metrics can be employed to measure the value and success of the transfer? 3) What assumptions, if any, are made regarding the similarity between tasks and environments? 4) How does a transfer method identify what information can or should be used during transfer? and 5) What information is transferred between tasks [47]? Among these questions, identifying appropriate metrics to measure the value and success of a transfer learning method is crucial.
Depending on the employed transfer learning method, the following metrics can be used to quantify the value and success of the transfer:
• Jumpstart: The initial performance of an agent in a target task and environment may be improved by transfer from a source task and environment.
• Asymptotic Performance: The final learned performance of an agent in the target task and environment may be improved via transfer.
• Total Reward: The total reward accumulated by an agent (i.e., the area under the learning curve) may be improved by using transfer, compared to learning without transfer.
• Transfer Ratio: The ratio of the total reward accumulated by the transfer learner to the total reward accumulated by the non-transfer learner.
• Time to Threshold: The learning time needed by the agent to achieve a pre-specified performance level may be reduced via knowledge transfer.

The existing transfer learning techniques can be categorized depending on their assumptions, the metrics employed (e.g., [97]), and how they adapt/change the underlying model, assumptions, or policies during transfer. For the purposes of this thesis, the available POMDP-related adaptation via transfer learning is categorized as follows: 1) policy adaptation, and 2) model/data augmentation. While policy adaptation mainly focuses on modifying the policy (online) as new information is revealed from the environment without modifying the model, the second category of techniques accounts for adaptation in the model by augmenting the state, observation, and belief spaces. It is important to note that model/data augmentation techniques are usually combined with policy adaptation techniques, so that the policy can adapt to changes in the model. The diagram in figure 4.2 summarizes the related transfer learning techniques and approaches employed for POMDPs.

Figure 4.2: Summary of existing TL techniques and metrics in POMDPs

4.2.1 Policy Adaptation

Policy adaptation is associated with learning an optimal policy for an environment and modifying that policy as the environment goes through changes [98]. There exist various algorithms for policy adaptation, such as Simple Online Value Iteration (SOVI) [98], Adaptive Belief Tree (ABT) [56], Point-Based Policy Transformation (PBPT) [57], Adapt-To-Learn [99], and SEAPoT [100]. These algorithms achieve policy adaptation by identifying the changes in belief probabilities (resulting from changes in the environment) and modifying the belief tree (i.e., the branches and nodes resulting from the changed beliefs) to adapt the policy. For instance, SOVI incrementally adapts the policy by locally improving the available value function after each action execution [98]. ABT uses an augmented belief tree for policy representation that enables updating policies by modifying the estimated values (i.e., cumulative rewards, Q-values) [56]. In addition, some other techniques are employed for policy transfer as well [101, 102, 103]. For instance, Gaussian Processes (GPs) in POMDPs are used to model system-environment interactions, where the posterior of the mean of the Q-function estimated at the source is employed as the prior of the mean of the Q-function to be estimated for the target problem [101, 102]. Although these algorithms can achieve fast, online policy adaptation, they only focus on updating the policy and do not address adaptation in the model itself.
Moreover, these algorithms can handle only a subset of changes in the tasks or environments (e.g., changes in environment dynamics (transition and emission functions and probabilities) and changes in objectives (reward function)). For instance, the ABT algorithm cannot handle changes in state variables [56].

4.2.2 Adaptation via Model/Data Augmentation

Adaptation via model and/or data augmentation is another technique that employs "similarity functions or metrics" to adapt the model by augmenting the state variables, observations, or belief-spaces [104]. The similarity function or metric identifies what information from the original model (source POMDP) can be employed to transfer knowledge and experience [101, 102, 103]. This technique is mainly employed when the target problem has additional information (e.g., new states) that is not available in the source (original POMDP) [104]. There exists only a limited amount of research on adaptation via model/data augmentation, mainly applied to spoken dialogue system problems formalized as POMDPs. For instance, to enable cross-domain portability in GP-POMDPs, the researchers in [101] formulate a kernel function for defining correlations between beliefs from differing domains. If the target problem-domain includes an extra feature (a new state variable), the correlations between the extra and existing states are defined by specifying which feature(s) in the source are most similar to the new features. The similarity function employed in that research is defined heuristically based on the cardinality of the features. In another example, the previous work is extended to domains with additional new features, where the belief (vector) is extended (augmented) to account for new hidden states associated with the new attributes [102]. The similarity between new and existing features is measured by comparing cardinalities, and the transition probabilities of the new states are then defined as the transition probabilities of the states corresponding to the most similar attributes [101, 102].

4.3 Adaptation via Expansion and Post-Expansion Refinement

To address the aforementioned limitations and gaps in existing adaptation and transfer techniques, this thesis provides a novel hybrid model-based, data-driven technique to enable adaptation and refinement in POMDP models that are defined using the Expandable-Compact POMDP modeling technique. This technique builds upon existing ideas from transfer learning and employs data-driven techniques to augment the model (state, observation, and belief spaces) by gradually expanding it to account for new information and dynamics. Furthermore, post-expansion refinement is performed to further improve the performance of the adapted (expanded) models. Specifically, the technique includes two phases: 1) online adaptation via expansion and 2) offline post-expansion refinement. In the initial phase, which is performed online (while the model interacts with its environment), new, previously unseen information is collected and accounted for by adding new states and observations to the model, augmenting the belief and belief-space, and updating the underlying dynamics based on the measured similarity between new and existing data.
In the offline phase (post-expansion), statistical and visual data analysis techniques are applied to performance data obtained from the adapted model, as well as to the data collected during online adaptation, to identify possible inaccuracies, isolate inefficiencies, and make further refinements (e.g., adding new actions, removing states, or joining states). An overview of the proposed technique is presented in figure 4.3. It is important to note that at time-step t = 0, since the original model has not yet been modified (expanded or refined), the POMDP tuple is simply ⟨S, A, Ω, T, O, R⟩ (i.e., only the information available at the outset, used for defining the original model, is available for adaptation).

Figure 4.3: Overview of phases 1 and 2 of the proposed adaptation and refinement technique. simF denotes a similarity function, S_sim denotes the subset of states with high similarity to the unknown observation, where each state is assigned a similarity weight (simWeights), and M denotes the Expandable-Compact POMDP model

The most important aspects of the presented technique can be summarized as follows:
• The similarity function is formulated based on the results of data analysis performed on the data used for building the existing model.
• Similarity is employed for augmenting (by expanding) the model and its underlying dynamics.
• The new environment is not known beforehand; the differences are identified and learned during adaptation.
• Scalability is accounted for during model augmentation and expansion by avoiding the initialization of redundant states using a labeling function.
• The augmented (adapted) model and the collected data provide a basis for performing data analysis to measure performance, compare models, and perform further refinements.

The following subsections explain the adaptation via expansion algorithm and the post-expansion refinement techniques.

4.3.1 Online Adaptation via Expansion

The online adaptation via expansion technique and algorithm is developed and implemented by leveraging ideas from model/data augmentation techniques. The main assumption in this algorithm is that observations that cannot be interpreted based on the existing state-space are associated with states that are currently missing from the model. Thus, the algorithm attempts to find the missing states and estimate the underlying dynamics by relying on the "similarity" measured between new and known observations. Similar to the existing adaptation via augmentation techniques, the key idea behind the online adaptation via expansion algorithm is defining a similarity function. In this technique (and algorithm), the similarity function is defined using only the available model and information (e.g., the state-space definition and the distribution of observations). For this purpose, the distributions of observations associated with the compact states are employed to formulate the similarity function. This function can be formulated differently (e.g., to calculate similarity using correlation or distance) depending on the context of the problem and the type of data. The function is used continuously during online adaptation via expansion to identify the states most similar (S_sim ⊆ S) to a new, unseen observation (o′ ∉ Ω), where each similar state s′ ∈ S_sim is assigned a normalized similarity weight w_{s′}.
After finding the most similar states and estimating similarity weights, the dynamics of the new state resulting from the new, unknown observation can be estimated as a weighted sum over the dynamics of the similar states, where the weights are the measured similarities. Equation 4.1 provides the overall formulation for estimating the dynamics of a newly added state H′ ∉ S and its associated observation(s) o′ ∉ Ω using the similarity weights and the dynamics of the similar states. In this equation, H′ denotes a possible new state in the expanded state-space (H′ ∈ S⁺, H′ ∉ S), o′ ∉ Ω is the new, previously unseen observation, o ∈ Ω(s′) denotes the observations associated with state s′ ∈ S_sim, and "·" means ∀a ∈ A in F(X, ·, Y) and ∀s ∈ S in F(·, X, Y). The online adaptation via expansion algorithm uses this property to expand the model and estimate the dynamics of new states.

Estimation:
T(H′, ·, H′) = Σ_{s′∈S_sim} w_{s′} · T(s′, ·, s′)
T(·, ·, H′) = Σ_{s′∈S_sim} w_{s′} · T(·, ·, s′)
O(H′, ·, o′∉Ω) = Σ_{s′∈S_sim} w_{s′} · O(s′, ·, o∈Ω(s′))
O(·, ·, o′∉Ω) = Σ_{s′∈S_sim} w_{s′} · O(·, ·, o∈Ω(s′))
R(H′, ·) = Σ_{s′∈S_sim} w_{s′} · R(s′, ·)        (4.1)

When adding new state(s) and dynamics to the model, the belief b_t = [b_{s_1}, b_{s_2}, ..., b_{s_N}] (with |S| = N) and the belief-space need to expand (by augmentation) so that the belief probability of the newly added state(s) (e.g., H′) can be taken into account when observations for that state become available. Every time a new state is added to the model and its dynamics are initialized (or estimated), the belief is expanded by augmenting the belief probability distribution as shown in equation 4.2:

b_t^Ex = [ b_{s_1} − β, ..., b_{s_N} − β, Nβ ]   s.t.   β = min(b_t)/N        (4.2)

where b_t^Ex is the expanded (i.e., augmented) belief at time-step t and b_{s_i}, i ∈ {1, 2, ..., N}, is the belief of s_i. In other words, the belief is augmented by assigning a small probability to the newly added state (e.g., H′) at the time of expansion.

At the time of expansion (t = τ, when possible new states and observations are added to the model and their dynamics are initialized), the belief probabilities of the similar states (∀s′ ∈ S_sim) are also taken into account. In other words, b_τ(s′), ∀s′ ∈ S_sim, becomes the prior probability associated with the expected outcome (i.e., the estimated policy) after transitioning to the new state at the time of expansion. This accounts for situations in which the unknown observation o′ is highly similar to two states s_i and s_j (comparable weights w_{s_i} and w_{s_j}) but their belief probabilities b_t(s_i) and b_t(s_j) differ considerably. During online adaptation, the algorithm finds new states (and their associated observations) and adds them to the existing model. However, if the state-space is continuously populated with a new state for every individual new, unseen observation, the size and complexity of the POMDP model will rapidly increase, leading to computational and memory issues [44, 57]. Similar to the Expandable-Compact POMDP formulation, this issue can be addressed by grouping similar observations and representing them with a single state. The new estimates obtained from the similarity measured between a new observation and the existing states can then be used to update the previous estimates for that state instead of initializing a redundant new state.
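The following is a minimal Python sketch (not the thesis implementation) of the weighted-sum initialization in equation 4.1 and the belief augmentation in equation 4.2, under the assumption that the dynamics are stored as per-action NumPy matrices and that each similar state has a single representative observation column; all names are illustrative.

import numpy as np

def initialize_new_state_dynamics(T, O, R, sim_states, sim_weights, obs_of):
    """Equation 4.1 sketch: weighted-sum initialization of a new state H' and
    a new observation o'. T[a]: |S|x|S|, O[a]: |S|x|Omega|, R[a]: length |S|;
    sim_states: indices of S_sim; sim_weights: normalized similarity weights;
    obs_of[s]: index of a representative observation of state s."""
    T_new, O_new, R_new = {}, {}, {}
    for a in T:
        T_a = np.zeros((T[a].shape[0] + 1, T[a].shape[1] + 1))
        O_a = np.zeros((O[a].shape[0] + 1, O[a].shape[1] + 1))
        R_a = np.zeros(R[a].shape[0] + 1)
        T_a[:-1, :-1], O_a[:-1, :-1], R_a[:-1] = T[a], O[a], R[a]
        for s, w in zip(sim_states, sim_weights):
            T_a[-1, -1] += w * T[a][s, s]           # T(H', a, H')
            T_a[:-1, -1] += w * T[a][:, s]          # T(., a, H')
            O_a[-1, -1] += w * O[a][s, obs_of[s]]   # O(H', a, o')
            O_a[:-1, -1] += w * O[a][:, obs_of[s]]  # O(., a, o')
            R_a[-1] += w * R[a][s]                  # R(H', a)
        # re-normalize each row so the probabilities sum to 1 (cf. equation 4.5)
        T_a /= np.maximum(T_a.sum(axis=1, keepdims=True), 1e-12)
        O_a /= np.maximum(O_a.sum(axis=1, keepdims=True), 1e-12)
        T_new[a], O_new[a], R_new[a] = T_a, O_a, R_a
    return T_new, O_new, R_new

def expand_belief(b):
    """Equation 4.2: subtract beta from every entry and assign N*beta to H'."""
    beta = b.min() / len(b)
    return np.append(b - beta, len(b) * beta)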
To identify similar groups of new, unseen observations, a labeling function is implemented that, for each new, unknown observation, generates a label representing its most similar states. For instance, assume o′_l and o′_k ∉ Ω are two new, unseen observations whose most similar states are s_i and s_j, with w(s_i) > w(s_j); the label associated with both observations is then label = "ij", implying that they belong to the same state and observation group, initialized based on similarity with s_i and s_j. The pseudo-code for this function is provided in Algorithm 3, and a runnable sketch of this labeling and bookkeeping step is given after the discussion of the decay factor below.

Algorithm 3: Pseudocode for the labeling function
function GenerateLabel(o′_t, C = {C(s_0), ..., C(s_n)})
    SimiW ← []
    for c ∈ C do
        sim ← SimF(c, o′_t)
        insert sim into SimiW
    {W_max} ← FindMLargest(SimiW)
    {Index_max} ← sort(FindIndex(W_max, SimiW))
    "label" ← ConvertToString(Index_max)
    return {W_max}, {Index_max}, "label"

The results of comparing the size of the state-space and the policy estimation time (using the N-Step Look-Ahead online algorithm) when labeling is employed, versus when the state-space is continuously populated with a new state for each individual new observation, are provided in figure 4.4. As shown in this figure, the policy estimation time (purple dotted line) increases to tens of seconds (violating real-time performance requirements and constraints) as the size of the state-space grows beyond roughly 20 states.

Figure 4.4: Exemplar comparison of model size and policy estimation time (using N-Step Look-Ahead) when no labeling is employed (purple) vs. when labeling is used to reduce the size and complexity of the model (blue)

After the dynamics of the new state(s) are initialized based on the measured similarity and the prior probabilities (the belief probabilities of ∀s′ ∈ S_sim), the similarity measured from subsequent observations associated with the initialized states is employed to update the earlier estimates. For this purpose, the following questions need to be answered: 1) How is the expected outcome (which depends on the prior probabilities and the estimated dynamics) affected as more data becomes available? 2) How should previous estimates be updated based on new data? 3) Which observations should be used for updating the estimates? To answer the first question, it is important to note that after the time of expansion (time-step τ), as more information about the newly added state(s) becomes available, the initial estimates based on the belief probabilities at time-step τ, b_τ(∀s′ ∈ S_sim), become less accurate. This happens because the beliefs of ∀s′ ∈ S_sim are updated and change after some Δt. Thus, the effect of the priors on the estimated dynamics should decrease gradually. To account for this, a decay factor 0 < ζ < 1 is defined, by which the effect of the prior probabilities used in initializing the dynamics is gradually decreased as more data for that state becomes available, so that the estimates are eventually obtained only from the data collected for that state. As shown in equation 4.3, these prior probabilities gradually fade away as b_τ(∀s′ ∈ S_sim) converges to a uniform distribution over S_sim:

∀s′ ∈ S_sim:   b_{τ+Δt}(s′) ∝ ζ^{Δt} · b_τ(s′)   s.t.   b_{τ+Δt}(s′) → 1/|S_sim|        (4.3)

Figure 4.5 illustrates how the prior probabilities b_τ(∀s′ ∈ S_sim) decay after expansion as more observations become available.
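The following is a minimal, runnable Python sketch of the labeling step in Algorithm 3, together with the hash-table bookkeeping used later when averaging new and old estimates (equation 4.7). The similarity function sim_fn and all names are illustrative placeholders, not the thesis implementation.

def generate_label(obs, clusters, sim_fn, m=2):
    """Return the top-m similarity weights, the sorted indices of the most
    similar states, and a canonical label string such as "12"."""
    sims = [sim_fn(c, obs) for c in clusters]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:m]
    w_max = [sims[i] for i in top]
    index_max = sorted(top)
    label = "".join(str(i) for i in index_max)
    return w_max, index_max, label

def record_observation(label, sim_dict):
    """Hash-table bookkeeping: count of observations seen for each label;
    the count is the m used in the running average of equation 4.7."""
    sim_dict[label] = sim_dict.get(label, 0) + 1
    return sim_dict[label]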
To account for the similarity weights and the decaying prior probabilities when estimating dynamics as more data becomes available, the estimation in equation 4.1 is updated as shown in equations 4.4 and 4.5, where equation 4.4 estimates new dynamics from the new observation and equation 4.5 normalizes the estimated probabilities so that the transition and emission probabilities associated with a state and action sum to 1.

Estimation:
T(H′, ·, H′) = Σ_{s′∈S_sim} w^T_{s′} · T(s′, ·, s′)
T(·, ·, H′) = Σ_{s′∈S_sim} w^T_{s′} · T(·, ·, s′)
O(H′, ·, o′∉Ω) = Σ_{s′∈S_sim} w^T_{s′} · O(s′, ·, o∈Ω(s′))
O(·, ·, o′∉Ω) = Σ_{s′∈S_sim} w^T_{s′} · O(·, ·, o∈Ω(s′))
R(H′, ·) = Σ_{s′∈S_sim} w^T_{s′} · R(s′, ·)        (4.4)

Normalization:
T(H′, a, ·) = T(H′, a, ·) / Σ T(H′, a, ·)   s.t.   Σ_{s∈S⁺} T(H′, a, s) = 1
O(H′, a, ·) = O(H′, a, ·) / Σ O(H′, a, ·)   s.t.   Σ_{o∈Ω⁺} O(H′, a, o) = 1        (4.5)

where the "·" symbol in T(H′, a, ·) means ∀s ∈ S⁺ = {S ∪ H′}, the "·" in O(H′, a, ·) means ∀o ∈ Ω⁺ = {Ω ∪ o′}, and w^T_{s′} is the overall similarity weight associated with s′ ∈ S_sim and the T-th observation associated with H′ obtained after some Δt; it is calculated as a combination of the (decayed) prior and the similarity weight associated with s′, as shown in equation 4.6:

w^T_{s′} = ( b_τ(s′) ζ^T w_{s′} ) / ( Σ_{s′∈S_sim} b_τ(s′) ζ^T w_{s′} )   s.t.   Σ_{s′∈S_sim} w^T_{s′} = 1        (4.6)

Figure 4.5: Example of belief decay as the cluster of the newly added state H′ gets populated and the dynamics get updated

Introducing a normalized version of the weights is important because the weights are associated with the likelihood of the past data (i.e., the available states and known dynamics), so the obtained values need to be adjusted to a common scale [105].

Question 2: Every time a new, unseen observation is received and labeled, the online adaptation algorithm searches for its label in memory. If the label does not exist in memory, a new state is added to the model and a new group is initialized in the observation-space. If, on the other hand, the label exists in memory, which implies that the observation can be addressed by one of the newly added states, the observation is added to the group (e.g., cluster) of observations associated with that state. The information from this new observation (i.e., the new dynamics estimated from the similarity measured for it) is then used to update the previously estimated dynamics of the associated state. To keep track of the labels and the number of observations within each new cluster (group), a hash table data structure (space complexity O(n) and search time O(1)) is employed, where the labels are the keys and the number of observations in each newly initialized cluster (label) is the value. As the clusters associated with new states are populated during adaptation, the previous estimates are updated by estimating new dynamics from the observed data and then averaging the new and previous estimates as shown in equation 4.7:

K_updated(H′ | o′_0 o′_1 ... o′_{t+1}) = [ m · K_old_t(H′ | o′_0 o′_1 ... o′_t) + K_new_{t+1}(H′ | o′_{t+1}) ] / (m + 1)        (4.7)

where m is the number of observations assigned to the cluster associated with H′, which is readily available from the hash table.
K_updated can be substituted with T, O, and R, where K_new_{t+1} is the new estimate based on the new observation (obtained using equations 4.4 and 4.5) and K_old_t is the previous estimate.

The third question can be answered by defining a criterion that identifies when K_new_{t+1} (the estimate based on the new observation) may be used for updating the previous estimates. In this algorithm, the criterion is defined heuristically based on the experience gained from policy estimation using the original POMDP model. In other words, since the dynamics of the new states are estimated based on similarity weights and the belief probabilities of a subset of similar states (S_sim) in the original POMDP, the policy associated with the new states should also be similar to the policies estimated for the states within S_sim. Thus, the dynamics of a new state are only updated when the policy estimated from K_new_{t+1} (T^new_H, O^new_H, and R^new_H) matches (is similar to, or at least does not contradict) the policy of the states within S_sim. In addition, the dynamics continue to be updated until the policy of the new state converges to the policy of the state with the highest similarity. Figure 4.6 provides the flow chart of the adaptation via expansion technique. The pseudo-code for this algorithm is provided in the Appendix.

Figure 4.6: Flow chart of the adaptation via expansion technique. o′_t is the new unknown observation obtained at time-step t, S_sim is the set of similar states, and SimDict is the hash table employed for storing the labels and the numbers of observations associated with newly added states

4.3.2 Post-Expansion Refinement

The online adaptation phase leads to identifying new states and observations and estimating their dynamics. In the post-expansion phase, statistical and visual data analysis techniques are applied to the collected data and to the performance of the expanded POMDP to identify inaccuracies and inefficiencies, if any, and to perform further refinements. These refinements include, but are not limited to, the following:
• Deciding which states to keep in the state-space
• Joining a subset of existing observation clusters and updating the observation function
• Adding new actions to the action-space and estimating dynamics for the new actions
• Updating predefined heuristics

Figure 4.7 provides exemplar updates to the original POMDP model during adaptation via expansion and after post-expansion refinements. It is important to note that adding new actions to the model requires verifying the feasibility of each action with respect to the physical constraints and requirements of the system (e.g., defining physics models for an AV and verifying feasibility with respect to its physical constraints).

Figure 4.7: Example of how the POMDP changes during online adaptation via expansion (H_0 and H_1 and their observations are added) and after post-expansion refinements (s_0 and s_3 and their observations are joined and a new action (blue) is added)

4.4 Adapting and Refining the Lane-Keeping and Lane-Changing Models

This section provides the experimental results of applying the online adaptation and offline refinements to the lane-keeping and lane-changing models. First, the algorithm and technique are applied to the lane-keeping model and the application process (including the data analysis results) is explained in detail. The technique and algorithm are then employed for adapting and refining the lane-changing model with respect to a risky, unsafe freeway environment.
4.4.1 Adaptation and Refinement in the Lane-Keeping Model

The online adaptation via expansion algorithm and the post-expansion refinement are applied to the lane-keeping model, which was originally designed for a safety-critical application of AVs in a simple lane-keeping use-case scenario. The goal of this experiment is to adapt the existing model to a new, complex environment by building upon the existing model and the data available at the outset. The original POMDP was designed to perform collision-free, safe, and smooth lane-keeping in a multi-lane freeway environment, where the traffic distribution within the perimeter of the AV was described by the number of vehicles in each lane, the distances between vehicles (Δd), and the relative velocities (Δv). The main assumption in this use-case scenario was that the traffic distribution (e.g., distances and velocities) changed randomly, but none of the vehicles exhibited sudden behaviors. The observations from this environment at a given time-step t are defined as o_t = ⟨Δv_t, TTC_t, Δd_t⟩_{AV,X}, where X denotes the vehicle agents within the perimeter of the AV and TTC stands for Time-To-Collision. In the new use-case scenario and environment, a.k.a. the Late Reveal scenario, the goal is to perform safe, smooth, and collision-free lane-keeping in the presence of sudden behaviors (e.g., hard braking (stopping in the lane) and cutting off into other lanes) by the vehicles within the perimeter of the AV, which result in new observations that are not addressed by the original POMDP.

To begin the adaptation process, a similarity function needs to be formulated. In this experiment, the similarity function is defined and formulated by analyzing the distribution of observations associated with the states (i.e., the clusters of observations representing states). Specifically, the analysis focuses on the mean and standard deviation of the observation distributions within each cluster, and the obtained statistics are employed for formulating the similarity function. For the purpose of this experiment, the similarity function is formulated as a weighted Euclidean distance, where the weights associated with the features (i.e., the state variables Δd, Δv, TTC) for each individual cluster are defined to be inversely proportional to the standard deviation of that cluster. Equation 4.8 provides the similarity function:

w_sim(c(s∈S), o_t = ⟨Δd_t, Δv_t, TTC_t⟩) = 1 / w_dist(o_t, Cent_c = ⟨c_Δd, c_Δv, c_TTC⟩, w_c = ⟨w_Δd, w_Δv, w_TTC⟩)
= 1 / sqrt( w_Δd (c_Δd − Δd_t)^2 + w_Δv (c_Δv − Δv_t)^2 + w_TTC (c_TTC − TTC_t)^2 )        (4.8)

where c(s ∈ S) is the cluster of observations associated with state s, Cent_c is the centroid of observation cluster c, and w_c is the weight vector assigned to cluster c. For instance, if the observations within a cluster are randomly distributed along the TTC axis (i.e., high standard deviation), the TTC measurements of those observations do not contain useful information about the characteristics of that cluster; thus, the weight associated with the TTC variable for that cluster is small (∝ 1/std(TTC_c)). Figure 4.8 shows the distribution of the clusters in the available (original) observation-space over the distance and relative-velocity axes.

Figure 4.8: Distribution of observation clusters over Δd and Δv

In this experiment, as observations are obtained from the environment at every time-step (o_t), the weighted Euclidean distance from the centroid of each observation cluster is measured; a minimal sketch of this similarity computation is given below.
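The following is a minimal sketch (not the thesis implementation) of the weighted Euclidean similarity in equation 4.8, under the assumption that each cluster's centroid and per-feature standard deviation over (Δd, Δv, TTC) are precomputed; all names and the example values are illustrative.

import numpy as np

def cluster_weights(std_per_feature, eps=1e-6):
    """Feature weights inversely proportional to the cluster's standard deviations."""
    return 1.0 / (np.asarray(std_per_feature) + eps)

def weighted_similarity(obs, centroid, weights, eps=1e-6):
    """Equation 4.8: w_sim = 1 / weighted Euclidean distance to the centroid."""
    diff = np.asarray(obs) - np.asarray(centroid)
    dist = np.sqrt(np.sum(weights * diff ** 2))
    return 1.0 / (dist + eps)

# Example: similarity of a new observation [delta_d, delta_v, ttc] to one cluster
# sim = weighted_similarity([12.0, -1.5, 3.2], centroid=[15.0, -0.5, 4.0],
#                           weights=cluster_weights([4.0, 1.2, 9.0]))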
If the distance to a specific cluster is small enough (e.g., smaller than a threshold ε), the observation is assigned to that cluster; otherwise, the M ≪ |S| closest clusters (similar states) are identified and the state-space is expanded to account for the new information.

To this end, the online adaptation via expansion algorithm is evaluated in 138 different scenarios (simulated and implemented in CARLA [106]), where each scenario (and the severity of the sudden behaviors) is determined by parameters such as the initial distances between vehicles, the target speed, the brake acceleration (for the vehicle agent that applies a hard brake), and the swerve delay and swerve duration (for the vehicle agents that cut off into other lanes). The decay factor (0 < ζ < 1) for each scenario is selected experimentally depending on the scenario parameters. Figure 4.9 illustrates an example of how the beliefs of the similar states (used for initializing H_1 in this experiment) vanish gradually (ζ = 0.06) as more new observations are employed for updating the initialized dynamics of this state.

Figure 4.9: Example of how beliefs of s_1 and s_2 decay as new observations are used for updating the dynamics of H_1

The results obtained from evaluating the adaptation via expansion algorithm in the 138 scenarios indicate that 3 possible new states are missing from the original POMDP model, where the observation clusters for 2 of these states (H_0 and H_1) are densely populated with new, unknown observations. Figures 4.10 and 4.11 show the distribution of observations and clusters in the new observation-space and the distribution of estimated rewards (including all estimates and updates) for the two states with dense observation clusters, respectively (H_0, based on similarity with s_1 and s_2 (pink), and H_1, based on similarity with s_1 and s_3 (light green)).

Figure 4.10: New observations identified during adaptation: left and right show the original and new observation-space, respectively, including clusters associated with new, previously unseen observations

Figure 4.11: Range and distribution of estimated reward values for the states associated with the pink and light green clusters (new states)

To evaluate how the criterion for updating the dynamics with new estimates affects performance (the safety of the resulting policy), the results obtained from adaptation with the criterion are compared to the results obtained when the criterion is not used (i.e., all new estimates are used to update the dynamics of the newly added states). The estimated policies from this experiment are divided into the following categories depending on the final crash velocity: 1) Low Severity or No Crash, 2) Severe, and 3) Highly Severe. Table 4.1 provides the results of this evaluation. As shown in this table, the percentage of policies leading to highly severe crashes decreases (from 4.03% to 1.88%) when the estimates are updated only with information that satisfies the criterion.

Table 4.1: Comparison of policies estimated during online adaptation via expansion (update criterion enabled vs. discarded)
Policy Category | % No Update Criterion | % Update Criterion
Low Severity | 83.45 % | 96.78 %
Severe | 12.53 % | 1.34 %
Highly Severe | 4.03 % | 1.88 %

After applying the online adaptation algorithm, the performance of the expanded POMDP model is evaluated and analyzed using visual and statistical data analysis techniques to identify the sources of inefficiency and perform further refinements. For this purpose, the TTC during each scenario is measured. In addition, the rate of change of the AV's velocity (resulting from the policies) is compared to the rate of change of the measured TTC (i.e., how the estimated policy that changes the AV's velocity affects the measured TTC). The results of this data analysis indicate that the new states H_0 and H_1 mainly result from situations in which the vehicle in front of the AV is forced to change lanes to avoid a collision with another vehicle that suddenly stops in the lane (i.e., a late reveal). In these situations, the policy estimated using the N-Step Look-Ahead algorithm (during online adaptation) initially results in maintaining the status quo (a_0 ∈ A_DL) when H_0 (similar to s_1 ∈ S_DL and s_2 ∈ S_DL) is initialized, since the early observations are closer to the observations associated with s_2. As the observations gradually become more similar to the observations associated with s_1 ∈ S_DL, the estimated policy results in slowing down by performing action a_2 ∈ A_DL to avoid a collision with the stopped vehicle (the stopped vehicle only becomes visible after the vehicle in front of the AV has finished cutting off into the other lane). The policy associated with H_1 (similar to s_1 ∈ S_DL and s_3 ∈ S_DL) always results in slowing down (a_2 ∈ A_DL) to avoid transitioning to state s_3 (failure). Figures 4.12 and 4.13 show the measured TTC for the 138 scenarios and compare the rate of change of the AV's velocity (i.e., its derivative) to the measured TTC for each scenario, respectively. Since the duration of a scenario (in time-steps) depends on the initial scenario parameters, a time vector (with size equal to the maximum duration) is initialized, and all the data-points (measurements) of shorter scenarios are shifted to the end of the time vector, with the beginning initialized with zeros (the dark blue areas in figure 4.13). As shown in figure 4.12, the TTC measured in the scenarios initially decreases linearly but then drops drastically, indicating a collision or an about-to-crash situation.

Figure 4.12: TTC measurements resulting from policy estimation using N-Step Look-Ahead during online adaptation

In addition, comparing the derivatives of the AV's velocity resulting from the estimated policies with the derivatives of the TTC indicates that, although the AV attempts to slow down to avoid a collision when the state transitions to H_0 or H_1, the rate of change in velocity (resulting from a -1 m/s^2 deceleration) is not sufficient to increase the TTC (i.e., to avoid collisions). This is shown in figure 4.13 (left), where the percentage of dark blue areas (derivative ≤ 0.05) in each scenario (ignoring the dark blue areas at the beginning of each row) is much larger than the percentage of green, yellow, and red areas (derivative > 0.05).

Figure 4.13: Comparison of the rate of change of TTC vs. the AV's velocity, where the AV's velocity is controlled by the policies estimated during adaptation

As shown in this figure, the policy estimated after online adaptation avoids collisions in only a few scenarios (rows that end in red colors, Δ(TTC)/time-step ≥ 0.15), where the lane change happens much earlier than the sudden stop in the lane, allowing the AV to slow down for a longer period and come to a full stop before crashing into the front vehicle.
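The following is a minimal sketch (not the thesis implementation) of the kind of rate-of-change comparison described above, assuming per-scenario arrays of TTC and AV velocity sampled at a fixed time-step and zero-padded at the front to a common length, as in figure 4.13; all names are illustrative.

import numpy as np

def rate_of_change_comparison(ttc_per_scenario, vel_per_scenario, dt=0.1):
    """Align scenario traces to the longest one (zero-padded at the front) and
    return the time derivatives of TTC and AV velocity for every scenario."""
    max_len = max(len(t) for t in ttc_per_scenario)
    d_ttc, d_vel = [], []
    for ttc, vel in zip(ttc_per_scenario, vel_per_scenario):
        pad = max_len - len(ttc)
        ttc_p = np.concatenate([np.zeros(pad), ttc])
        vel_p = np.concatenate([np.zeros(pad), vel])
        d_ttc.append(np.gradient(ttc_p, dt))
        d_vel.append(np.gradient(vel_p, dt))
    return np.vstack(d_ttc), np.vstack(d_vel)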
The results of the above analysis imply that, to avoid collisions in such scenarios, the AV needs to slow down by applying maximum brake acceleration (deceleration) when the observations demonstrate high similarity with state s_3 ∈ S_DL (failure/collision). This can be addressed either by adding a new action to the action-space, e.g., a_3: brake with maximum deceleration, or by defining heuristics that interpret action a_2 ∈ A_DL differently depending on the belief probability distribution (e.g., if b_t(H_1) ≥ 0.5 and policy: b_t → a_2 ∈ A_DL, then apply maximum brake deceleration; if b_t(s_1) ≥ 0.5 and policy: b_t → a_2 ∈ A_DL, then apply a -1 m/s^2 deceleration). To this end, the following refinements are applied after expansion: R1) Refined State (S) and Action (A) Spaces (Refined S&A), where H_0 is discarded, the clusters associated with s_2 ∈ S_DL and s_1 ∈ S_DL are expanded to account for the observations associated with H_0, and action a_3 is added to the action-space. The reward (penalty) for transitioning to H_1 is defined as the mean of the estimated rewards for this state as shown in figure 4.11 (R(H_1) = 8.9), and the transition and emission probabilities associated with the new action are initialized similar to those of action a_2 ∈ A_DL and fine-tuned experimentally; and R2) Refined State-Space and Heuristics (Refined S&Hrstc), where both H_0 and H_1 are kept in the model and heuristics are defined for interpreting actions based on the belief distribution. The performance of the refined models is then compared to the online-adapted POMDP (before refinements are applied) and to a rule-based controller (implemented to avoid collisions in late-reveal situations). For further evaluation and comparison, the performance of the refined POMDPs is also compared to the performance achieved by decision-making in the new environment using transferred policies. To obtain the transferred policies, a policy transfer technique and algorithm is implemented that employs the source (originally defined) POMDP (e.g., its belief-space) and the estimated source Q-values to approximate the target belief-space and Q-values by relying on a mapping function. Details of the transfer learning technique implemented for this experiment are discussed later in this section.

The performance metric in this experiment is formulated as a function of the final TTC (indicating failure or success), robustness, and policy estimation time. The robustness of the refined and/or adapted POMDP models is described in terms of the cumulative changes in decisions resulting from changes in observations (i.e., how often the POMDP changes its decision given the changes in observations). Equation 4.9 provides the performance metric employed for evaluating the techniques and models in this experiment:

Performance = TTC / ( CumulativeChange + T(PolicyEstimation) )        (4.9)

Figure 4.14 compares the performance achieved by decision-making using the models and techniques mentioned above. As shown in this figure, the POMDP with refined state and action spaces (R1 - Refined S&A) achieves the highest performance. Table 4.2 summarizes the performance statistics for the evaluated models and techniques.
Figure 4.14: Comparing the performance of the refined and expanded (online adaptation) POMDPs to the transferred and rule-based techniques

Table 4.2: Summary of performance statistics (TTC, robustness, policy estimation time) for the evaluated models and techniques
Model/Technique | TTC Stats (s) | Decision Change | T(Policy Estimation) (s)
Refined S & A | [14.8, 94.3, 213.6] | [6.3, 8.5, 11.3] | 0.07
Transfer | [14.3, 93.4, 209.9] | [7.2, 9.5, 12.3] | 0.0 (offline)
Refined S & Hrstc | [14.8, 94.3, 213.6] | [7.6, 11.9, 17.4] | 0.14
Online Adapt. | [0.67, 15.8, 45.9] | [7.6, 11.9, 17.5] | 0.14
Rule-Based | [0.64, 1.39, 2.78] | [177, 251, 308] | 0.0 (no estimation)

The results indicate that the performance of the refined (S & A) POMDP is slightly better than that of the refined (S & Hrstc) POMDP, since the policy estimation time using the N-Step Look-Ahead algorithm is smaller due to a relatively smaller state-space. In addition, the performance of the refined (S & A) POMDP is better than the performance obtained from the transferred policies and Q-values. This is because transfer does not perform well for beliefs whose probabilities are evenly distributed over the states. The performance of the expanded (online-adapted) POMDP is, on average, lower than that of the refined and transferred POMDPs, mainly due to the insufficiency of its action-space. The rule-based controller has the lowest performance even though it applies maximum brake deceleration when risky situations occur; this is mainly because its control actions change immediately in response to small changes in the measurements. The performance results obtained from this experiment are based on statistical data analysis of results from testing and evaluating the models in a series of scenarios within a simulated environment.

For the sake of completeness, the performance and robustness of the original POMDP and the refined (S & A) POMDP models are also evaluated and compared using a probabilistic, model-based technique. The results obtained from this evaluation also confirm that the state and action spaces of the original POMDP model are not sufficient for safe, collision-free, and smooth planning and decision-making in a late-reveal scenario. This technique compares the models by associating a probability distribution function (PDF) with each model (i.e., representing the behavior of the model and the resulting decisions using a probability space) and estimating the "extreme" and "expected" probabilities of transitioning to the failure states based on the PDF [107]. In other words, the main idea behind the probabilistic, model-based technique is to associate an overall "failure probability" with each available POMDP based on its belief-space and to compare performance using the estimated failure probabilities. The failure probability of a belief within the belief-space is described as a combination of the belief probabilities of the states, the transition probabilities from those states to the failure state(s), and the emission probabilities of observing a failure from those states. Thus, the failure probability of a model can be calculated from the failure probabilities estimated for each belief obtained from the POMDP. Figure 4.15 provides the flowchart of the probabilistic, model-based evaluation technique employed in this experiment.
Figure 4.15: Overview of steps for performing probabilistic, model-based comparison based on expected and extreme failure probabilities estimated from the models

Step 1: The first step in this technique is associated with developing a PDF for the original and refined (S & A) POMDP models, where each PDF summarizes the behavior of the model in the environment. This information can be obtained by exposing the POMDP models to a series of actions and observations (e.g., a_0 o_0 a_1 o_1 a_2 o_2 ... a_n o_n) expected from deploying the model(s) in the environment, where the information regarding each interaction and the history of previous interactions are summarized in the beliefs as shown in equation 4.10.

(b_t | a_0 o_0 a_1 o_1 a_2 o_2 ... a_{t-1} o_{t-1}) = (b_t | (b_{t-1} | a_0 o_0 a_1 o_1 a_2 o_2 ... a_{t-2} o_{t-2}), a_{t-1} o_{t-1}) = (b_t | b_{t-1}, a_{t-1} o_{t-1})    (4.10)

It is important to note that constructing the belief-space requires assuming a reasonable series of actions and observations, since not all possible pairs of actions and observations are valuable or achievable given a POMDP and its environment. To account for this, the belief-spaces are estimated and constructed by following a policy where the observation obtained after performing an action is randomly selected (given the emission probability of the action and the current belief) from the emission matrices of the POMDP models. Within the belief-space, each belief b_i ∈ B is accompanied by an action determined by the policy (resolution of non-determinism [108]), based on which the probability of transitioning to the failure state(s) from that belief-action pair can be obtained as shown in equation 4.11.

pr_transitionToFailure(b_t, a_t) = Σ_{s_f ∈ S_F, o_f ∈ Ω_F} Σ_{s ∈ S} b_t(s) · pr(s_f | s, a_t) · pr(o_f | s, a_t)    (4.11)

where S_F ⊆ S is the subset of failure states, Ω_F ⊆ Ω is the subset of observations indicating failure, pr(s_f | s, a_t) is the probability of performing a_t in s and transitioning to s_f, and finally, pr(o_f | s, a_t) is the emission probability of performing a_t in s and receiving failure observation o_f.

Steps 2, 3, and 4: Calculating the expected probability of transitioning to failure states based on the belief-action pairs within the belief-space should account for 1) the failure probability of the beliefs within the belief-space and 2) how frequently those beliefs are obtained from the POMDP. This probability cannot be calculated directly from the belief-space, since the belief-space is continuous; instead, it can be approximated by discretizing the belief-space into a distinct number of belief clusters as shown in equation 4.12.

pr_transitionToFailure(B) = Σ_{c ∈ C} w_c · mean(pr_transitionToFailure(b_t, a_t | b_t ∈ c))    (4.12)

where w_c is the frequency/size of cluster c within the discretized belief-space and mean(pr_transitionToFailure(b_t, a_t | b_t ∈ c)) is the average probability of transitioning to failure states from the belief distributions within cluster c. Similarly, the extreme probability values can be estimated by calculating the minimum and maximum weighted-average probability of transitioning to failure states based on the clusters, as shown in equation 4.13.

Ψ_transitionToFailure(B) = Σ_{c ∈ C} w_c · mean(Ψ(b_t(s) · pr(s_f | s, a_t)) · pr(o_f | s, a_t))    (4.13)

where Ψ can be substituted with the min or max function. Finally, after estimating the extreme and expected failure probabilities for each POMDP, the models can be compared based on these probabilities.
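The sketch below shows how equations 4.11 and 4.12 can be evaluated for a POMDP expressed with transition and emission matrices; the array shapes, data layout, and function names are assumptions for illustration, not the implementation used in the experiment.

```python
import numpy as np

def belief_failure_prob(belief, action, T, O, failure_states, failure_obs):
    """Equation 4.11: probability of transitioning to a failure state and
    emitting a failure observation from a given belief-action pair.
    T[a, s, s'] and O[a, s, o] are the transition and emission matrices."""
    prob = 0.0
    for s_f in failure_states:
        for o_f in failure_obs:
            prob += np.sum(belief * T[action, :, s_f] * O[action, :, o_f])
    return prob

def expected_failure_prob(clusters, T, O, failure_states, failure_obs):
    """Equation 4.12: size-weighted average of per-cluster mean failure
    probabilities over a discretized belief-space. `clusters` maps a cluster
    id to a list of (belief, action) pairs, where the action is the one
    selected by the policy for that belief."""
    total = sum(len(pairs) for pairs in clusters.values())
    expected = 0.0
    for pairs in clusters.values():
        per_belief = [belief_failure_prob(b, a, T, O, failure_states, failure_obs)
                      for b, a in pairs]
        expected += (len(pairs) / total) * np.mean(per_belief)
    return expected
```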
The probabilistic, model-based evaluation and comparison technique described above is applied to the original lane-keeping model and the refined (S & A) model. After estimating their belief-spaces and discretizing the belief-spaces into clusters, the expected failure probability for both models is calculated using equation 4.12. Figure 4.16 provides the size and average probability of transitioning into the failure state for the available clusters in the estimated belief-spaces. Based on the results presented in figure 4.16, the expected probability of failure for the original and refined POMDP models is estimated. This probability is 0.1178 for the lane-keeping model and 0.0866 for the refined POMDP model. The extreme probabilities of failure are also estimated for these models using equation 4.13, and the results are provided in figure 4.17. As shown in these figures, employing the refined (S & A) model in the complex lane-keeping environment (i.e., the late-reveal scenario) is expected to lead to lower failure probabilities compared to the original lane-keeping model, which confirms the results obtained from evaluating these models in the simulated scenarios.

Figure 4.16: Estimation of probability of transitioning to failure state based on belief-spaces

Figure 4.17: Distribution of min-max probabilities of transitioning to failure states for the original and refined lane-keeping models

The results obtained from the probabilistic, model-based evaluation of the original lane-keeping model can be further analyzed to identify and isolate the sources of inefficiencies in the model. For this purpose, the belief clusters that mainly result in large values of the expected probability of transitioning to the failure state(s) are analyzed. In other words, the medoids of clusters with average failure probabilities pr_failure ≥ 0.15 are identified and the belief probabilities associated with different states are analyzed. Table 4.3 provides the medoid belief probability associated with each such cluster in the original POMDP belief-space and the expected failure probability for that cluster.

Table 4.3: Analysis of belief probabilities and probability of failure for the original lane-keeping model

Cluster ID | Size | b(s_0) | b(s_1) | b(s_2) | b(s_3) | pr_F
1          | 16   | 0.25   | 0.25   | 0.25   | 0.25   | 0.252
6          | 45   | 0.01   | 0.73   | 0.08   | 0.18   | 0.157
11         | 1    | 0.07   | 0.43   | 0.18   | 0.32   | 0.318
12         | 42   | 0.01   | 0.60   | 0.06   | 0.34   | 0.290
13         | 19   | 0.09   | 0.02   | 0.04   | 0.85   | 0.786
14         | 9    | 0.01   | 0.51   | 0.04   | 0.44   | 0.368
15         | 26   | 0.02   | 0.01   | 0.001  | 0.98   | 0.818
16         | 1    | 0.23   | 0.04   | 0.13   | 0.60   | 0.659
17         | 2    | 0.18   | 0.04   | 0.10   | 0.68   | 0.65

Analyzing the belief probability distributions (b(s_0), ..., b(s_3)) associated with the clusters that lead to large pr_F values shows that 70.18% of the large failure probabilities are associated with beliefs whose probability mass is mainly distributed over states s_1 ∈ S_DL and s_3 ∈ S_DL. This implies that the control action resulting from the policy estimated for beliefs with high probabilities assigned to s_1 ∈ S_DL and/or s_3 ∈ S_DL, and for beliefs with high probabilities assigned to states similar to s_1 ∈ S_DL and s_3 ∈ S_DL (i.e., the H_1 state identified during online adaptation), is not sufficient for avoiding collisions (i.e., avoiding transitioning to a failure state). This confirms the refinement performed based on the statistical analysis results and the need for action a_3 in the model.
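The kind of post-hoc filtering summarized in Table 4.3 can be sketched as follows: high-failure clusters are selected against the 0.15 threshold described above and the states carrying most of the medoid belief mass are reported. The data structures and function names are illustrative assumptions.

```python
def high_failure_clusters(cluster_medoids, cluster_failure_probs, threshold=0.15):
    """Return (cluster_id, medoid_belief, pr_F) for clusters whose average
    failure probability meets the threshold, sorted by decreasing pr_F."""
    flagged = [(cid, cluster_medoids[cid], p)
               for cid, p in cluster_failure_probs.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[2], reverse=True)

def dominant_states(medoid_belief, top_k=2):
    """Indices of the states that carry the most belief mass in a medoid."""
    return sorted(range(len(medoid_belief)),
                  key=lambda s: medoid_belief[s], reverse=True)[:top_k]
```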
4.4.2 Adaptation and Refinement in the Lane-Changing Model

After evaluating the performance of the adaptation and expansion technique (presented in section 4.3), the same procedure is applied to adapt the lane-changing POMDP model to a complex, risky, and unsafe environment. In the original lane-changing environment, the goal was achieved by assuming that none of the other vehicle agents within the perimeter of the AV demonstrate risky and unsafe behaviors, such as speeding up to close the gap when the AV has initiated lane-changing [1]. Based on this assumption, the original lane-changing POMDP was designed with two terminal states and two atomic actions. The first and last states (s_0: Safe to change lanes and s_4: Failure/Collision) were defined as terminal states, meaning that no transitions to other states were enabled from these states. In addition, the "initiate lane changing" and "stop" actions were considered atomic actions, meaning that once one of these actions was triggered by the AV (and the underlying POMDP), the action could not be changed until a full stop or a full lane change was accomplished, which would ultimately result in terminating the lane-changing POMDP model. However, in the new environment, the assumptions about the behaviors of other vehicle agents in adjacent lanes are relaxed, meaning that risky and unsafe situations such as "the vehicle behind the AV in the adjacent lane speeds up to close the gap when the AV has initiated lane-changing" are expected to happen. This implies that employing the original lane-changing POMDP model in the new environment will always lead to collisions and failures if such situations are encountered during lane-changing, since the "initiate lane changing" action and the "safe to change lanes" state were designed as an atomic action and a terminal state, respectively. To perform adaptation and refinement for the original lane-changing model, risky and unsafe behaviors within a new multi-lane freeway environment are simulated in the CARLA simulation environment and data (e.g., distances between the AV and the cars in front and behind in the adjacent lane, relative velocities, and TTC) is collected from 12 different scenarios. The severity of the risky behaviors (e.g., the acceleration rate of the other vehicle agent behind the AV in the adjacent lane, the size of the gap in the adjacent lane during lane-changing) in the simulated scenarios depends on various scenario parameters (e.g., initial locations of vehicles, velocities and changes in velocities, and maximum traveled distance). The data collected during adaptation via expansion, where possible new and missing states are initialized based on similarity with existing states, indicates that the new state H'_0 and its estimated dynamics are highly similar to the "Safe to change lanes" (as this situation happens immediately after transitioning to s_0 ∈ S_LC), "Unsafe to change lanes", and "Safe to change lanes, not beneficial (slower)" states within the state-space of the original POMDP, and the estimated policy for the new state indicates "change lanes" or "maintain status quo" when the model transitions to H'_0.
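The similarity test underlying this expansion step (described in detail in section 4.3) can be sketched roughly as follows: a candidate new state, represented by the centroid of its observation cluster, is compared against the centroids of the existing state clusters, and a new state such as H'_0 is only initialized when no existing state is sufficiently similar. The distance measure, threshold value, and function names below are illustrative assumptions.

```python
import numpy as np

def most_similar_state(candidate_centroid, state_centroids, max_distance=0.1):
    """Return the label of the existing state whose cluster centroid is closest
    to the candidate centroid, or None if none are within `max_distance`
    (in which case a new state would be initialized)."""
    best_label, best_dist = None, float("inf")
    for label, centroid in state_centroids.items():
        dist = np.linalg.norm(np.asarray(candidate_centroid) - np.asarray(centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else None
```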
Figure 4.18 provides the distribution of observations before (left-hand side) and after (right-hand side) initiating the lane-changing action based on the following features: distance-first (the measured distance between the AV and the vehicle behind in the adjacent lane), distance-third (the measured distance between the AV and the vehicle in front in the adjacent lane), TTC-first (time-to-collision between the AV and the vehicle behind in the adjacent lane), and TTC-third (time-to-collision between the AV and the vehicle in front in the adjacent lane). Performing "speed up", "maintain status quo", or "change lanes" in H'_0 in the simulated scenarios resulted in crashing into the vehicle behind the AV in the adjacent lane (maintain status quo when lane changing is initiated) or colliding with the front vehicle in the adjacent lane (speed up to catch up with the vehicle behind the AV), indicating the need for further refinements in the adapted lane-changing model. To this end, the following refinements are applied:
• The "Safe to change lanes" state is made non-terminal (transitions from this state are enabled)
• The "Initiate lane changing" action is made non-atomic (lane changing can be aborted)
• A "Lane changing initiated, but unsafe" state is added to the model (the model can transition from the "safe to change lanes" state to this state after performing the action "initiate lane changing")
• An "Abort lane-changing" action is added to the action-space
• The dynamics of the new state and action are initialized similar to the "Unsafe to change lanes" state and fine-tuned with respect to a series of observation-action pairs expected from the new environment

Figure 4.18: Results of data analysis performed on data collected during adaptation via expansion in the lane-changing model

After refining the adapted lane-changing model, both the refined lane-keeping and lane-changing models are integrated with a simulation environment in CARLA, where the other vehicle agents in the new environment demonstrate risky and unsafe behaviors, such as suddenly stopping on the lane, cutting off to other lanes, and speeding up to close the gap when the AV has initiated the lane-changing action. The diagram in figure 4.19 provides an overview of the integration process. The required functions and algorithms for this integration are implemented as part of the "POMDP-controller.py" Python API in CARLA.

Figure 4.19: Overview of integrating the adapted and refined lane-keeping and lane-changing models with a risky, unsafe, complex environment and use-case scenario simulated in CARLA

In this simulation, the raw observations (e.g., distances, relative velocities, and TTCs associated with all vehicles within the perimeter of the AV) are obtained from the simulation at every time-step (time-step = 0.075 seconds; hardware specs: Core i7 CPU @ 3.4 GHz x 8 (Ubuntu 18.04), graphics: NVIDIA GeForce RTX 2080 SUPER). Initially, the raw observations are fed into a high-level tactic planning model (i.e., trigger-POMDP(), implemented as a decision-tree) to identify which POMDP model (lane-keeping or lane-changing) should be triggered. Depending on the enabled POMDP, these observations are then processed within another function (i.e., the Observation-labels() function) to identify which state(s) of the triggered POMDP the observations refer to (i.e., which states can be inferred from the observation). In the next step, the observation is fed into the triggered POMDP and the N-Step Look-Ahead algorithm (N = 3, selected given the sampling rate of the simulation), which employs parallel processing for policy estimation.
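A compressed sketch of this per-time-step pipeline is given below, with the longitudinal and lateral controllers described next passed in as callables; the function names mirror the components named above (trigger-POMDP(), Observation-labels(), N-Step Look-Ahead), but their signatures and the POMDP object attributes are assumptions for illustration.

```python
TIME_STEP = 0.075  # seconds, matching the simulation sampling rate

def control_step(raw_obs, pomdps, trigger_pomdp, observation_labels,
                 n_step_look_ahead, longitudinal_ctrl, lateral_ctrl):
    """One iteration of the POMDP-CARLA integration loop described above."""
    # 1) High-level tactic planning: pick the lane-keeping or lane-changing POMDP.
    pomdp = trigger_pomdp(raw_obs, pomdps)

    # 2) Map raw measurements onto the observation clusters of that POMDP.
    observation = observation_labels(raw_obs, pomdp)

    # 3) Belief update (using the previously executed action) and online
    #    policy estimation with N-Step Look-Ahead (N = 3).
    pomdp.update_belief(pomdp.last_action, observation)
    action = n_step_look_ahead(pomdp, pomdp.belief, n=3)
    pomdp.last_action = action

    # 4) Convert the tactical action into throttle and steering commands
    #    via the controllers described next, then actuate the vehicle.
    throttle = longitudinal_ctrl(action, raw_obs)
    steering = lateral_ctrl(action, raw_obs)
    return throttle, steering
```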
Longitudinal (for throttle control given a target speed) and lateral (for steering control) controllers are implemented to estimate the throttle and steering values based on the obtained acceleration (or deceleration) and waypoint, respectively. Finally, the throttle and steering values are fed into the vehicle controller to be executed in the environment. Figure 4.20 provides the behavior of the vehicle agents and the AV actions for one of the simulated scenarios in CARLA.

Figure 4.20: Example of behavioral results obtained from the CARLA-POMDP integration; the AV is labeled as Ego in this figure

In this simulation, the AV initially performs lane-keeping until the lane-changing POMDP is triggered, where the AV has to abort lane changing as the vehicle behind the AV in the adjacent lane speeds up to close the gap. After the AV successfully aborts lane changing and returns to its lane, it realizes that the other vehicle agent in front has stopped on the lane, and thus brakes with the maximum acceleration rate to avoid a collision with this vehicle.

4.5 Policy Transfer in Expandable-Compact POMDP Models

To perform policy transfer in Expandable-Compact POMDP models, an approximate technique and algorithm is implemented that employs a mapping function to identify how the states and actions of the source (original) POMDP model and environment map to the states and actions of the target POMDP model and environment. In other words, this technique assumes that the source POMDP model and Q-values, and the target POMDP model (e.g., states, actions, dynamics of the environment), are given; thus, it becomes feasible to adapt/update the source policy (i.e., Q-values) to the target task/environment. This technique allows for transfer between tasks/environments that have extra states/actions and slightly different environment dynamics. The mapping function is used for estimating an approximate belief-space (i.e., a collection of possible belief probabilities expected from ideal system-environment interactions [107]) for the target model based on the source belief-space, and also for initializing the target Q-values, where a customized Q-learning [46] algorithm is employed to update the Q-values for the target model. The policy transfer problem addressed here can be summarized as follows: Given {Environment, Task, POMDP, π}_source and {Environment', Task', POMDP'}_target, find π' for POMDP' such that Task' can be accomplished in Environment'. The main idea behind the policy transfer technique is to use a mapping function and the estimated Q-values from the source POMDP to learn and estimate new Q-values for POMDP' (i.e., the target POMDP) using the customized Q-learning algorithm. The important assumption in this algorithm is that the belief-space and Q-values for the source model are available. This technique achieves policy transfer in three steps.

Step 1: Define a mapping function. Similar to the majority of TL techniques and algorithms, the implemented policy transfer algorithm requires defining a function that maps the target states and actions to the source state and action spaces. The general form of the mapping function (e.g., for state mapping) is provided in equation 4.14.

∀ s' ∈ S_POMDP_target : f(s') → {w_i s_i}, i ∈ {1, 2, ..., |S_POMDP_source|}    (4.14)

where f(s') denotes a mapping function that maps a target state to at least one source state, {w_i s_i} is the weighted set of source states mapped to, and w_i (with Σ w_i = 1) identifies the similarity measured between s' and s_i.
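A minimal sketch of how such a state mapping can be represented, and how it can be used to project a target belief into the source belief dimension (needed in Step 2), is shown below; the dictionary layout and function names are illustrative assumptions, and the first entry mirrors the f(s'_0) → {0.5 s_1, 0.5 s_2} example discussed next.

```python
import numpy as np

# Hypothetical state mapping: each target state index maps to a weighted set
# of source state indices (weights sum to 1), as in equation 4.14.
state_mapping = {
    0: {1: 0.5, 2: 0.5},   # e.g., f(s'_0) -> {0.5 s_1, 0.5 s_2}
    1: {1: 1.0},
    2: {2: 1.0},
    3: {3: 1.0},
}

def augment_belief(target_belief, state_mapping, n_source_states):
    """Project a target belief (over |S'| states) into the source belief
    dimension (|S| states) so it can be compared with source belief clusters."""
    source_belief = np.zeros(n_source_states)
    for s_target, prob in enumerate(target_belief):
        for s_source, weight in state_mapping[s_target].items():
            source_belief[s_source] += weight * prob
    return source_belief
```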
This implies that the mapping function can be defined such that a target state maps to exactly one, or to more than one, source state. For instance, if a target state s'_0 maps equally to s_1 and s_2 from the source state-space, the mapping for these states can be defined as f(s'_0) → {0.5 s_1, 0.5 s_2}. Using this mapping function, the beliefs within the source belief-space can be augmented/compressed if there are additional/fewer states in the target state-space.

Step 2: Estimate the belief-space for the target POMDP. Policy transfer using this technique is enabled by transferring the Q-values estimated for the source POMDP to the target model so that target policies can be estimated based on the target Q-values. For this purpose, the customized Q-learning algorithm is employed for estimating the Q-values for the original (source) model. To enable policy and Q-value transfer, the belief-space for the target model should be estimated and discretized by finding clusters of beliefs with similar distributions (re-defining states for customized Q-learning). To achieve this, a policy needs to be followed so that the estimated belief-space for the target model represents the expected behavior of the system (and model) in its environment [107]. However, the policy for the target environment is exactly what is missing (we want to estimate the policy for the target model and environment). To address this issue, the target belief-space is approximated based on the source belief-space, the source policy, and the defined mapping function. For this purpose, the belief at time-step t = 0 is initialized for the target POMDP as a uniform distribution over all target states (b'_{t=0} = (1/|S'|, ..., 1/|S'|)). Next, given the state mapping function, the belief is augmented such that b' has the same dimension as the beliefs from the source belief-space, and the augmented belief is then compared to the centroids of the available clusters within the source POMDP's belief-space to find the most relevant cluster. After finding the cluster, the estimated Q-values associated with the centroid of that cluster (available in the source belief-space) are used to find the best source action. The action mapping function is then employed to map the source action to an action from the target action-space. If the source action maps to more than one action in the target action-space, a random choice is employed so that the approximated target belief-space accounts for beliefs that result from performing any of these actions. The selected mapped action is then employed to update the current target belief based on all possible observations (available in the emission matrix/function), and all updated beliefs are added to a list (the list ultimately provides the approximated target belief-space). After adding the estimated beliefs to the list, one belief is randomly selected and the above procedure is repeated until a terminal condition is met. To improve the estimated belief-space, the above process is repeated and all estimated beliefs are stored. The pseudo-code for this algorithm is provided in the Appendix.
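The Step 2 procedure can be summarized roughly as follows; the sketch reuses the augment_belief helper from the earlier mapping sketch, assumes the source belief cluster centroids, source Q-values, mapping functions, and a belief-update routine are available as inputs, and uses a fixed iteration count in place of the terminal condition (the actual pseudo-code is in the Appendix).

```python
import random
import numpy as np

def approximate_target_belief_space(target_pomdp, source_centroids, source_q,
                                    state_mapping, action_mapping,
                                    n_iterations=1000):
    """Roll out beliefs for the target POMDP by borrowing the source policy
    through the mapping functions (Step 2 described above)."""
    n_target_states = len(target_pomdp.states)
    belief = np.full(n_target_states, 1.0 / n_target_states)  # uniform b'_0
    belief_space = [belief]
    for _ in range(n_iterations):
        # Project the target belief into the source dimension and find the
        # most similar source belief cluster.
        augmented = augment_belief(belief, state_mapping, len(source_centroids[0]))
        cluster_id = min(range(len(source_centroids)),
                         key=lambda c: np.linalg.norm(augmented - source_centroids[c]))
        # Best source action for that cluster, mapped to a target action
        # (random choice if the source action maps to several target actions).
        source_action = int(np.argmax(source_q[cluster_id]))
        target_action = random.choice(action_mapping[source_action])
        # Update the belief for every possible observation and keep them all.
        successors = [target_pomdp.update_belief(belief, target_action, obs)
                      for obs in target_pomdp.observations]
        belief_space.extend(successors)
        belief = random.choice(successors)
    return belief_space
```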
Step 3: Estimate Q-values for the target POMDP using customized Q-learning and the estimated target belief-space. Finally, in the last step, the approximated target belief-space from step 2 is discretized into clusters of beliefs with similar distributions to re-define the states for customized Q-learning. For this purpose, the K-means clustering algorithm is employed, where k (i.e., the number of clusters) is defined proportional to |S'| · |A'| · |Ω'|. After identifying the clusters, the Q-values are initialized for customized Q-learning on the target POMDP and environment. If no transfer is employed, the Q-values are initialized as zeros. Here, instead of initializing with zeros, the estimated Q-values from the source are employed. In other words, similar to the belief-space generation procedure, the similarity between the (augmented) target belief cluster centroids and the centroids of the clusters within the source belief-space is measured, and if there exists a cluster in the source belief-space with a high similarity score, the source Q-values estimated for that cluster are employed to initialize the target Q-values (the source Q-values are used directly for initializing the Q-values for all a ∈ A ∩ A'). Q-values for belief clusters with no similarity to any source belief cluster are initialized as zeros. If there exists an extra action in the target action-space for which there is no pre-estimated Q-value, the average of the Q-values associated with that (re-defined) state and all source actions is employed to initialize the Q-value for the extra action, as shown in equation 4.15 [101, 102].

Q(c(b' ∈ B'), a')_{a' ∈ A', a' ∉ A} = Σ_{∀a ∈ A} Q(c(b_similar ∈ B), a) / |A|    (4.15)

where Q(c(b' ∈ B'), a')_{a' ∈ A', a' ∉ A} is the Q-value associated with a belief cluster (re-defined state for customized Q-learning) within the approximated target belief-space and an action a' that is not available in the source model, Σ_{∀a ∈ A} Q(c(b_similar ∈ B), a) is the sum of the Q-values over all source actions for the source belief cluster that is most similar to the current target belief cluster, and finally, |A| is the size of the source action-space.

To evaluate the performance of the policy transfer technique and algorithm, this technique is applied to transfer the original lane-keeping and lane-changing policies (designed for a multi-lane freeway with no risky behaviors) to the new multi-lane freeway environment where risky behaviors exist. The main assumption in this experiment is that the source POMDP models, the associated Q-values (estimated using customized Q-learning in chapter 3), and the target POMDP models are available, and this information is employed for defining the mapping functions. The target lane-changing model is defined by S' = {s'_0: Safe to change lanes, s'_1: Safe, not beneficial (slower), s'_2: Safe, not beneficial (faster), s'_3: Unsafe to change lanes, s'_4: Failure/collision, s'_5: Initiated lane changing, but unsafe} and A' = {a'_0: Maintain status quo, a'_1: Speed up, a'_2: Slow down, a'_3: Change lanes, a'_4: Stop, a'_5: Abort lane changing}; the target lane-keeping model is defined by S' = {s'_0: Slower than traffic, s'_1: Faster than traffic, s'_2: Safe/nominal, s'_3: About to crash, s'_4: Failure/collision} and A' = {a'_0: Maintain status quo, a'_1: Speed up, a'_2: Slow down, a'_3: Brake with max acceleration rate}. Figure 4.21 provides the state and action mapping functions for the lane-keeping and lane-changing models. After defining the mapping functions, these functions are employed for approximating the target belief-spaces for the target lane-keeping and lane-changing models. In total, 230K and 39K beliefs with 27 and 75 clusters are identified for the target lane-keeping and lane-changing models, respectively. To identify the clusters of similar beliefs in the approximated target belief-spaces, the K-means clustering algorithm is employed.
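The initialization rule above, including the equation 4.15 average for actions that exist only in the target action-space, can be sketched as follows; the similarity test is the weighted Euclidean distance with the 0.1 threshold described next, while the data layout and the direction of the action map are illustrative assumptions.

```python
import numpy as np

def init_target_q(target_centroids, source_centroids, source_q,
                  action_map, n_target_actions, weights, sim_thresh=0.1):
    """Initialize target Q-values from source Q-values (transfer).
    target_centroids: augmented target belief cluster centroids (source dim).
    source_q[c]: Q-values of source cluster c over source actions.
    action_map: target action index -> source action index, or None when the
    action has no source counterpart (handled by the equation 4.15 average)."""
    q_target = np.zeros((len(target_centroids), n_target_actions))
    for i, centroid in enumerate(target_centroids):
        # Weighted Euclidean distance to every source cluster centroid.
        dists = [np.sqrt(np.sum(weights * (centroid - sc) ** 2))
                 for sc in source_centroids]
        best = int(np.argmin(dists))
        if dists[best] > sim_thresh:
            continue  # no similar source cluster: leave this row at zero
        for a_t in range(n_target_actions):
            a_s = action_map.get(a_t)
            if a_s is None:
                # Extra target action: average of source Q-values (eq. 4.15).
                q_target[i, a_t] = np.mean(source_q[best])
            else:
                q_target[i, a_t] = source_q[best][a_s]
    return q_target
```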
As described above, the next step in policy transfer is associated with initializing the target Q-values. In this experiment, the similarity between the target and source beliefs is measured using a weighted Euclidean distance, where the weights associated with the states are the weights defined in the mapping function. In both experiments, the similarity threshold is defined as Sim_thresh = 0.1, meaning that if there exists a source belief centroid (i.e., a re-defined state for customized Q-learning) whose weighted Euclidean distance from the target belief centroid is less than the threshold, the source Q-values estimated for that source belief are employed for initializing the Q-values of the target belief. If there exists no source belief centroid with high similarity (weighted Euclidean distance ≤ 0.1), the Q-values are initialized as zeros.

Figure 4.21: State and action mapping functions for the lane-keeping (top) and lane-changing (bottom) models

After initializing the target Q-values based on the source Q-values and the mapping functions, Q-learning is employed to estimate the target Q-values. The number of episodes employed in Q-learning is 1500, and the results obtained from transfer for both models are provided in figure 4.22. The value and benefit of transfer using this technique are measured in terms of the total reward achieved after transfer, jumpstart, and asymptotic performance, by comparing the sum of rewards collected via transfer to the sum of rewards collected when the Q-values for the target POMDP are estimated from scratch.

Figure 4.22: Long-term sum of rewards in transfer vs. no transfer for the lane-keeping (left) and lane-changing (right) models. The value of transfer is shown in terms of the total reward collected after transfer, asymptotic performance, and jumpstart

As shown in this figure, Q-learning with transferred Q-values and policies achieves a higher overall reward compared to estimating Q-values from scratch (Q-values initialized as zeros). In addition, using transfer in this experiment results in a notable jumpstart for both models. Also, Q-learning with transferred Q-values converges to the maximum collected reward faster than Q-learning without transfer (lane-keeping converges around the 700th episode and lane-changing converges around the 850th episode), which implies that optimal Q-values can be obtained much earlier in the Q-learning process (i.e., a lower number of episodes is required). To further verify the results, a small number of beliefs with high probabilities assigned to s'_3: About to crash (from the target lane-keeping model) and s'_5: Lane changing initiated, unsafe (from the target lane-changing model) are randomly selected and their Q-values are tracked during Q-learning. Figure 4.23 and table 4.4 provide the estimated Q-values for these beliefs.
Table 4.4: Estimated Q-values during transfer for target beliefs with high probability assigned to state s'_5: Lane changing initiated, unsafe, from the lane-changing model

Sampled Target Beliefs                      | Transferred Q-values for ∀a' ∈ A'
[0.038, 0.020, 0.019, 0.025, 0.045, 0.852]  | [5.056, 2.055, 3.620, -5.573, -1000, 14.104]
[0.261, 0.022, 0.022, 0.032, 0.349, 0.315]  | [0.592, -5.248, -4.730, -1.532, -1000, 2.478]
[0.039, 0.015, 0.016, 0.0109, 0.165, 0.754] | [2.208, -3.019, -3.816, -4.708, -1000, 7.401]
[0.255, 0.002, 0.002, 0.002, 0.167, 0.570]  | [0.0, -0.829, -0.794, -1000, 12.86]

Figure 4.23: Estimated Q-values during transfer for target beliefs with high probability assigned to state s'_3: About to crash, from the lane-keeping model

As shown in figure 4.23, performing action a'_3: Brake with max acceleration in beliefs where the belief probability associated with s'_3: About to crash (from the target lane-keeping model) is high leads to the highest Q-value, implying that the transferred policies for beliefs with a high probability of "about to crash" lead to braking with the maximum acceleration rate when the AV encounters late-reveal situations in the complex, risky lane-keeping environment. The estimated belief-action Q-values (using transfer) for the lane-changing model (presented in table 4.4) show that the action a'_5: Abort lane changing has the highest Q-value for those beliefs where there is a (slightly) high probability assigned to state s'_5: Lane changing initiated, unsafe. This implies that the target lane-changing POMDP model executes "Abort lane changing" immediately after transitioning to the state "Lane changing initiated, unsafe". Verifying the transferred policies for these models in a simulation environment (e.g., CARLA) remains part of the future work in this research.

4.6 Summary and Conclusion

This chapter focuses on addressing adaptation and refinement in POMDPs. As discussed in this chapter, the majority of POMDP models are designed for a specific environment, scenario, and objective based on the limited data available at the outset. In other words, the majority of the available POMDP models employed for planning and decision-making problems assume that all the data required for formalizing the problem as a POMDP is available at the outset; thus, they design and formulate POMDPs using fixed state, action, and observation spaces and pre-defined dynamics. However, the dynamic nature of environments and the complexity of system-environment interactions imply that there may be situations, resulting from new, unknown information and changing environment dynamics, that are revealed from the environment and are not addressed in the initial POMDP formulation. POMDPs therefore need to be updated and adapted to account for new information and dynamics as they are revealed from the environment; otherwise, the unknown information can lead to inaccurate state inferences and poor model performance. There exist various techniques, such as re-planning, modeling all possible changes as state variables, and transferring knowledge, that mainly focus on adapting the estimated policy of the designed model to new dynamics and information. Recent studies and related research on POMDP adaptation focus on transferring the experience and knowledge gained from employing the original model in the initial environment to the new environment. This technique is known as transfer learning in the available literature.
Existing transfer learning techniques for POMDPs can be classified into the following categories: 1) policy adaptation and 2) model/data augmentation. While policy adaptation techniques and algorithms mainly focus on adapting the estimated policy online by modifying the estimated belief-tree and value function, the model/data augmentation-based techniques perform adaptation by augmenting the model and the existing data. In general, the main assumption in transfer learning is that some information about the source and target environments and tasks is available, and this information is employed to perform efficient knowledge transfer. The main limitation of policy adaptation algorithms and techniques is that they only address adaptation in the estimated policy and do not account for adaptation in the model itself. In addition, these techniques can only handle specific changes in the environment, such as changes in objectives. On the other hand, there exists only limited work on adaptation via model/data augmentation, and it mainly focuses on adaptation in spoken dialogue systems.

To enable model (and policy) adaptation in POMDPs and address the existing gaps and limitations, an adaptation and refinement technique and algorithm is presented in this chapter. This technique builds upon the core ideas of transfer learning via model/data augmentation and enables adaptation and refinement in Expandable-Compact POMDPs in two phases: 1) online adaptation via expansion and 2) offline post-expansion refinements. It is a hybrid model-based, data-driven technique that uses various visual and statistical data analysis techniques to perform adaptation and refinement. The main assumption in the initial phase, which is performed online while the model interacts with the environment, is that the new, unknown information is mainly associated with possible missing states, observations, and dynamics in the model; the algorithm therefore attempts to dynamically and gradually add new states (and associated observations) to the model and estimate the underlying dynamics by measuring the similarity between the observed information and the existing data as new information becomes available from the environment. To account for model scalability and maintain enough size and complexity in the model to achieve near real-time performance, the adaptation via expansion algorithm employs a labeling function to avoid the initialization of redundant states. The post-expansion refinement (phase 2), which is performed after adapting the model, employs various statistical and visual analysis techniques on the performance data and the data collected during adaptation to identify further inaccuracies and apply refinements. The refinements can be associated with discarding a subset of states, joining states, updating the observation function, adding new actions, updating heuristics, and fine-tuning model parameters. It is important to note that the applied refinements (second phase), such as adding new actions, should be verified with respect to constraints and requirements (e.g., the system's physical constraints).
The adaptation and refinement technique and algorithm provided in this chapter are employed for adapting and refining the lane-keeping and lane-changing POMDP models for safe, smooth, and collision-free planning and decision-making in more complex, risky, and unsafe environments and use-case scenarios, where the other vehicle agents in the perimeter of the AV demonstrate risky behaviors such as sudden stopping on the lane, cutting off to other lanes, and speeding up to close the gap when lane changing is initiated. The adaptation and refinement technique applied to the lane-keeping model is discussed in detail, and the adapted and refined model(s) is evaluated in 138 scenarios simulated in CARLA. For comparison purposes, the obtained performance is compared to policy transfer and a rule-based decision-making technique. In addition, a probabilistic, model-based technique is employed to verify the refined and adapted model by calculating the extreme and expected probabilities of failure for the model. Later, the same procedure is applied to adapt and refine the lane-changing POMDP model. Finally, both adapted and refined models are integrated with the simulated environment in CARLA and the results are presented.

For evaluation and comparison purposes, a policy transfer technique and algorithm is also provided in this chapter. The main assumption in this technique, similar to the majority of transfer learning techniques and algorithms, is that the source and target models/environments are available a-priori. This information is employed for defining a mapping function that maps the target states and actions to source states and actions. The mapping function is later used for approximating the target belief-space based on the source belief-space and for initializing target Q-values from the estimated source Q-values. Customized Q-learning is then employed for estimating the target Q-values and finding optimal policies for the target POMDP and environment. This technique allows for transfer between models and environments with additional states and actions. The policy transfer technique and algorithm is employed to transfer policies (i.e., Q-values) from the original lane-keeping and lane-changing models, initially designed for a simple multi-lane freeway with no risky behaviors, to high-risk and unsafe environments where risky behaviors such as stopping on the lane, cutting off to other lanes, and speeding up to close the gap when the AV initiates lane changing are expected from the other vehicles (agents) in the environment. The value of transfer for these models is demonstrated in terms of the total reward collected with transfer, asymptotic performance, and jumpstart.

Chapter 5
Summary and Future Directions

Complex systems, such as AVs and UAVs, typically operate in dynamic, uncertain, partially observable, and reactive environments, where they continuously sense information from system-environment interactions, process the information to infer underlying states from partially available observations, and plan accordingly. This can be achieved by designing and employing behavioral models that represent system-environment interactions and using these models for processing observations and making decisions (responding to observations from the environment).
The designed models need to account for the existing uncertainty and partial observability in the environment while adhering to the time and computational constraints posed by the environment and the system (e.g., real-time data processing and decision-making in the presence of limited computational resources). Moreover, due to the dynamic nature of the environment and the complexity of the system-environment interactions, there may be situations that are revealed from the environment only after the model is deployed. This implies that the models should be able to efficiently account for new, unknown information (resulting from unknown situations or changing dynamics in the environment) as it is revealed through interaction with the environment. In other words, models need to scale up efficiently to become a true representation of the system-environment interactions while maintaining enough size and complexity to achieve near real-time performance.

The Partially Observable Markov Decision Process (POMDP) is a model-based technique for addressing continuous planning and decision-making problems in the presence of uncertainty and partial observability. A POMDP model captures the behavior of a complex system in its environment by defining state and observation spaces and estimating the underlying behavioral dynamics, where the decisions are embedded within the model as actions and the behavior of the system in the environment is expressed in terms of action-enabled transitions between states. POMDPs account for partial observability by probabilistically inferring the underlying state (belief) from partial observations, and uncertainty is addressed by employing probability distributions to represent the underlying dynamics (i.e., transitions and emissions). The overall goal in POMDP models is to find an optimal mapping between beliefs (i.e., probabilistically inferred states) and actions (an optimal policy) such that the objective in the environment can be achieved by performing those actions. POMDP models have been successfully employed for addressing planning and decision-making in various scenarios and environments, such as tactical and motion planning for autonomous vehicles in intersections, lane-changing, and pedestrian avoidance, and path planning and navigation for unmanned aerial vehicles in object detection and tracking and in search and rescue. Although the POMDP modeling technique is theoretically and mathematically powerful for addressing planning and decision-making in uncertain, dynamic, and reactive environments, it is rarely employed in real-world applications of autonomous planning and decision-making. This is due to the fact that POMDP models suffer from scalability issues, mainly because of the probabilistic nature of the models (i.e., POMDPs are PSPACE-complete) and the use of large state-spaces. For this purpose, the majority of POMDP-related work focuses on using approximate solutions to address the scalability issues. However, most of this work only addresses scalability issues in policy estimation, not in the POMDP model formulation. In addition, the majority of the existing POMDP work in the available literature formulates POMDPs for a specific environment and use-case scenario while accounting for only a subset of uncertainties.
In other words, the available POMDP models assume that all the information required for formalizing the planning and decision-making problem as a POMDP is initially available at the outset, and they design POMDPs with a fixed set of states, observations, and action spaces and pre-defined dynamics. However, this is not a valid assumption, since there may be situations (e.g., new dynamics or unknown information) that are only revealed when the system interacts with its environment (i.e., unknown unknowns: information that we don't know that we don't know). In light of the foregoing, there is a clear need for scalable, adaptive probabilistic models, i.e., scalable, adaptive POMDP modeling techniques, that 1) efficiently account for the partial observability and uncertainty present in the environment, 2) address scalability issues in model formulation and policy estimation, and 3) enable adaptability so that models can gradually learn and account for new information as it becomes available from the environment. Based on the discussion above, the hypothesis in this research can be summarized as follows: Certain approximations can be introduced to POMDP modeling using heuristics, data analysis, and machine learning techniques to realize POMDP modeling for planning and decision-making in real-world problem domains and environments by:
• Enabling gradual model adaptation and refinement to scale up models to become accurate representations of the true system-environment interactions without adding unnecessary complexity
• Containing state-space explosion and reducing computational load to achieve near real-time performance

For this purpose, this thesis builds upon the existing POMDP models, policy estimation, and adaptation techniques and employs data- and model-driven heuristics, machine learning, and data analysis techniques to introduce certain approximations that enable scalable, adaptive POMDP modeling for real-world problem domains and environments. Specifically, the research work presented in this thesis provides a modeling technique, "Expandable-Compact POMDPs"; implements an online, adaptive policy estimation algorithm, "N-Step Look-Ahead"; and provides an adaptation and refinement technique and algorithm, "Online Adaptation via Expansion" and "Post-Expansion Refinement", for POMDP models formulated using the Expandable-Compact POMDP modeling technique.

Chapter 2 in this thesis discusses the Expandable-Compact POMDP modeling technique. This technique mainly focuses on addressing the scalability issues in model formulation and design (i.e., state-space definition and the level of planning and decision-making addressed). The term "Compact" in this technique implies that the state-space of the model is a compact representation. Basically, to define a compact state-space, unsupervised learning (e.g., clustering, when data is not labeled) or supervised learning (e.g., classification, when labeled data is available) is employed to find patterns and/or clusters of similar situations in the data obtained from system-environment interactions, and the identified clusters or patterns are employed to represent states. The state-space definition also includes patterns/clusters of observations indicating failure and goal situations, which simplifies the formulation of a reward function. In addition, since states are approximated using patterns/clusters, the transition and emission function formulations are also simplified (e.g., matrices of probabilities instead of complex dynamic motion equations).
On the other hand, the term "Expandable" in this technique implies that the defined state, action, and observation spaces can be expanded and refined to account for new information as it becomes available from system-environment interactions. The Expandable-Compact POMDP is defined using a tuple <S+, A+, Ω+, T+, O+, R+>, where S+ denotes the expandable-compact state-space; A+ denotes the high-level action-space (including actions for tactical-level decision-making); Ω+ denotes the expandable observation-space, where observations belong to existing clusters/patterns; T+(s_i, a, s_j) represents the transition matrix that provides the probability associated with performing action a in s_i and transitioning to s_j, where the dynamics can be expanded to account for transitions to/from newly added states; similarly, O+(s_i, a, o_j) denotes the emission matrix that provides the emission (observation) probability of performing action a in state s_i and observing o_j; and finally, R+: S+ → R denotes the extended reward function that directly assigns reward/penalty values to the states depending on their category (i.e., goal, transient, and failure). In addition to scalability and adaptability, representing states using clusters/patterns also enables performing data analysis on the available clusters to identify data-driven relationships and dependencies between states, such as identifying which states are similar and measuring the similarity/difference between states and clusters. This information can be used for introducing approximations and defining heuristics that can be employed for adaptation, refinement, and near real-time policy estimation. Later in that chapter, the Expandable-Compact POMDP modeling technique is employed for defining POMDP models for safety-critical applications of autonomous vehicles in dynamic and uncertain environments, namely the lane-keeping and lane-changing POMDP models. These models are the basis of all the experimentation presented in this thesis.

Chapter 3 in this thesis addresses the policy estimation problem in POMDPs. An overview of the theoretical background of the policy estimation problem, the existing techniques and algorithms (including exact offline solvers and approximate online algorithms), and their advantages and limitations is provided in that chapter. The focus of the policy estimation algorithms is on addressing scalability issues to achieve near real-time performance in policy estimation. Although online algorithms are much faster than offline techniques and can account for some changes in the environment without estimating the policy from scratch, the most effective and accurate online algorithms require some information, such as samples of different behavioral paths and offline pre-calculations, to be available a-priori. In other words, the a-priori information is employed to introduce approximations for addressing the existing time-performance trade-offs in online algorithms. Moreover, the majority of these algorithms are implemented for motion-planning and path-planning POMDPs, and configuring them for other types of POMDP formulations is highly time-consuming and challenging. For this purpose, an online, adaptive policy estimation algorithm, N-Step Look-Ahead, is implemented and presented in this thesis that can estimate policies online (when the model is executed in the environment) without relying on a-priori information.
To achieve near real-time performance and address the time-performance trade-offs, this algorithm employs heuristics and unsupervised learning (i.e., distance-based clustering) to prune sub-optimal solutions, avoid redundant calculations, and guide the search for the optimal policy towards valuable paths. Similar to existing online algorithms and techniques, this algorithm finds the policy based on the most recent belief obtained from system-environment interactions by constructing and traversing a belief-tree of depth N (the current belief is located at the root node of the tree, and the nodes and paths in the other levels of the tree are estimated from the root belief based on the available actions and observations). The N-Step Look-Ahead algorithm builds and traverses the tree in a recursive manner. The heuristics employed in this algorithm dictate that only non-terminal belief nodes with high probabilities (i.e., beliefs reachable from the parent belief node) are expanded and evaluated. In addition, as the search for the optimal policy continues, the values of explored paths in the belief-tree and the belief probabilities of evaluated nodes are stored, and distance-based clustering is employed so that the algorithm does not expand (similar) belief nodes for which the value has already been calculated and is available. It is shown that the employed heuristics and distance-based clustering reduce the exponential search time associated with a full tree to an almost linear computation time. Later in that chapter, the N-Step Look-Ahead algorithm is employed for policy estimation for the lane-keeping and lane-changing models, and the performance is compared to end-to-end (implemented using deep learning - neural networks) and rule-based (implemented using decision-trees) decision-making techniques. For comparison purposes, the Q-learning algorithm is employed to estimate offline policies for the lane-keeping and lane-changing POMDP models, and the obtained policies are compared to the policies estimated online using the N-Step Look-Ahead algorithm. The results obtained from these comparisons indicate that N-Step Look-Ahead can find optimal policies by looking only a few steps (i.e., N) ahead into the future, where N can be identified experimentally depending on the model size and the sampling rate from the simulation/real environment. Moreover, to demonstrate the applicability of N-Step Look-Ahead in other problem domains, the algorithm is employed for path-planning in a target-search use-case scenario (i.e., finding indications of ancient water on the Mars surface using the Mars Helicopter). To further reduce the policy estimation time, parallel processing is implemented so that the belief-action values for the current belief and all existing actions can be calculated in parallel, reducing the overall policy estimation time to T(policy estimation)/|A|.

Finally, chapter 4 in this thesis addresses adaptation and refinement in POMDPs. An overview of the existing adaptation techniques, including transfer learning, their applications, advantages, and limitations is provided in that chapter. Later in that chapter, the adaptation via expansion and post-expansion refinement algorithm and technique are presented, and the theory and implementation process are discussed in detail.
This technique builds upon the core idea of knowledge transfer (re-using the experience gained from deploying a model in one environment for another model in a similar but different environment) and uses data analysis techniques to enable gradual model adaptation and refinement. Specifically, data analysis techniques are employed to formulate a similarity function based on which possible new states (resulting from new, unknown information) are gradually added to the model, and the associated underlying dynamics are estimated based on the measured similarity with the existing states and observation clusters. To account for scalability during adaptation, a labeling function is implemented that avoids the initialization of redundant states, where the new information is used for updating the old estimates instead of expanding the model. Later, in the post-expansion refinement phase (applied after online adaptation), visual and statistical data analyses are performed on the data collected during adaptation (i.e., data associated with new states) and on the performance data of the adapted model to isolate and reason about inaccuracies and inefficiencies in the model. Based on these results, further refinements are performed to increase the performance and accuracy of the adapted models. Later in that chapter, the adaptation via expansion and post-expansion refinement algorithm and technique are employed for adapting the original lane-keeping model to a more complex and unsafe environment and use-case scenario, where risky behaviors such as sudden stopping on the lane and cutting off to other lanes are observed from other vehicle agents within the perimeter of the AV. The details of the adaptation and refinement process for this model are discussed, and the refined model(s) is evaluated in 137 scenarios simulated in CARLA. In this experiment, for comparison purposes, a policy transfer technique using Q-learning (i.e., transferring Q-values) is also implemented and integrated with the CARLA simulation environment. The results obtained from evaluating the models (adapted and refined vs. transferred policy) indicate that the adapted and refined POMDP achieves better performance (in terms of policy estimation time, safety, and robustness of decisions) compared to the other techniques and models employed in this experiment. In addition, the results of the probabilistic, model-based performance evaluation of the adapted and refined POMDP model confirm the results obtained using statistical data analysis techniques. Finally, the same procedure is applied to the lane-changing POMDP model, and both adapted and refined POMDP models are integrated within a simulated complex and unsafe environment in CARLA, where risky behaviors such as sudden stopping on the lane, cutting off to other lanes, and speeding up to close the gap when the AV initiates the lane-changing action are all simulated simultaneously in the simulated scenario and environment. The results of this experiment indicate that the adapted and refined POMDP models, and the policies estimated for them using N-Step Look-Ahead, are able to perform safe and collision-free maneuvers in this environment.

Possible future directions for this research include formalizing various tactical planning problems, such as pedestrian avoidance, intersection crossing, and unprotected left turns, using the Expandable-Compact POMDP modeling technique.
Currently, the majority of the research focuses on motion planning using POMDPs for these problems, where a continuous state-space representation and 2D dynamic motion equations are employed in the POMDP formalization. Another possible direction for this research is addressing the performance-time trade-offs in the N-Step Look-Ahead algorithm. Specifically, the N-Step Look-Ahead algorithm uses memory and distance-based clustering to identify which belief nodes and paths have already been evaluated in the belief-tree, and it checks whether a value is available in memory by measuring the Euclidean distance between the current belief node (to be evaluated) and all belief nodes stored in memory at the level/depth at which the node is located. To speed up the search for available beliefs and values, a customized hash table data structure can be implemented, where the keys are a combination of the node level in the tree and the belief probability. Since a hash table, when matching a query to existing keys, looks for exact matches between the query and the keys, using belief probabilities and node levels directly as keys will result in a very large number of keys even if the belief probabilities are only slightly different. One possible solution is to use "lambda functions" (a capability in Python) to customize key-query matching in a hash table. This would reduce the search for existing beliefs and values to O(1), further speeding up the search for the optimal policy.
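One way such a lookup could be realized is sketched below: beliefs are quantized (rounded) before being combined with the tree level into a dictionary key, so that near-identical beliefs collapse onto the same key and lookups stay O(1). This is only one possible realization of the idea above (rounding in place of customized key-query matching), and the class name and precision parameter are illustrative assumptions.

```python
class BeliefValueCache:
    """Memoize belief-node values by (tree level, quantized belief) so that
    beliefs differing only by tiny numerical noise share a single key."""

    def __init__(self, precision=2):
        self.precision = precision  # decimals kept when quantizing beliefs
        self._table = {}

    def _key(self, level, belief):
        return (level, tuple(round(p, self.precision) for p in belief))

    def get(self, level, belief):
        return self._table.get(self._key(level, belief))

    def store(self, level, belief, value):
        self._table[self._key(level, belief)] = value

# Usage: consult the cache before expanding a belief node in the tree search.
cache = BeliefValueCache(precision=2)
cache.store(1, [0.1, 0.2, 0.7], value=3.5)
print(cache.get(1, [0.101, 0.199, 0.7]))  # -> 3.5 (treated as the same belief)
```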
5.1 Impact on MBSE and the Systems Engineering Community

As technology advances, systems continue to grow in scale and complexity. These systems usually operate in reactive, dynamic environments containing various sources of uncertainty, which implies probabilistic behavior. In addition, complex systems are usually exposed to a variety of known and unknown information, which requires the systems, and the models of those systems, to be adaptive so that they can gradually scale up and become a true representation of system-environment interactions.

Traditional modeling tools and techniques cannot adequately address the design and development of such complex systems, and there is a need for new modeling methods and tools that integrate techniques from different paradigms to account for probabilistic behavior, adaptability, and refinement through reasoning. The Expandable-Compact POMDP modeling method enables defining probabilistic and adaptive models with compact state-spaces and achieves near real-time performance by integrating model-based techniques with heuristics and machine learning. Specifically, the Expandable-Compact POMDP modeling technique builds on the core ideas of probabilistic modeling, decision-making, and adaptation, and allows designing and implementing "live" models of complex systems: models that continue to learn from system-environment interactions, adapt to various situations in the environment, accurately represent the behavior of the complex system in its environment, and achieve near real-time performance by efficiently addressing scalability issues in both model design and the decision-making process. This modeling technique extends systems modeling and Model-Based Systems Engineering (MBSE) tools by integrating model-based techniques with heuristics, machine learning, and data analysis, and it allows engineers and modelers to design partial models based on limited knowledge, intuition, and expert judgment; reason about errors and inefficiencies; isolate and visualize existing inaccuracies; address scalability issues; and make gradual refinements to the model while controlling and managing its complexity.
Appendices

A.1 Belief-Space Approximation Pseudo-code

The pseudocode in Algorithm 4 shows the procedure for approximating the target model and environment belief-space based on the source belief-space and the available mapping functions for the state and action spaces, f(S′) and f(A′), respectively. The algorithm takes as input all available belief clusters c ∈ C (i.e., the re-defined states used for Q-learning), a counter limit Count, a terminal condition Cond_T, the target state-space S′, action-space A′, and observation-space Ω′, the target transition probabilities T′ and emission probabilities O′, and the estimated source Q-values Q, and it approximates the target belief-space B′ within a while loop.

Algorithm 4 Pseudocode for Belief-Space Approximation Using Source Belief and Mapping Function
1: function ApproximateBeliefSpace(∀c ∈ C, Count, Cond_T, S′, A′, Ω′, T′, O′, Q)
2:   initialize i ← 0; B′ ← []
3:   while i ≤ Count do
4:     b ← uniform belief over S′ (each entry 1/|S′|)
5:     while True do
6:       insert b into B′
7:       b′ ← extend/compress b given f(S′)
8:       c_similar ← find c ∈ C with maximum similarity to b′
9:       action ← argmax_{a ∈ A} Q(c_similar, a)
10:      mapped_action ← f(action)
11:      if |mapped_action| > 1 then
12:        action ← randomChoice(mapped_action, w)
13:      else
14:        action ← mapped_action
15:      {o′} ⊆ Ω′ ← find high-probability observations given b and action
16:      for all o ∈ {o′} do
17:        b_next ← UpdateBelief(b, action, o, T′, O′)
18:        insert b_next into B′
19:      b ← UpdateBelief(b, action, randomChoice({o′}), T′, O′)
20:      if Cond_T = True then
21:        break
22:    i ← i + 1
23:  return B′
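For reference, the UpdateBelief routine called in Algorithm 4 (and in Algorithm 5 below) is the standard Bayesian belief update for POMDPs. A minimal Python sketch is given below; the array layout, with transition probabilities indexed as T[a, s, s_next] and emission probabilities as O[a, s_next, o], is an assumption made for this sketch.

import numpy as np

def update_belief(b, action, obs, T, O):
    """Standard POMDP belief update:
    b'(s') is proportional to O(obs | s', action) * sum_s T(s' | s, action) * b(s).

    b : current belief over states, shape (|S|,)
    T : transition probabilities, shape (|A|, |S|, |S|)      (assumed layout)
    O : emission probabilities,  shape (|A|, |S|, |Omega|)   (assumed layout)
    """
    predicted = T[action].T @ b              # predicted next-state distribution
    unnormalized = O[action][:, obs] * predicted
    total = unnormalized.sum()
    if total == 0.0:                         # observation has zero likelihood under the model
        return np.full_like(b, 1.0 / len(b))
    return unnormalized / total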
A.2 Adaptation via Expansion Pseudo-code

The pseudocode in Algorithm 5 provides the procedure for online adaptation via expansion. This algorithm is implemented as a function in Python (Python 3.6); it takes as input the new, unknown observation at time-step t, the current transition T, emission O, and reward R matrices, the similarity dictionary (i.e., hash table) SD, and the belief decay factor z, and it returns the updated transition, emission, and reward matrices, which include the dynamics of the newly added state.

Algorithm 5 Pseudocode for the Online Adaptation via Expansion Algorithm
1: function AdaptationViaExpansion(o′_t, T, O, R, SD, z)
2:   initialize T_H^new ← zeros(|A|, |S| + 1)
3:   initialize O_H^new ← zeros(|A|, |Ω| + 1)
4:   initialize R_H^new ← 0
5:   Ω_sim, Index, label ← GenerateLabel(o′_t)
6:   {S_sim} ← {s_i ∈ S s.t. i ∈ Index}
7:   if label ∉ SD or SD = [] then
8:     SD[label] ← 1
9:   else
10:    SD[label] ← SD[label] + 1
11:  calculate Ω_{Δt=SD[label], S_sim} using Equation 4.6
12:  estimate T_H^new, O_H^new, and R_H^new using Equations 4.4 and 4.5
13:  if SD[label] = 1 then
14:    T[∀s ∈ S, ∀a ∈ A, H] ← ε        # augment T with small probabilities
15:    O[∀s ∈ S, ∀a ∈ A, o′] ← ε       # augment O with small probabilities
16:    insert T_H^new, O_H^new, R_H^new into T, O, R
17:  if Criterion(T_H^new, O_H^new, R_H^new) = True then
18:    update T_H^old, O_H^old, R_H^old using Equation 4.7
19:  return T, O, R

A.3 Example of Verifying Newly Added Actions with Respect to Physical Constraints and Requirements

The following presents an example of verifying POMDP actions (e.g., newly added actions after refinement) with respect to physical constraints and requirements. For the purpose of this example, the feasibility of action a_3, "brake with maximum acceleration rate" (−6 m/s²), in the refined lane-keeping model is evaluated and verified. The physics model and characteristics of an exemplar vehicle are extracted from the CARLA simulation. These characteristics are as follows:

• Vehicle mass: 1000 kg
• Tire radius: 30 cm (0.3 m)
• Maximum brake torque of a tire: 1500.00 N·m

Assuming that no aerodynamic force acts on the vehicle and that it travels in a straight line (no road curvature) with no slope, the feasibility of the "brake with maximum acceleration rate" action can be verified as follows. The force required for braking at −6 m/s², given a mass of 1000 kg, is 1000 × 6 = 6000 N. Assuming that all four wheel brakes are engaged during braking, there is no mass transfer, and the braking force is distributed equally over the wheels, the required brake torque per wheel is (6000 / 4) × 0.3 = 1500 × 0.3 = 450 N·m, which is less than the maximum brake torque specified for each tire (1500.00 N·m). This implies that the vehicle can brake at −6 m/s² without violating physical constraints.
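This feasibility check can also be reproduced programmatically. The short script below simply restates the arithmetic above for the CARLA-derived parameters; it is an illustrative sketch rather than part of the thesis implementation.

# Parameters extracted from the exemplar CARLA vehicle (listed above).
mass = 1000.0               # kg
tire_radius = 0.3           # m
max_brake_torque = 1500.0   # N·m available per tire
decel = 6.0                 # m/s^2, magnitude of the requested deceleration
n_wheels = 4

required_force = mass * decel                      # 6000 N of total braking force
force_per_wheel = required_force / n_wheels        # 1500 N, assuming equal distribution
required_torque = force_per_wheel * tire_radius    # 450 N·m per tire

feasible = required_torque <= max_brake_torque
print(required_torque, feasible)                   # 450.0 True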
Abstract
Complex systems, such as Autonomous Vehicles (AVs), usually operate in uncertain and reactive environments, where they are exposed to a variety of noisy and partially available information from the environment. The dynamic and reactive nature of these environments requires the systems to continuously adapt and respond with a plan or decision while accounting for the uncertainty and partially available information and adhering to constraints posed by the system and environment. This implies that when designing models to capture the system-environment interactions (i.e., the behavior of the system in its environment), deterministic approaches and traditional, static modeling techniques may not be sufficient to account for the probabilistic behavior of complex systems in their environment.
Probabilistic models and techniques, such as Partially Observable Markov Decision Processes (POMDPs), have been successfully employed for modeling system-environment interactions in the presence of partial observability and uncertainty. POMDPs (a reinforcement learning technique, a subset of machine learning problems) are state-based models in which the non-deterministic transitions between states and the partial observations are addressed using probability distributions (the so-called transition and emission functions and probabilities). These models are typically designed with respect to a scenario and a goal in an environment, where the goal is modeled using a reward function. Due to partial observability, the state of the model is continuously inferred and updated probabilistically using Bayesian techniques. The overall goal in a POMDP model is to find a mapping between the probabilistic states (the so-called belief) and the available plans/actions (so-called policy and policy estimation) that maximizes the collected rewards over a time-horizon, so that the system can optimally react to observations from the environment and achieve the goal while adhering to constraints posed by the system and environment.
Although POMDP models have been shown to be successful for planning and decision-making based on the probabilistic behavior of complex systems, gaps and limitations remain. For instance, POMDP models usually suffer from scalability issues due to large state-spaces and the probabilistic nature of the policy estimation problem. In addition, these models are usually designed with respect to a subset of the limited information available at the outset. This implies that the model can be exposed to new, previously unseen observations, either because of missing information or because of changes in the environment, which affects the performance of state inference and decision-making (e.g., the state is inferred incorrectly and a risky decision is made). There exists a substantial amount of research work and proposed techniques that address the scalability issues by optimizing only the policy estimation techniques to achieve near real-time performance. Examples of such techniques are heuristic algorithms, guided search, information pruning, and the use of a priori information. However,
these techniques target only the scalability issues associated with policy estimation (not model scalability) and are typically implemented to work for POMDP models with specific structures and formulations. To account for new information in the model, various techniques are employed, such as (1) modeling all possible changes and (2) transfer knowledge. The first technique models each parameter that may change as a "state variable" of a new, enlarged POMDP model and estimates the policy with regard to all possible changes; however, this technique suffers from scalability issues. Transfer knowledge, on the other hand, uses the information and experience gained from performing one task in an environment to accomplish different but similar tasks/goals in a similar environment. Although transfer-based techniques have shown promising results for deterministic problems, there is only limited work on transfer in POMDPs. Moreover, the majority of transfer knowledge techniques require the differences and similarities between environments/tasks to be known a priori, which is a strong assumption that may not hold in real-world problem domains, mainly due to the presence of unknown-unknowns (information that we don't know that we don't know).
In light of the foregoing, this thesis presents a novel modeling technique based on POMDPs that employs heuristics, machine learning, and data analysis techniques to (1) account for scalability in both model design and policy estimation so as to achieve near real-time performance, and (2) gradually adapt and refine a designed POMDP model and estimated policy with respect to new, unknown information as it is revealed by the environment. The first chapter of this thesis provides the background, motivation, research goals, and hypothesis. An overview of previous work on POMDP modeling, policy estimation, and adaptation is also provided in this chapter. The second chapter provides a detailed discussion and overview of existing POMDP modeling techniques. In addition, the "Expandable-Compact POMDP" modeling technique is presented and discussed. To account for scalability in the POMDP model design, states are defined to represent clusters/patterns of similar events obtained from the environment (instead of using distinct state variables and datapoints as states), and both desired (goal) and failure events are also modeled as states. Using distinct clusters to represent states leads to a compact model design and drastically reduces the size and complexity of the model. Moreover, it allows data analysis techniques to be applied to the available clusters to identify similar states and observations, which are later used to develop heuristics for policy estimation, adaptation, and refinement. Two different compact POMDPs are designed for lane-keeping and lane-changing (safety-critical applications of AVs) in a multi-lane freeway environment to perform safe and collision-free decision-making. In the next chapter (chapter 3), an overview of existing policy estimation algorithms and techniques, including the employed heuristics, information pruning methods, and available a priori information, is presented and discussed. To address the scalability issues associated with policy estimation, an online, adaptive policy estimation algorithm is implemented and presented that employs data- and model-driven heuristics to perform a guided search and prune unnecessary (sub-optimal) search directions, in addition to parallel processing, to achieve near real-time performance. The performance of this algorithm (in terms of computation time and policy optimality) is tested and verified by comparing its policies to those obtained using a benchmark policy estimation algorithm and a deep learning model implemented using neural networks. Chapter 4 presents a detailed discussion of available adaptation-via-transfer-learning techniques and provides a novel, two-phase adaptation and refinement algorithm and technique for Expandable-Compact POMDPs. This technique has two phases: (1) online adaptation via expansion and (2) offline post-expansion refinement. In the initial phase, new, previously unseen information is collected and accounted for by gradually adding new states and observations to the model and by using the collected data to estimate and update the underlying dynamics based on a similarity metric. In the offline phase (post-expansion), the performance of the adapted model and the collected data are analyzed using data analysis techniques to identify possible inaccuracies and make further refinements.
This technique and the proposed algorithm are employed to adapt the POMDPs designed in chapter 2 to a more complex and riskier environment, where risky behaviors such as cutting off into other lanes, sudden stopping on the lane, and closing the gap during lane-changing are expected from other vehicle agents in the environment. The performance of this technique and algorithm is compared to a policy transfer technique and the results are discussed. Finally, to evaluate and verify the proposed modeling technique (including compact and expandable POMDP model design, policy estimation, adaptation, and refinement), the adapted models and the policy estimation algorithm are integrated with the CARLA simulation environment and the results are discussed. The final chapter of this thesis (chapter 5) provides a summary and conclusion, and discusses future directions for this research.