Copyright 2019 Xiongqing Liu

Transfer Reinforcement Learning for Autonomous Collision Avoidance

By Xiongqing Liu

A Dissertation Submitted to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (MECHANICAL ENGINEERING)

August 2019

Abstract

In order to build a collision avoidance (CA) system with high resilience, robustness and adaptability, learning capability is required for the system to function in an unknown and dynamic environment. One of the major limitations of current machine learning systems is that the learning agent can only function well in its own domain and fails to solve more challenging but similar tasks, which makes it crucial to build a transfer mechanism for the system to utilize previous experience. In order to solve a complex CA task that includes random and dynamically moving obstacles, a deep reinforcement learning mechanism has been implemented first to solve simple CA tasks. A belief-based transfer reinforcement learning (TRL) approach is then proposed to efficiently balance exploration and exploitation in the target tasks given the expert guidance. The approach features two key concepts: the transfer belief, which signifies the likelihood or probability of the agent choosing the expert-suggested action (i.e., the transfer action); and the transfer period, which indicates how long the agent's decision making is influenced by the expert. Various case studies with different levels of inter-task similarity have been conducted to evaluate different values of transfer belief and transfer period. The optimal values of transfer belief and transfer period can boost learning speed and reduce learning variance in the target tasks. Different transfer paths from single or multiple source tasks have been investigated, and different forms of reward functions studied for devising different learned behaviors. The results provide designers with a deeper understanding of the collision avoidance problem and show how the proposed transfer learning approach can carry insights from previous knowledge into new tasks.

Acknowledgments

After five years of studying and researching at USC, I would like to first thank my advisor, Dr. Yan Jin, for trusting me when we first met in Shanghai, for letting me pursue the research direction that most interests me, and for supporting and guiding me whenever I felt lost. His expertise in design methodology has helped to shape my own research instincts and has always inspired me to think about how artificial intelligence could influence other educational fields. I also want to express my gratitude to the professors who served on my dissertation committee, Dr. Henryk Flashner, Dr. Geoffrey R. Shiflett, Dr. Ivan Bermejo-Moreno, and Dr. Qiang Huang, who provided plenty of valuable suggestions and inputs to my thesis.

I am grateful for the financial support from the Monohakobi Technology Institute (MTI) and Nippon Yusen Kaisha (NYK). My Ph.D. work would not have been possible without MTI's sponsored project. Thank you to the MTI/NYK team, Dr. Ando Hideyuki, Dr. Fumitaka Kimura, Capt. Koji Kutsuna, Mr. Takuya Nakashima and Mr. Ryo Yamaguchi, for your fruitful discussions, for providing real-world data to validate my research, and for supporting us in conducting the experiments on a ship simulator in Japan.

To my friends and teammates at the USC IMPACT Laboratory, it has been a great experience working with you all. To Dr. James Humann and Newsha Khani, upon whose work my research is built.
To Edwin Williams, it has been my great pleasure to collaborate with you on the same project over the past three years. To Hao Ji, Jamey Zhang, Bernice Huang and Jojo Qiu, who always shared thoughts on my research topic. And to the robotics group, Duo Ding, Jason Gong, Chuanhui Hu, Siqi Zheng, Jerry Fu, Anqi Zheng and Qi Zhang, who built the physical platform to conduct self-driving experiments. Thank you all.

Finally, I want to thank my family, especially my parents Jinhua and Yaogui. I feel blessed to have my parents' support on every decision I have made. They have taught me to be kind to others, to be humble when making achievements, and to persevere in adversity. Without their unconditional love and support, I could not have overcome every hurdle along my Ph.D. study.

Xiongqing Liu
August 1, 2019

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1. Introduction
1.1. Background
1.2. Autonomous collision avoidance
1.3. Deep learning and transfer learning
1.4. Overview
2. Related Work
2.1. Cellular self-organizing (CSO) systems
2.2. Collision avoidance and path planning
2.3. Deep (reinforcement) learning
2.4. Transfer learning
3. Belief-based TRL Approach
3.1. Overview
3.2. Deep reinforcement learning
3.3. Transfer reinforcement learning (TRL)
3.4. Transfer action
3.5. Agent learning behavior and reward design
3.6. Task complexity and similarity model
4. Case studies and results
4.1. Transfer reinforcement learning
4.1.1. Case I: High inter-task similarity (T1 → T2)
4.1.2. Case II: Low inter-task similarity (T1 → T6)
4.1.3. Case III: Medium inter-task similarity (T5 → T6)
4.2. Transfer path
4.3. Reward tuning
4.4. Summary of findings
5. Contributions and Future Work
5.1. Contributions
5.2. Future work
5.3. Final remarks
Works cited

List of Figures

Figure 1: A real-time path planning example based on RRT* algorithm
Figure 2: Autonomous collision avoidance at sea
Figure 3: AlphaGo pipeline
Figure 4: Self-organizing structure: from spider to snake
Figure 5: Screenshots of box-pushing task
Figure 6: Grid paths (Left) vs. true shortest path (Right)
Figure 7: Autonomous inverted helicopter via reinforcement learning
Figure 8: Robotic application of hindsight experience replay
Figure 9: NVIDIA's end-to-end learning approach to self-driving car
Figure 10: Effectiveness of transfer in reinforcement learning
Figure 11: The agent-environment interaction in reinforcement learning
Figure 12: Research areas
Figure 13: The standard DQN (top) and the dueling network structure (bottom)
Figure 14: ε-greedy policy (left) and ε_T-greedy policy (right)
Figure 15: Single and multiple source tasks
Figure 16: Ranking all actions' Q-values
Figure 17: Multiple policies in collision avoidance
Figure 18: Game environment
Figure 19: Rewards in an Atari game
Figure 20: Agent learning behavior
Figure 21: Types of complexity in collision avoidance
Figure 22: The complexity space
Figure 23: Different collision avoidance tasks
Figure 24: Standard scenarios for collision avoidance at sea
Figure 25: Collision avoidance game system architecture
Figure 26: Source task (left): one static obstacle; Target task (right): two static obstacles
Figure 27: Average performance of varying transfer periods
Figure 28: Different student performances under each transfer period
Figure 29: Standard deviation plot of varying transfer periods before convergence
Figure 30: Source task T1 (left): one static obstacle; Target task T6 (right): two moving obstacles
Figure 31: Performance of β₀ = 0.9 and Γ = 700k
Figure 32: Performance of β₀ = 0.9 and Γ = 300k
Figure 33: Performance of β₀ = 0.5 and Γ = 700k
Figure 34: Performance of β₀ = 0.3 and Γ = 700k
Figure 35: Performance of β₀ = 0.1 and Γ = 700k
Figure 36: Performance in the medium inter-task similarity case
Figure 37: Different transfer paths
Figure 38: Game setup using relative positions
Figure 39: Different transfer paths – single source task vs. multiple source tasks
Figure 40: Game setup
Figure 41: Reward plotting of different deviation weights
Figure 42: Average goal_reaching time under different deviation weights
Figure 43: Different learned strategies

List of Tables

Table 1: Agent actions
Table 2: Task specifications
Table 3: Case parameters
Table 4: Action space in reward tuning case study
Table 5: Choice of hyperparameters in reward tuning case study
1. Introduction

1.1. Background

As engineered systems become more and more complex, designers are facing the challenge of building a system with robustness, stability, resilience, flexibility and adaptability in order to deal with the dynamic and unpredictable task environment. The traditional design process involves three stages: functional design, conceptual design and technical design. Customer needs are first mapped into functional requirements, based on which the design parameters and ideas are generated, and finally evaluated. The design process is iterative: the designer has to constantly evaluate and modify detailed designs to match functional requirements. As such, the designer has full control and knowledge of the functionality and behavior of each component of the system, meaning that both the functions and behaviors of the engineered system rely heavily on the designer's domain knowledge.

Self-organizing systems are distributed complex systems, which are often found in the biological world, for example, the behavior of flocking birds, a school of fish, or a swarm of insects. It is surprising that thousands of insects and ants can move piles of dirt to build tunnels and form a living structure. To better understand the self-organizing behaviors in nature, a Cellular Self-Organizing (CSO) system approach has been proposed (Zouein and Jin 2010, Chiang and Jin 2011, Jin and Chen 2013, Khani and Jin 2014, Humann and Jin 2015). A CSO system is composed of many mechanical cells (mCells). Each mCell is a simple component (i.e., a physical robot) which has its own limited sensing range and whose behavior is governed by the local rules encoded in its DNA. The global behavior at the system level emerges from the local interactions among these simple mCells.
Self-organizing systems have several advantages over traditional engineered systems, such as low cost and easy manufacturability, flexibility, robustness, scalability, and resilience (Humann 2016). The agent's behavior is governed through two types of approaches: field-based regulation (Chen 2012, Humann 2015) and rule-based regulation (Khani 2014). The former is built upon an artificial potential field (including a task field that stems from the stimuli in the environment and a social field among agents in the same vicinity) and a COARM model (cohesion, avoidance, alignment, randomness, momentum). The parameters of the COARM model are optimized using genetic algorithms until the system reaches an optimal global behavior. The rule-based approach is more focused on designing social rules among multiple agents to avoid conflicts and encourage cooperation. These approaches have proved successful in many applications such as mimicking the behavior of flocking and foraging, and box-pushing. However, when the environment becomes more complex and dynamic, CSO systems need higher-precision sensing to frequently update the agent's state and field values, or a set of sophisticated rules defined by the designer to deal with all the possibilities. To make a CSO system function well in these unknown dynamic environments requires the designer to possess comprehensive domain knowledge about the environmental dynamics, which is often hard to obtain in real life. Thus, there is still a great gap between computational self-organizing systems and real-world robotic applications, such as building a self-driving system.

The self-driving industry has been growing rapidly in recent years, and fully autonomous cars are expected to become available in the early 2020s, if not earlier. Self-driving cars are potentially the first type of robot that could play a crucial role in people's everyday life. With autonomous cars driving on the road, whether local streets or freeways, the traffic can be treated as a self-organizing system that achieves autonomous collision avoidance among different agents. A self-driving system is considered more engineered in the sense that each agent (cell) is equipped with more advanced hardware with various sensors, and more intelligence, which is at least able to (a) process human-level perception, i.e., detecting lanes, traffic signs, pedestrians, other vehicles/obstacles, etc.; (b) predict trajectories of pedestrians and other vehicles; (c) deal with complex social interactions with other vehicles and pedestrians; and (d) perform high-level decision making and path planning. These kinds of intelligence require learning capabilities in the agent. It is worth noting that adding learning capabilities to the agent does not necessarily increase cost or hurt manufacturability, because learning can take place off-line. Deploying the learned result online, usually in the form of a neural network, does not require much computational power. A self-organizing system with learning capabilities is believed to possess more adaptability and flexibility towards dynamic and unknown environments.

1.2. Autonomous collision avoidance

Collision avoidance, which has been widely studied in many industrial fields over the years, such as robotics, transportation, and artificial intelligence, is an indispensable component of self-driving systems.
In the area of robotics, research has focused on issues related to how vehicle robots avoid obstacles as well as each other (Brunn, 1996; Alonso-Mora, 2013; Shiomi et al, 2014) and how assembly robots, or manipulators, avoid interference among their own arms or with those of others (Hourtash et al, 2016; Hameed & Hasan, 2014). Another major field where collision avoidance represents a major problem is transportation. Self-driving cars must be able to avoid obstacles and other vehicles in various situations (Mukhtar et al, 2015). In the shipping industry, collision avoidance can be highly difficult when water areas become congested, because the large inertia of ships makes them hard to maneuver when movement is needed (Goerlandt & Kujala, 2014). Once a collision happens at sea, the loss can be tremendous (Eleftheria et al, 2016). Airplane collision avoidance (Zou, 2016) and even collision with debris in space (Casanova et al, 2014) have become issues due to the increasing level of congestion.

Different approaches have been proposed to solve collision avoidance problems, which can be divided into two large categories: vehicle control system development and traffic system development. Vehicle control can be further categorized into the dynamical systems approach (e.g., Machado et al, 2016), which relies on traditional control theories, and the intelligent systems approach (e.g., Yang et al, 2017), which applies knowledge systems and machine learning techniques. While the dynamical systems approach can be effectively applied in mostly predictable circumstances, the intelligent systems approach is needed when the uncertainty level becomes high and exceptions happen. Traditional knowledge-based systems have been applied to collision avoidance (Jin and Koyama, 1987). However, the issues of knowledge acquisition, formalization and management have remained practically challenging.

Path planning is a class of algorithms that aims to find collision-free trajectories so that a robot (or robotic arm) can reach the goal location as fast as possible, including the A* search algorithm (Hart, Nilsson and Raphael 1968), the dynamic window approach (Fox et al., 1997; Brock and Khatib, 1999), Rapidly-exploring Random Trees, or RRT (La Valle, 1998), the gradient-based method (Konolige, 2000), etc. These algorithms can deal very well with static obstacles, whereas real-time path planning with moving obstacles remains challenging. A survey of current approaches to real-time path planning can be found in (Ragstgoo et al. 2014); they can be categorized into tree-based, graph-based, and potential-based. These algorithms have proved successful mostly in car-like computer games (Figure 1).

Figure 1: A real-time path planning example based on RRT* algorithm

In recent years, the development of autonomous vehicles, especially autonomous cars, has been a common topic reported almost daily in newspapers and television programs, thanks to advances in control systems and the proliferation of machine learning techniques. Our research on collision avoidance for ships at sea, named the AutonoShip project, aims at developing technologies that will eventually lead to future autonomous ships that can not only steer in open waters and less congested areas but are also capable of avoiding collisions in congested harbors and coming alongside berths without direct human involvement.
When it comes to autonomous shipping, the traditional path planning and collision avoidance algorithms often fail because (a) the environment is dynamic, with other moving vehicles and complex geographic constraints; (b) the possible future locations that the agent can visit are limited by its own dynamics (very large inertia, etc.); and (c) collisions at sea are often caused by bad maneuvers made 1 or 2 minutes earlier, which makes the search space extremely large for a search algorithm. As a result, a more synthetic and intelligent system is needed. In order to achieve this long-term goal, three research and development stages have been planned: technology exploration (investigate various technology capabilities and application possibilities, our current stage), human decision support (use the autonomous technologies as human decision aids to reduce human errors), and full autonomy (ships steer autonomously). To meet the challenges of this highly complex topic, we identified three research directions. They are:

• AutonoShip-S(ituation): A risk and situation modeling based stochastic computational approach to develop computational situation assessment and decision support apparatuses that are based on rigorous mathematical models and real-world data to perform autonomous decision-making (Williams and Jin, 2018a, 2018b).
• AutonoShip-H(uman): A human steering data based supervised learning approach to capture human knowledge for steering ships in congested areas (Jin, et al, 2018).
• AutonoShip-Z(ero): A reinforcement learning based approach to generating ship steering knowledge in the form of neural networks through extensive computer simulation-based training (Liu and Jin, 2018a, 2018b).

From a system development point of view, the development of AutonoShip-Z must satisfy two basic requirements. First, since the training process of reinforcement learning ship agents relies on computer simulations, the simulation environment must be physically compatible with the real-world environment, meaning that both ship steering dynamics and encounter situations must be realistic enough that the neural networks trained and tested in the simulations can be directly applied by real ships. Our research sponsors have provided the needed ship dynamics models and standard (for human ship master training) ship collision avoidance encounters, and the experiments with the real on-board ship control equipment have yielded satisfactory results (MTI, 2018, Figure 2). In October 2018, our research team conducted several experiments on ship simulators with the support of MTI. Standard scenarios were trained first, covering situations such as crossing, head-on, overtaking, and combinations of these. The ship simulator includes real water dynamics, which provides accurate estimates of ship positions and headings, and sends all sensor information (GPS, AIS, radar, etc.) to the computer agent, which in turn calculates an output command (speed, course angle) using either the reinforcement learning approach (Figure 2, left) or the risk modeling and probabilistic approach (Figure 2, right). As can be seen in the top part of Figure 2, the two approaches produce similar but slightly different trajectories, both of which are acceptable from the MTI captain's point of view. In this particular situation, reinforcement learning tends to take a larger left turn initially while risk modeling gradually adjusts its heading.

Figure 2: Autonomous collision avoidance at sea

The second requirement relates to ship behavior design.
Although, through deep reinforcement learning, ship agents can learn their optimal policies given the state and action spaces and the state transition and reward functions, the learned neural network is a black box that provides little identifiable knowledge about agent action mechanisms. To make agents' behavior more transparent, it is important to understand how the agents' behavior changes in response to changes of the reward function. Furthermore, it is common for a reinforcement learning agent to converge to different "optimal" policies with different random initializations. Such learning variance should be reduced.

1.3. Deep learning and transfer learning

Considering that traditional collision avoidance algorithms suffer when the environment becomes quite dynamic, with random obstacles and other vehicles moving at certain or random speeds and directions, it is necessary to build a system with learning capabilities that can efficiently learn from its own collision mistakes and past experience. The recent progress in machine learning, especially deep learning (LeCun et al, 2015), has opened ways to develop systems that can learn from humans' operation experiences (e.g., through supervised deep learning) and from machines' own experiences (e.g., through reinforcement learning), where the inputs are only high-dimensional image pixels. The reinforcement learning approach allows an agent to learn from its past experience. By interacting with the environment, the agent learns to make sequential decisions to maximize the total future reward. Recognizing that learning capability is a missing component in current CSO systems and could make the system more flexible and adaptable to the dynamic environment, this thesis applies current state-of-the-art deep learning approaches to solve the problem of collision avoidance, and investigates how to apply previously learned knowledge to a new task environment using transfer learning.

In the 2010s, deep learning has achieved tremendous breakthroughs in various fields, such as computer vision and speech recognition. The word "deep" refers to the multiple hidden layers of a neural network. In addition to the classic feed-forward multi-layer perceptron network, CNNs and RNNs are the most popular network structures. A CNN is good at extracting local spatial information from an image by applying a filter and scanning the filter all over the image to learn features. An RNN is good at sequence modeling where time-series input matters. The game of Go is a board game that originated in China more than 2,500 years ago and is considered one of the most challenging board games. AlphaGo utilized deep convolutional neural networks (CNNs) and beat Lee Sedol in March 2016, becoming the first computer Go program to beat a 9-dan professional. In the AlphaGo project (Silver, 2016), the agent first collects a huge amount of human expert playing experience and learns from it through supervised learning, and the behavioral policy is improved further through self-play using reinforcement learning. The search algorithm of AlphaGo combines the deep neural networks with Monte-Carlo tree search (MCTS), which evaluates the values of different moves by running many game simulations.
Figure 3: AlphaGo pipeline

The major concern and limitation is whether the AlphaGo approach can generalize to other machine learning fields and applications, such as playing video games, teaching a computer to drive a car, maneuvering a robotic arm, etc. The game of Go has many unique characteristics that AlphaGo benefits from. First, the environment is deterministic, with no noise involved. One can predict the exact effect of taking a certain action. Second, it is easy to generate simulation experiences using the online Go simulator, which is not always available in other real-world cases. In addition, many datasets of human plays are available online, so the RL agent does not have to start from scratch. Third, the reward function is straightforward and binary: win or lose. In other RL problems, the reward function can be extremely difficult to design so that it guides the agent to make progress.

One common observation about current deep learning systems, including AlphaGo, is that they can only function well within the narrow domain of the tasks that they are trained for. It is still extremely difficult to apply the success in simulations (video games) to other similar domains and tasks, or to real-world applications. This observation manifests the limited level of "intelligence" of current systems, and also leads to a research question: how to deal with the exploration-exploitation trade-off in a new (complex) task setting, given previously learned knowledge from simple tasks? In his seminal paper, March (1991) examined organizational learning in humans and presented various features of, and relationships between, the essences of human organizational learning: exploration of new possibilities and exploitation of old certainties. Allocating resources to these two capabilities represents the adaptiveness of the human organization. Based on this insight, a machine's intelligence can be considered as composed of the machine's capabilities of exploration, exploitation, and its ability to regulate the "resource" allocation between the two. Performing too much exploration prevents the agent from exploiting the knowledge it has learned. On the other hand, too little exploration will make the agent miss the chance of discovering alternative actions that could potentially bring higher future rewards. This basic idea has been implemented in our research at two different layers. First, the reinforcement learning itself is based on the exploration-exploitation of the learned knowledge (i.e., a learner's current neural network) and random choices. Second, the transfer learning allows the agent to exploit the previously learned experience (i.e., an expert's neural network obtained from the previous task context) and explore the new task context through learning and exploration.
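As a rough illustration of this two-layer regulation, the sketch below shows an action-selection routine in which a learner both explores randomly against its own Q-network (the usual ε-greedy rule) and, early in training, sometimes defers to an expert network carried over from a previous task. This is only a minimal sketch of the general idea, not the mechanism used in this thesis (the transfer belief and transfer period are defined in Chapter 3); the function names, the linear decay schedule, and the probability expert_prob are illustrative assumptions.

import random

def select_action(state, learner_q, expert_q, actions, epsilon, expert_prob):
    """Two-layer exploration/exploitation (illustrative sketch).

    learner_q(state, a) -> learner's current Q-value estimate
    expert_q(state, a)  -> Q-value from a network learned on a source task
    epsilon             -> probability of a purely random exploratory action
    expert_prob         -> probability of deferring to the expert's suggestion
    """
    # Layer 1: explore the environment vs. exploit the learner's own knowledge.
    if random.random() < epsilon:
        return random.choice(actions)
    # Layer 2: exploit previously learned (source-task) experience.
    if random.random() < expert_prob:
        return max(actions, key=lambda a: expert_q(state, a))
    # Otherwise exploit the learner's own current network.
    return max(actions, key=lambda a: learner_q(state, a))

def decay(value, step, total_steps, final=0.0):
    # Linearly anneal a probability toward `final` over `total_steps`.
    frac = min(step / float(total_steps), 1.0)
    return value + frac * (final - value)

Annealing expert_prob toward zero over a fixed number of steps mirrors the intuition that the expert's influence should fade as the learner gathers its own experience in the new task.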
Deep reinforcement learning (DRL) is the study of reinforcement learning using deep neural networks as function approximators. The idea of combining reinforcement learning with neural networks can be traced back to the early 1990s (Tesauro 1995), but at that time the approach did not outperform other function approximators, for example, linear combinations of features, decision trees and so on. With the huge advancement in hardware (i.e., GPU computing), DRL took off in the early 2010s and achieved astonishing results, especially in playing video games. The idea of DRL was originally proposed by (Silver, 2013), where a neural network composed of several convolutional layers is able to learn to play a series of Atari games using the same network structure and the same set of hyperparameters, at the level of human players or even better. Considering that the input to the neural network is only an array of pixel values of the game window, some tricks have to be used in order to make the learning successful and stable, such as experience replay and a target network. The result is a general learning framework that can play a series of Atari games as professionally as humans, with "general" meaning that the same network structure with the same hyperparameters can deal with different Atari video games. This "general" learning framework appears appealing at first, yet is quite limited to its own domain, i.e., Atari games. The question then is: how can one apply DRL's recent success in video games to make a real robot work, say ride a bicycle, drive a car, maneuver a ship, etc.? Challenging as it is, no learning algorithm can be completely general, at least for now. The focus of this thesis is not to build a general machine learning mechanism that can deal with a series of tasks, but rather to investigate how to link a difficult task to some simple tasks and apply the experience learned from the simple tasks in the learning process for the difficult task.

So far there has been only a little research focusing on transfer and reinforcement learning, mostly relying on copying the whole network for initialization or reusing the first few layers of the network. While this is effective in transferring learned features in image-related tasks, such as image recognition, it is still unclear how it can be translated to the area of reinforcement learning, i.e., what to transfer and how it may work. So far reinforcement learning (RL) is not as popular as other machine learning paradigms like supervised learning and unsupervised learning, which have already achieved tremendous success in many fields ranging from research and industry to business. RL is mostly applied to playing video games but is rarely seen solving a complex robotic task, partly because real-world experience is always hard to collect and the environmental dynamics in simulations differ from the real world. The knowledge and experience gained from the simulations cannot be successfully transferred to a real robot. This research investigates how transfer learning (Pan and Yang, 2010), combined with deep reinforcement learning, can be applied to allow agents to exploit and explore in a more complex task context. This ability to synthesize and expand knowledge from different sources and domains can revive the RL framework in more complex tasks. The long-term goal of this research is to develop an integrated transfer reinforcement learning (TRL) technique that allows agents to learn from multiple (source) task domains and exploit the learned knowledge in new (target) task contexts for more effective learning and better task performance. It is believed that this knowledge transfer process is crucial to reinforcement learning and potentially helps close the gap between machine learning and human learning. The proposed TRL approach is tested in a collision avoidance video game created in Pygame, where the vehicle dynamics and the environmental dynamics and uncertainties can be varied.
1.4. Overview

This research follows a line of previous work by our research group on CSO systems, with the goal of adding learning capability to the agent. Section 1 describes the background of this research and covers the main topics of this thesis. Section 2 relates this research to other work in the areas of CSO systems, collision avoidance, deep reinforcement learning (DRL) and transfer learning. Section 3 summarizes the DRL approach on which this research is built and introduces the transfer reinforcement learning (TRL) framework and key concepts such as transfer belief, transfer period and transfer action. The inter-task similarity between different tasks and different transfer paths are also discussed in detail. Section 4 presents the case studies together with the experimental results and analyzes the effects of the proposed TRL framework. Section 5 summarizes the contributions of this research and points to future research directions.

2. Related Work

2.1. Cellular self-organizing (CSO) systems

The USC IMPACT Lab has conducted research on CSO systems since 2007. The goal is to build a multi-agent complex system with adaptability to the dynamic environment and to gain insights from the emergent behaviors of self-organization in nature. It is also expected that insights from CSO research can open new ways for designing engineering systems. Self-organizing systems have great potential to be deployed in military missions where the environment is hazardous, as in search and rescue tasks. Other applications include autonomous navigation, robotics, and self-organizing assembly, where a group of robots works collaboratively to assemble components into a system configuration.

Zouein et al. (2010) started the CSO work by introducing the concept of dDNA, which is encoded in every agent and contains information on how to form a global shape, such as a "spider" or "snake". The system was able to adapt its own shape to the dynamically changing requirements from the environmental stimuli. For instance, when a group of agents enters a narrow tunnel, they can reconfigure their shape into a snake and navigate through the surrounding obstacles. It was demonstrated that by modifying dDNA, the mechanical self-organizing systems could grow in a way similar to biological systems.

Figure 4: Self-organizing structure: from spider to snake

Chiang (2012) focused on understanding how global behavior could be mapped into local interactions of agents, and how to design the local interactions to fulfill the desired emergent behavior. Chiang proposed a parameterized approach based on the COARM model, which is encoded inside the dDNA of each agent. The proposed approach was tested in a sample artificial system with homogeneous programmed agents, which have no global awareness or central controller and can only sense the agents nearby. Each parameter of the COARM model is given as follows:

1. Cohesion: step toward the center of neighboring agents
2. Avoidance: step away from agents that are too close
3. Alignment: step in the direction that aligns with neighboring agents' heading
4. Randomness: step in a random direction
5. Momentum: step in the same direction as in the previous time-step.

Chiang tested different combinations of these five parameters and recorded the global behavior. Different flocking patterns are obtained by altering the parameter values. The results are stored in a look-up table so that they can be applied to design the desired global behavior (a code sketch of this weighted combination follows below).
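To make the parameterized idea concrete, the following sketch shows one plausible way an mCell could turn its neighbors' positions and headings into a single step direction as a weighted sum of the five COARM terms. It is an illustrative reading of the model, not Chiang's actual implementation; the vector form of each term, the function name coarm_step, and the normalization choices are assumptions.

import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 1e-9 else np.zeros_like(v)

def coarm_step(pos, heading, neighbors, weights, avoid_radius=1.0):
    """One illustrative COARM update for a single agent.

    pos, heading : 2-D numpy arrays (current position, unit heading)
    neighbors    : list of (position, unit heading) tuples within sensing range
    weights      : dict with keys 'C', 'O', 'A', 'R', 'M'
    Returns a unit vector giving the agent's next step direction.
    """
    if neighbors:
        positions = np.array([p for p, _ in neighbors])
        headings = np.array([h for _, h in neighbors])
        cohesion = unit(positions.mean(axis=0) - pos)        # toward neighbor center
        too_close = positions[np.linalg.norm(positions - pos, axis=1) < avoid_radius]
        avoidance = unit(pos - too_close.mean(axis=0)) if len(too_close) else np.zeros(2)
        alignment = unit(headings.mean(axis=0))              # match neighbor headings
    else:
        cohesion = avoidance = alignment = np.zeros(2)
    randomness = unit(np.random.uniform(-1, 1, size=2))      # random perturbation
    momentum = heading                                       # keep current heading

    step = (weights['C'] * cohesion + weights['O'] * avoidance +
            weights['A'] * alignment + weights['R'] * randomness +
            weights['M'] * momentum)
    return unit(step)

In the CSO work the five weights themselves are what the genetic algorithm searches over, so a routine of this kind would sit inside the GA's fitness evaluation loop.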
Humann (2014) expanded Chiang's work by applying genetic algorithms (GA) to the COARM model. He introduced a two-field-based behavioral regulation: a task field and a social field. The task field represents the agent's perception of the environmental stimuli, such as terrain shape, repulsion from obstacles, and attraction from the system's goal. The social field models the inter-agent relationships and communications. Taking the foraging task as an example, the task field and the social field (Eq. 1) are calculated as weighted combinations of the attractions toward the food and home orientations and of the cohesion, avoidance, and alignment contributions summed over the N neighboring agents, where r_i is the distance from the agent to its neighbor i; θ is the agent's current heading direction; θ_i is the neighbor's current heading direction; s_max is the agent's maximum step-size; θ_f is the food orientation; θ_h is the home orientation; and C, O, A, F, H are the design parameters stored in the dDNA that are optimized through the GA. Humann (2014) also successfully applied the design ontology and synthesis approach to other case studies, such as flocking, protective convoy and box-pushing. There are slight differences in calculating the task field and social field in each of these cases, but the social fields are all variants of the COARM model and the task field depends on the specific task. It was shown that the proposed approach is able to build a self-organizing system with resilience and adaptability.

Khani (2014) introduced social rule-based regulation to the design of CSO systems. As constraints and uncertainties arise in the environment, social rules and relations are needed to minimize the conflicts among agents and encourage cooperation. Khani developed a task complexity model based on three components: action and action-relation complexity, object and object-relation complexity, and dynamic complexity. More complex tasks require a system with more complexity. Khani investigated how the population size, the specificity of rules and the rule adoption rate could influence the task performance in the box-pushing task (Figure 5), which, if not carefully designed, would make the system less efficient.

Figure 5: Screenshots of box-pushing task

One limitation of previous work on CSO systems is that it lacks interactions between the agent and the environment. The environment is assumed to be static, and a global map is needed for the agent to generate a task field. Moreover, the proposed approaches are task-dependent. For example, the parametric approach which uses GA to optimize parameters may fail in a new task; the rule-based approach always requires comprehensive designer knowledge of the complexity and conflicts among the agents.

2.2. Collision avoidance and path planning

The traditional practice to achieve real-time obstacle avoidance was to create an artificial potential field (Khatib, 1986), which has been widely implemented in many applications. Fahimi (2008) implemented harmonic potential functions and the panel method to address the multi-robot obstacle avoidance problem in the presence of both static and dynamic obstacles. Each robot considered other robots as moving obstacles. In Fahimi's work, it was assumed that the positions (and moving trajectories) of the obstacles were known and that there existed a ground station (communication center) which calculated the potential gradient and computed a collision-free path for every robot.
Mastellone et al. (2008) designed a controller for collision avoidance based on a Lyapunov-type approach which guaranteed coordinated tracking with bounded error and collision avoidance for a group of robots. Each robot was aware of its own position and able to detect other objects within a certain range. The controller is able to act in real time and is solely based on locally defined potential functions, with no global knowledge. The proposed approach was applied to formation control for multi-agent systems. Mastellone also demonstrated the robustness of the system when the communication between robots was unreliable.

Most current industrial path planning applications are based on graph search strategies, such as the A* and RRT algorithms. The A* algorithm was originally proposed by Hart, Nilsson and Raphael (1968), and has been widely used in robotics and video games. The problem is formulated in a discretized grid-world with some blocked cells serving as obstacles. The objective is to find the shortest path from the start vertex to the goal vertex. A heuristic is used to guide the search process, which is assumed to be admissible (i.e., it never overestimates the cost-to-go). For every vertex s, the following values are stored:

• The g-value g(s): the length of the shortest path from the start vertex to vertex s found so far.
• The h-value h(s): an estimate of the goal distance of vertex s. The f-value, f(s) = g(s) + h(s), is an estimate of the length of a shortest path from the start vertex through s to the goal vertex.
• The parent parent(s): used to extract a path from the start vertex to the goal vertex.

The A* algorithm relies on a pre-known map or graph which includes all the obstacle positions (blocked cells); thus the search process has to be repeated from scratch every time the map or graph is updated through sensory input, which can be very time consuming considering that the optimal path does not change that much. Many algorithms have been designed to efficiently re-compute the shortest path when the graph is updated dynamically, such as D* (Stentz, 1994), Focussed D* (Stentz, 1995), and D* Lite (Koenig and Likhachev, 2002). Daniel et al. (2010) claimed that the grid paths (i.e., paths constrained to grid edges) that A* found were not always the true shortest paths (Figure 6). Thus, they developed Theta* algorithms, which are able to propagate information along grid edges and find any-angle paths without constraining the path headings (Figure 6). However, Theta* is not guaranteed to find the true shortest path.

Figure 6: Grid paths (Left) vs. true shortest path (Right)

The RRT algorithm (Rapidly-exploring Random Trees) was proposed by La Valle (1998) to find feasible trajectories for high-dimensional non-holonomic systems. RRT includes a NEAREST_NEIGHBOR function which returns the nearest neighbor x_near of a random state x_rand, and then applies a control input u that minimizes the distance from x_near to x_rand. The new state, x_new, is added to the vertex set, and an edge from x_near to x_new is stored together with the input u. Many algorithms are built upon the RRT algorithm. For example, Karaman and Frazzoli (2010) claimed that RRT converges to a suboptimal solution and introduced RRT*, which has better asymptotic optimality under sufficient conditions. Otte and Frazzoli (2014) introduced RRTX, which focuses on re-planning in real time when the obstacle region changes in order to navigate in a dynamic environment.
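For reference, a minimal grid A* in the form described above might look like the sketch below. It is a generic textbook version, not code from this thesis or from the cited papers; the grid representation, the Manhattan heuristic, and the 4-connected neighborhood are assumptions made for the example.

import heapq

def astar(grid, start, goal):
    """Minimal A* on a 4-connected grid; grid[r][c] == 1 marks a blocked cell."""
    rows, cols = len(grid), len(grid[0])
    h = lambda s: abs(s[0] - goal[0]) + abs(s[1] - goal[1])  # admissible Manhattan heuristic
    g = {start: 0}
    parent = {start: None}
    open_list = [(h(start), start)]                          # entries are (f-value, vertex)
    closed = set()
    while open_list:
        _, s = heapq.heappop(open_list)
        if s == goal:                                        # reconstruct path via parent pointers
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        if s in closed:
            continue
        closed.add(s)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (s[0] + dr, s[1] + dc)
            if 0 <= n[0] < rows and 0 <= n[1] < cols and grid[n[0]][n[1]] == 0:
                if n not in g or g[s] + 1 < g[n]:
                    g[n] = g[s] + 1                          # g-value: best cost found so far
                    parent[n] = s
                    heapq.heappush(open_list, (g[n] + h(n), n))  # f = g + h
    return None                                              # no path exists

As the surrounding text notes, this batch formulation has to be re-run whenever the map changes, which is exactly the inefficiency that D*, D* Lite, and the Theta* variants try to remove.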
2.3. Deep (reinforcement) learning

Deep learning has achieved tremendous success in various areas such as image recognition (Krizhevsky et al., 2012; Le et al., 2012), speech recognition (Hinton et al., 2012), automatic game playing (Mnih et al., 2013) and self-driving (Bojarski et al., 2016). Deep learning algorithms can extract high-level features by utilizing deep neural networks, such as convolutional neural networks (CNNs) (Krizhevsky et al., 2012), multi-layer perceptrons and recurrent neural networks (RNNs) (LeCun, 2015). Scaling up deep learning algorithms makes it possible to discover high-level features in a complex task. Dean et al. (2012) constructed a very large system which was able to train 1 billion parameters using 16,000 CPU cores. Coates et al. (2013) scaled to networks with over 11 billion parameters using a cluster of GPU servers.

Reinforcement learning (RL) is a machine learning paradigm in which a learning agent constantly interacts with its surrounding environment. At each time step, the agent observes its current state s_t in the environment, takes some action a_t, and then receives a scalar reward r_{t+1}. This one-step reward r_{t+1} only evaluates how good the previous action is, but the goal of the learning agent is to find a policy that can maximize its total future reward in the long term given its current state. With the rapid development of GPU cards, it is now possible to train a deep neural network as the function approximator in the reinforcement learning domain. Mnih et al. (2013) introduced a deep reinforcement learning (DRL) algorithm using experience replay and deep Q-networks (DQN) to learn a Q function, which is able to play various Atari 2600 games better than human players. Experience replay allows an online learning agent to randomly sample batches from its past experiences to update state-action values, thus breaking the correlations between consecutive frames. By combining supervised learning and reinforcement learning, the group at Google DeepMind further proved that their deep learning algorithm can beat a world champion in the most challenging classic game, Go (Silver et al., 2016), which has an extremely large number of possible configurations and whose board positions are difficult to evaluate. Schaul et al. (2016) further developed a prioritized experience replay framework to sample more important transitions and learn more efficiently.
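The core of such a DQN-style learner is a single update step that samples a batch from the replay buffer and regresses the online network toward targets computed with a periodically synchronized target network. The sketch below illustrates that step; it is a generic, simplified version written with PyTorch, not the implementation used in this thesis, and the network sizes, hyperparameters, and function names are assumptions.

import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, x):
        return self.net(x)

replay = deque(maxlen=100_000)   # experience replay buffer of (s, a, r, s', done) tuples

def dqn_update(q_net, target_net, optimizer, batch_size=32, gamma=0.99):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)          # break frame-to-frame correlation
    s, a, r, s2, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s2 = torch.tensor(s2, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)            # Q(s, a) from the online network
    with torch.no_grad():                              # targets use the frozen target network
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few thousand steps the target network is synchronized:
#     target_net.load_state_dict(q_net.state_dict())

Keeping the target network fixed between synchronizations is, together with experience replay, one of the two "tricks" that the surrounding text credits for stabilizing DQN training.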
Through trial and error, RL is the type of algorithm that most resembles the way humans and other animals learn to solve a task, compared with supervised and unsupervised learning algorithms. But so far there has been only a limited number of RL applications in the robotics domain (Figure 7), for the following reasons.

Figure 7: Autonomous inverted helicopter via reinforcement learning

First, the robotics world is intrinsically high dimensional and usually involves continuous state and action spaces. RL algorithms can typically be categorized into value-based approaches (DQN) and policy-based approaches. One limitation of the DQN algorithm is that it can only handle a discrete action space. Lillicrap et al. (2016) proposed the deep deterministic policy gradient (DDPG) algorithm, which can learn policies in a high-dimensional, continuous action space. The DDPG algorithm, inspired by the DQN algorithm's use of experience replay and a target network to stabilize the learning process, uses the actor-critic approach based on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014). Mnih et al. (2016) further introduced the asynchronous advantage actor-critic (A3C), where multiple agents run on multiple instances of the environment at the same time. Each agent has its own set of network weights and its experience is independent of the others'. A3C algorithms estimate both the value function (critic) and the policy (actor) and have been shown to solve various continuous tasks using only a multi-core CPU.

Second, the reward is often sparse and difficult to design. The agent's learned behavior is governed by the reward function. The designer needs to tune the reward function to achieve desired agent behaviors, which is often referred to as reward engineering or reward shaping and requires the designer's domain knowledge. Unlike Atari games, where a score is a suitable measure for evaluating the performance of a player, real-world robotic tasks often lack a good reward function. Without a good reward function, good behaviors do not necessarily get encouraged through higher rewards and the agent easily gets stuck in a sub-optimal policy. To deal with the problem of sparse reward, Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER) to increase sample efficiency. Suppose an agent starts from an initial state s_0 and is trying to reach a goal state s_g, but ends up in another state s_T at the end of the episode. The reward function is set to be binary, 0 or 1. The HER algorithm imagines that if the goal state had been s_T, then this experience would become a successful one and the agent could receive a +1 reward, which increases the chance of the agent getting the positive reward. The HER algorithm has been implemented with a robotic arm to push, slide, and pick-and-place objects (Figure 8).

Figure 8: Robotic application of hindsight experience replay

Third, since real-world experience is difficult and expensive to produce, most RL algorithms are first trained in a simulated environment. The lack of real-world training samples and experiences makes the computer agent perform poorly in the real world. The ultimate convergence to an optimal policy requires a thorough exploration of the environment. Due to the different dynamics between the simulated and real worlds, a computer agent which can function perfectly in simulations is not guaranteed to perform well in the real world. Chen (2016) developed a decentralized multi-agent collision avoidance algorithm based on deep reinforcement learning. Two agents were simulated to navigate toward their own goal positions and learn a value network which encodes the expected time to goal, and the solution was then generalized to multi-agent scenarios. It was also shown that the agent was able to avoid collisions with a non-cooperative agent.

Deep learning algorithms have been successful in achieving end-to-end learning. Dieleman and Schrauwen (2014) investigated whether it is possible to apply feature learning directly to raw audio signals by training convolutional neural networks in an automatic tagging task. Traditionally, content-based music information retrieval tasks are resolved based on engineered features and shallow processing architectures, which rely on mid-level representations of music audio, e.g., spectrograms.
The results showed that even though the end-to-end learning does not outperform the spectrogram-based approach, the system is able to automatically learn frequency decompositions and feature representations from raw audio.

Self-driving has taken off in the last several years and relies heavily on advances in deep learning, which serves as an indispensable part of tasks such as perception and path planning. Traditional approaches are mostly based on computer vision techniques that extract hand-coded features from images and videos. The breakthroughs of convolutional neural networks (CNNs) in recent years demonstrate that these features can be learned automatically. A group at NVIDIA proposed an end-to-end learning approach for self-driving (Bojarski et al., 2016). A convolutional neural network is trained to map raw pixels of the camera input directly to steering commands. These images are obtained by three cameras mounted on real cars. The human steering angle is recorded together with the camera images at each time step (Figure 9). The driving data cover various road, lighting and weather conditions, and the data were augmented through random shifting, flipping, etc. to increase learning efficiency. The weights of the network are updated to minimize the error between the human steering angle and the network's output using back-propagation. The system automatically learns internal processing steps such as detecting useful road features with only the human steering angle as the training signal. It is shown that CNNs alone can learn the entire task of lane following without any human intervention. This end-to-end learning approach is challenging in that it requires a huge amount of training data; its advantage is that it eliminates the dependence on the designer's prior domain knowledge.

Figure 9: NVIDIA's end-to-end learning approach to self-driving cars

2.4 Transfer learning

Given a complex task which is difficult to learn directly, transfer learning is a commonly used technique that generalizes previously learned experience and applies it to new tasks. Transfer learning refers to utilizing knowledge gained from source tasks to solve a target task. It is believed that in a reinforcement learning context, transfer learning can speed up the agent's learning of a new but related task (i.e., the target task) by learning source tasks first. Taylor and Stone (2007) introduced a transfer algorithm called Rule Transfer, which summarizes the source task policy, modifies the decision list, and generates a policy for the target task. Rule learning is well understood and produces human-readable policies. The agent benefits from the decision list initially and continues to refine its policy through target task training. It was shown that Rule Transfer could significantly improve learning in robot soccer using a policy learned from a grid-world task. Fernandez and Veloso (2006) proposed two algorithms to address the challenges of Policy Reuse in a reinforcement learning agent. The major components include an exploration strategy and a similarity function to estimate the similarity between past policies and new ones. The PRQ-learning algorithm probabilistically biases an exploration process using a Policy Library. In the second algorithm, called PLPR, the Policy Library is created while learning new policies and reusing past policies.
Torrey (2006) introduced inductive logic programming for analyzing previous experience in the source task and transferring rules for when to take actions. Through an advice-taking algorithm, the target task learner could benefit from outside imperfect guidance. A system called AI2 (Advice via Induction and Instruction) for transfer learning in reinforcement learning was built, which creates relational transfer advice using inductive logic programming. Based on a human-provided mapping from source tasks to target tasks, the system was able to speed up reinforcement learning.

When utilizing deep neural networks in transfer learning, a base network is first trained on a base dataset for a source task, and the learned features are then transferred to a target network to be trained on a target dataset and task. If the target dataset is much smaller than the source dataset, transfer learning is particularly beneficial for training a large target network. One common practice is to copy the first n layers of the base network to the first n layers of the target network, while the remaining layers of the target network are randomly initialized and trained. Yosinski (2014) presented a way to determine whether a certain layer is general or specific. It was found that initializing a network with transferred features from almost any layer could boost performance after fine-tuning on a new dataset. Ding et al. (2016) proposed a task-driven deep transfer learning framework for image classification, where the features and classifiers are obtained at the same time. Through pseudo labels for the target domain, the system could transfer more discriminative information to the target domain.

In their survey paper, Taylor and Stone (2009) summarized several metrics to measure the effectiveness of transfer in reinforcement learning, including:
1. Jumpstart: The initial performance of the agent may be improved in the target task.
2. Asymptotic Performance: The agent's final performance (reward) may be better in the target task.
3. Total Reward: The total accumulated reward through the learning process can be higher in the transfer case.
4. Transfer Ratio: The ratio of the total accumulated reward of the agent with transfer to the total accumulated reward of the agent without transfer.
5. Time to Threshold: The learning time needed by the agent to reach a certain performance level may be reduced.

Figure 10: Effectiveness of transfer in reinforcement learning

In this research, the measures used to evaluate transfer effectiveness are not limited to the ones mentioned above, but also include one important factor – learning variance. Learning variance is a common phenomenon in reinforcement learning. Running the same learning algorithm multiple times usually results in different sub-optimal policies, and the learning agent may never jump out of the local minima.

Parisotto et al. (2016) proposed a transfer reinforcement learning approach (Actor-Mimic) where an intelligent agent is trained on multiple tasks simultaneously and generalizes the learned knowledge to new domains. The proposed approach was tested on a series of Atari games and can speed up learning in new tasks without prior expert guidance. The paper assumes that the Atari games chosen for transfer learning have to be similar, but it does not provide a quantitative measure for determining the inter-task similarity.
The authors introduced two objective functions – a policy regression objective and a feature regression objective – to train a single policy network based on the insights of several experts, and to mimic expert decisions for multi-task learning. The policy regression objective function adopts the concept of policy distillation (Hinton et al., 2015), which compresses knowledge from an ensemble of different models into a single smaller model. The output layer of the network converts the logit $z_i$ of each class into a probability $q_i$:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \qquad \text{(Eq. 2)}$$

where $T$ is the temperature. A higher temperature produces a softer distribution over the different classes, which contains more information from the cumbersome ensemble model than the true hard targets.

Liu and Jin (2018a) proposed a transfer reinforcement learning approach that involves the two key concepts of transfer belief and transfer period. The proposed approach was tested in a game simulator for collision avoidance where only static cases with high inter-task similarity were considered, and it was shown that transfer learning could boost learning as well as introduce variance into the student's learning process. In another publication, Liu and Jin (2018b) continued to investigate the effects of the proposed TRL approach under low inter-task similarity, by varying the transfer belief and transfer period. To date, not much research has aimed to combine deep reinforcement learning and transfer learning to solve robotic collision avoidance problems, because (a) it is difficult to learn directly from raw pixel or distance sensory inputs, (b) it requires a large amount of training data, which is not easy to generate in real life, (c) the reward function is difficult to design, and (d) the inter-task similarity is poorly understood in different task settings. This research aims to close the gap between real-world collision avoidance and deep learning by studying transfer reinforcement learning at different levels of inter-task similarity and developing transfer strategies accordingly.

3. Belief-based TRL Approach

3.1 Overview

Before getting into the details of the mechanism for collision avoidance, we first introduce the basic idea and our overall research goal of integrated transfer reinforcement learning for developing intelligent systems. Machine learning is often divided into supervised learning, unsupervised learning and reinforcement learning. Reinforcement learning has the advantage of learning from the agent's own experience: the agent learns to pick actions at any state to maximize the total reward by interacting with the environment (Figure 11) through a scalar reward signal. Although reinforcement learning allows agents to acquire collision avoidance skills (Mataric, 1998; Fujii et al., 1998; Frommberger, 2008), one challenge is that it requires a large amount of training data, which is hard to obtain in real life considering the expense of building physical systems and conducting experiments.

Figure 11: The agent-environment interaction in reinforcement learning

On the other hand, recent progress in self-driving car research (Bojarski, 2016; Ohn-Bar & Trivedi, 2016) and deep learning, e.g., AlphaGo (Silver et al., 2016; Chen, 2016; Churchland & Sejnowski, 2016; Wang et al., 2016), has demonstrated that the experience of human "experts" represents a highly valuable source of intelligence and can be learned by machines through deep learning for building intelligent systems.
However, in many situations the access to human expertise data can be very limited, since it is hard, if not impossible, to acquire human experience data in all possible situations. How to effectively and efficiently combine human expertise with machine-driven self-learning remains a challenge. In this research we divide the learning method into direct learning and transfer learning (Figure 12). The areas covered are (deep) reinforcement learning through direct learning as well as transfer learning. In our research, we consider that a machine's "intelligence" depends on three fundamental capabilities:
• First, it must be able to "exploit" the existing knowledge or expertise to the maximum extent so that all known situations can be dealt with. This capability corresponds to transfer learning at a macro scale and deep learning mechanisms at a micro scale.
• Second, the machine must be able to "explore" unknown territories and develop new knowledge or expertise from its own experience. Reinforcement learning is a candidate for this capability.
• Lastly, depending on the level of dynamics of the task domain or environment, the machine must be able to "adapt" the ratio of exploitation over exploration in order to stay effective or even alive. More dynamic or changeable domains require more exploration. Human design or meta-level learning mechanisms are needed to deal with this issue.

In the long run, we attempt to develop an integrated machine learning technology that can (a) learn from multiple experts from diverse domains, (b) apply the learned expertise to explore new domains (e.g., a new domain requiring multiple domain expertise, or a domain becoming more complex), and (c) manage its own learning processes (i.e., exploitation and exploration) according to changes in task domains. The "domains" can be knowledge domains, such as mechanical design, and technical domains, such as robotic (e.g., robot, car, ship) collision avoidance. For simplicity, our current focus is on technical domains.

Figure 12: Research areas

In this research, we seek to develop an integrated learning mechanism that can take advantage of existing steering experience from either humans or other robots to learn about actions in new and more complex situations. More specifically, we propose a transfer reinforcement learning approach on top of the deep reinforcement learning algorithms, by introducing two key concepts: transfer belief and transfer period. By combining the experience from the "expert", the agent is able to reduce trial time and learn more complex tasks faster.

3.2 Deep reinforcement learning

We start with some conventional notation in reinforcement learning (RL). RL can be modelled as a Markov Decision Process (MDP), which is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
• $\mathcal{S}$: state space, the set of states of the environment.
• $\mathcal{A}$: action space, the set of actions that the agent can choose from.
• $\mathcal{P}$: transition probability matrix, $\mathcal{P}^{a}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s, A_t = a \right]$.
• $\mathcal{R}$: reward function, $\mathcal{R}^{a}_{s} = \mathbb{E}\left[ r_{t+1} \mid S_t = s, A_t = a \right]$.
• $\gamma$: discount factor, $\gamma \in [0, 1]$, which determines the importance of future rewards.

This thesis is focused on episodic RL problems, where the agent's experience can be broken into episodes. An episode consists of a sequence of states, actions, and rewards, starting from an initial state until the agent reaches the terminal state. The goal is to find a policy $\pi$ that can maximize the total reward per episode.
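To make the episodic objective concrete, the short sketch below computes the discounted return that such a policy tries to maximize; the reward sequence used here is purely hypothetical and only serves as an illustration.

```python
# Minimal illustration of the discounted return an episodic RL agent maximizes.
# The reward sequence below is hypothetical; it only serves as an example.
def discounted_return(rewards, gamma=0.99):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

episode_rewards = [-1, -1, -1, 200]   # e.g., a few step penalties, then a terminal goal reward
print(discounted_return(episode_rewards))
```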
In this research we use the deterministic policy $a = \pi(s)$. One may find the stochastic policy in other literature, which is defined as the conditional distribution of actions given states: $\pi(a \mid s) = \mathbb{P}\left[ A_t = a \mid S_t = s \right]$.

RL can be divided into model-based learning and model-free learning. In the real world, most problems are model-free; for instance, the transition probability matrix is unknown and the agent cannot predict the future state that results from executing a certain action. The Q-learning algorithm is a popular off-policy, model-free RL algorithm, which learns a value for each state-action pair and performs updates based on the Bellman equation at iteration $i$ (here $s, a, r, s'$ denote the current state, action, immediate reward, and next state, respectively):

$$Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right] \qquad \text{(Eq. 3)}$$

We then continue to summarize the deep reinforcement learning (DRL) algorithm (Mnih et al., 2013), which was originally applied to play Atari games. For an infinitely large state space such as a game window (a big array of pixel values), it is impossible to build a lookup Q-table and learn an optimal value for each state-action pair. DRL uses a deep neural network as the function approximator to approximate Q-values. A Q-network with weights $\theta_i$ can be trained by minimizing the loss function at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] \qquad \text{(Eq. 4)}$$

where $y_i = \mathbb{E}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \right]$ is the target Q-value for iteration $i$. The Q-network inside the max operator is called the target network. The gradient is calculated as follows:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right] \qquad \text{(Eq. 5)}$$

Various approaches have been proposed to stabilize and boost the learning process. The neural network structure in this research is built upon the following three approaches:

1. Experience replay and target network (Mnih et al., 2013)
The agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ (a tuple of <state, action, reward, next state>) are stored in a replay memory $\mathcal{D} = e_1, e_2, \ldots, e_N$ (N is the capacity of the replay memory). The replay memory holds a fixed number of recent experiences: as new ones come in, older ones are popped out. When training starts, mini-batches are randomly sampled from $\mathcal{D}$ and applied to update the network weights, in order to prevent the network from learning only from what it has recently been doing. The agent selects actions according to the $\epsilon$-greedy policy. The advantage of experience replay is that it increases data efficiency and breaks down the correlations between consecutive experiences. The target signal in Eq. 3 is calculated by

$$y_i = \mathbb{E}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \right] \qquad \text{(Eq. 6)}$$

Since the network weights are updated and change over time, if the same network is used to evaluate the target, the learning is more likely to become unstable. Thus, a target network (the network at the previous iteration $i-1$), whose weights are kept fixed when optimizing $L_i(\theta_i)$ at the $i$-th iteration, is used to stabilize the learning process.

2. Double DQN (van Hasselt et al., 2015)
In this paper, the authors claim that the standard DQN algorithm suffers from the problem of overestimation in some Atari games. The Double DQN algorithm is able to solve this problem by separating the target network, which is used for action evaluation, from the current network, which is used for action selection.
At each time step, the agent's experience is randomly assigned to update one of the two networks. Rewriting Eq. 5 with a few modifications of subscripts to keep consistency with the paper (van Hasselt et al., 2015), the target signal is

$$y_t^{\text{DQN}} = r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta_t^-) \qquad \text{(Eq. 7)}$$

The original DQN algorithm copies the online network weights to the target network, $\theta_t^- = \theta_t$, every $\tau$ steps and keeps them fixed on all other steps. In the Double DQN algorithm, the target signal becomes

$$y_t^{\text{DoubleQ}} = r_{t+1} + \gamma\, Q\!\left( s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t);\ \theta_t' \right) \qquad \text{(Eq. 8)}$$

The selection of the action is still based on the online network weights $\theta_t$, while the second set of weights $\theta_t'$ is used to evaluate it. The use of a second network further stabilizes the learning and reduces overoptimism.

3. Dueling DQN (Wang et al., 2016b)
In standard DQN, at each update of the Q-values only the value of one of the actions is updated whereas the others remain untouched, and the output of the DQN is the value of each state-action pair given the current state. Dueling DQN separates the Q-value into a state value and action advantages (Figure 13), so that when some experience is sampled for updating the weights, the state value is updated more frequently instead of only the value for that specific state-action pair. The dueling network is useful for learning which states are or are not valuable, without having to explore each action for each state.

Figure 13: The standard DQN (top) and the dueling network structure (bottom)

3.3 Transfer reinforcement learning (TRL)

An expert network is obtained by training on a source task. It is common practice to use an $\epsilon$-greedy policy in reinforcement learning to balance exploration and exploitation. The traditional $\epsilon$-greedy policy (Figure 14, left) can be expressed as follows: with probability $\epsilon$ the agent explores, and with probability $1 - \epsilon$ it exploits; $\epsilon$ is often decreased linearly through the learning process:

$$\epsilon = \epsilon_0 \left( 1 - \frac{t}{T_{\text{explr}}} \right) \qquad \text{(Eq. 9)}$$

where $T_{\text{explr}}$ is the total exploration length over which $\epsilon$ is annealed to its minimum value.

In order to efficiently balance exploration and exploitation in the target task, a new transfer phase is added to the $\epsilon$-greedy policy, which involves two crucial concepts: transfer belief and transfer period. The definitions are given below.

Definition 1 - Transfer belief β: how much confidence the agent puts in the expert experience. Mathematically, the transfer belief is the probability of the agent picking the transfer action suggested by the expert network.

Definition 2 - Transfer period Γ: how long the agent's decisions are influenced by the expert network. After the transfer period, the agent will explore the environment on its own using the traditional $\epsilon$-greedy policy (note: $\Gamma \le T_{\text{explr}}$).

The transfer belief is linearly decreasing over the transfer period:

$$\beta(t) = \begin{cases} \beta_0 \left( 1 - \dfrac{t}{\Gamma} \right), & \text{for } 0 \le t \le \Gamma \\[4pt] 0, & \text{for } t > \Gamma \end{cases} \qquad \text{(Eq. 10)}$$

Then, the new $\epsilon_T$-greedy policy (T stands for transfer) consists of three phases: a transfer phase, an exploration phase, and an exploitation phase, expressed as follows (Figure 14, right):
(a) Transfer: With probability $p_1 = \beta$ (for $t \le \Gamma$), the agent picks the transfer action. The definition of the transfer action is given in Section 3.4; it refers to the action(s) suggested by the expert network. Note that if $t > \Gamma$, which means the transfer period is over, $p_1 = 0$.
(b) Exploration: With probability $p_2 = \epsilon (1 - \beta)$, pick a random action.
(c) Exploitation: With probability $p_3 = (1 - \epsilon)(1 - \beta)$, pick the best action calculated by the agent's current network.
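The three phases above can be summarized in a short action-selection routine. The sketch below is only an illustration of the $\epsilon_T$-greedy selection under the schedules of Eq. 9 and Eq. 10; the functions `expert_action` and `best_action` are hypothetical placeholders for the expert network's suggested action and the greedy action of the agent's current network.

```python
import random

def transfer_belief(t, beta0, Gamma):
    """Eq. 10: linearly decaying transfer belief, zero after the transfer period."""
    return beta0 * (1.0 - t / Gamma) if t <= Gamma else 0.0

def exploration_rate(t, eps0=1.0, eps_min=0.1, T_explr=1_000_000):
    """Eq. 9: linearly annealed exploration rate, held at its minimum afterwards."""
    return max(eps_min, eps0 * (1.0 - t / T_explr))

def select_action(t, actions, expert_action, best_action, beta0, Gamma):
    """epsilon_T-greedy: transfer with prob. beta, explore with prob. eps*(1 - beta),
    exploit with prob. (1 - eps)*(1 - beta)."""
    beta = transfer_belief(t, beta0, Gamma)
    eps = exploration_rate(t)
    u = random.random()
    if u < beta:                              # (a) transfer phase
        return expert_action()
    elif u < beta + eps * (1.0 - beta):       # (b) exploration phase
        return random.choice(actions)
    else:                                     # (c) exploitation phase
        return best_action()
```

Setting beta0 = 0 in this sketch recovers the traditional ε-greedy policy, i.e., direct learning without transfer.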
Figure 14: The $\epsilon$-greedy policy (left) and the $\epsilon_T$-greedy policy (right)

Transfer belief increases the probability of picking the transfer actions suggested by the expert network. In general, these transfer actions may or may not help the agent perform the target task better. But when the source task and target task are similar, the learned features/knowledge from the source task can help the agent make better choices so that it does not have to extensively try every action. By altering the length of the transfer period, the designer can control how long the agent's decision policy is influenced by the expert network. With a small transfer period, the agent only receives guidance from the expert at the beginning and then explores the environment by itself after the transfer period, which does not differ much from direct learning without transfer. Given a larger transfer period, the agent will continuously extract knowledge from the expert network and pick actions based on it.

3.4 Transfer action

In the proposed TRL framework, the agent has different probabilities of choosing a random action, the transfer action, or the current best action. In this section, the definition of the transfer action is discussed in detail. This section also covers transfer from single and multiple source task(s) (Figure 15), and how to build a conflict resolution mechanism in the case of multiple source tasks.

Consider the transfer reinforcement learning problem where the number of total available (valid) actions in the action space is $n$ and there are $K$ similar source tasks, each yielding an expert network after training. The goal is to transfer the learned experience from these $K$ source tasks to a target task. In this research, the number of target tasks is 1, and the agent has 7 actions in total ($n = 7$). The number of source tasks $K$ can be 1 or more.

Definition 3 – Transfer action space $\mathcal{A}_{TF}$: the subset of the whole action space which contains all transfer actions.
Definition 4 – Enforcing action space $\mathcal{A}_{enf}$: the subset of the whole action space which contains the $\eta$ highest actions (those with the highest Q-values) of every expert network.
Definition 5 – Avoiding action space $\mathcal{A}_{avd}$: the subset of the whole action space which contains the $\mu$ lowest actions (those with the lowest Q-values) of every expert network.

(a) If there is only one source task ($K = 1$) and one target task, a neural network is first trained on this source task, i.e., the expert network. When learning the new target task, the agent has a probability $p_1 = \beta$ (the transfer belief, Section 3.3) of choosing the transfer action, and the transfer action is simply any item (component) in the transfer action space. In this research, $\eta$ and $\mu$ are both set to 3, $\eta = \mu = 3$, and it is assumed that $n > \eta + \mu$, which guarantees that the enforcing and avoiding action spaces have no shared action. In this case, the enforcing and transfer action spaces are the same, $\mathcal{A}_{TF} = \mathcal{A}_{enf}$, containing the best 3 actions of the expert network.

(b) If the agent is given multiple source tasks, $S_1, S_2, \ldots, S_K$, suppose the agent has been trained to solve each individual source task and has obtained the corresponding expert networks $E_1, E_2, \ldots, E_K$.

Figure 15: Single and multiple source tasks

The outputs of each expert network (Q-network) are the Q-values of each action, which can be ranked from highest to lowest. For expert $E_i$, after the ranking process, the
K EE E i E 50 Q-values are rearranged as . For indexing purpose, ranking Q-values results in a new matrix , where is the best action in the expert network (Figure 16). Here refers to the element in the matrix, and is the same as where the expert network is emphasized, = . Figure 16: Ranking all actions’ Q-values Thus, the enforcing action space for all experts is the set of all actions with the highest Q-values among all experts, ; and the avoiding action space that contains all the actions with lowest Q-values is . Considering that both and may have repetitive elements, which means that some experts are enforcing and avoiding the same action, the enforcing and avoiding action spaces can be rewritten as following: ( ) ( ) ( ) ( ) ( ) ( ) 12 3 2 1 ... iii i i i nn n EEE E E E Qa Qa Qa Qa Qa Qa -- ³³³³ ³ ³ ranked A ij a th j th i ij a ranked A i j E a i E ij a i j E a { } 12 3 enf , , 1,2,..., iii EEE aaa i K == A { } 21 avd , , 1,2,..., ii i nn n EE E aa a i K -- == A enf A avd A 51 Eq. 11 Eq. 12 In the above equations Eq. 11 and Eq. 12, means that action occurs times in the enforcing / avoiding action space. and are the numbers of unique elements in the enforcing and avoiding action space, respectively. In transfer action space , enforcing action space or avoiding action space , repetitive elements are used to retain the original occurring frequency. It is common that there exists a conflict among different expert networks, or mathematically . In this case, it requires a conflict resolution mechanism among experts to further guide agent’s decision making. Definition 5 – Expert priority : the quantitative measure of how important each expert experience is. Each expert priority is between the range 0 and 1 and the total sum of expert priority is 1. Eq. 13 { } { } enf enf 11 22 enf enf enf enf enf enf enf enf enf enf , ,..., 1,2,..., NN ii aa a ai N rr r r = == A { } { } avd avd 11 22 avd avd avd avd avd avd avd avd avd avd , ,..., 1,2,..., NN ii aa a ai N rr r r = == A a r a r enf N avd N TF A enf A avd A enf avd d , an aa a $Î Î !! ! AA E P i E P 1 1 i k E i P = = å 52 Take collision avoidance task as an example, the agent needs reach a target goal area while avoiding collisions with other obstacles. It needs to constantly balance the two objectives, 𝐸 " : follow_goal and 𝐸 # : avoid_collision. Here the objectives and the expert networks are used interchangeably. The follow_goal objective requires the vehicle / agent to head towards the goal direction, and the avoid_collision objective requires the agent to not move towards the obstacle direction. Consider the following two tasks (Figure 17): Task A (left) – the environment only has a goal (no obstacles); Task B (right) – the environment only has random moving obstacles (no goal). Both tasks are easy to solve and each objective can be simply formulated as a policy for agent’s decision making (learning is not necessary in both tasks due to their simplicity). Thus, for Task A, the optimal policy is to move towards the goal direction; and for Task B, the optimal policy is to avoid going into the obstacle directions. Figure 17: Multiple policies in collision avoidance 53 If there is no conflict among experts, i.e., and have no shared action, the transfer action space is the set of the suggesting actions from all experts. If there is conflict among experts, it can be further divided in two categories: (1) with expert priority, and (2) without expert priority. 
In situations where the conflicting experts do not have priority, the agent will randomly trust one of the conflicting experts. Thus, the transfer action space is again the set of all enforcing actions, $\mathcal{A}_{TF} = \mathcal{A}_{enf}$. However, in the collision avoidance task, for example, $E_2$: the avoid_collision objective has absolute priority. As a result, $P_{E_1} = 0$ and $P_{E_2} = 1$. The agent needs to cancel out those actions which are in the avoiding action space from the transfer action space, according to the expert with priority:

$$\mathcal{A}_{TF} = \left\{ \left(a_i^{enf}\right)^{\rho_i^{enf} \cdot \left[\,1 - \mathbb{1}\left(a_i^{enf} \in \mathcal{A}_{avd}\right)\right]} \;\middle|\; i = 1, \ldots, N_{enf} \right\} \qquad \text{(Eq. 14)}$$

In the above Eq. 14, the operator $\mathbb{1}(\cdot)$ returns 1 when the argument inside the parentheses is true and returns 0 when it is false. In general, given the expert priorities $P_{E_1}, \ldots, P_{E_K}$, if $a_i^{enf}$ exists in the avoiding action space, specifically in the $x$-th expert's row of $\mathbf{A}_{ranked}$ (note that $x$ may not be unique, i.e., $a_i^{enf}$ may be avoided by multiple experts, in which case $x$ is a list of multiple values), the transfer action space is calculated by the following:

$$\mathcal{A}_{TF} = \left\{ \left(a_i^{enf}\right)^{\rho_i^{enf} \cdot \left[\,1 - \operatorname{int}\left( \mathbb{1}\left(a_i^{enf} \in \mathcal{A}_{avd}\right) \cdot \max_x P_{E_x} \right)\right]} \;\middle|\; i = 1, \ldots, N_{enf} \right\} \qquad \text{(Eq. 15)}$$

where $\operatorname{int}(\cdot)$ converts its argument into an integer. It is worth noting that an expert with absolute priority is a special case of the above equation, where $\max_x P_{E_x} = 1$.

3.5 Agent learning behavior and reward design

A video game was created in Pygame to conduct case studies of transfer reinforcement learning on collision avoidance. The game environment consists of an RL agent (green), obstacles and boundary walls (red), and a goal area (orange), as shown in Figure 18. Through interacting with the game environment, the agent needs to figure out how to reach the goal area while avoiding collisions with obstacles and walls. Currently the orange goal is set to be a rectangular area, so that the agent is more likely to reach the goal during the early learning stage, when it randomly selects actions to explore the environment. It is believed that after clearing all potential collisions, the agent will be able to find a path to the goal with a simple path planning algorithm.

• The state is defined as the array of pixel RGB values of the game window, as shown in Figure 18.

Figure 18: Game environment

• The action space is composed of seven discretized actions, $a_1$ through $a_7$, with different combinations of linear velocity $v$ and angular velocity $\omega$, as indicated in Table 1.

Table 1: Agent actions
Action   v (pixels)   ω (radians)
a1       5            0.35
a2       5            0.2
a3       5            0.1
a4       10           0
a5       5            -0.1
a6       5            -0.2
a7       5            -0.35

• The reward function is defined as:

$$r = \begin{cases} +200 & \text{if the agent reaches the goal position} \\ -900 & \text{if the agent hits any obstacle} \\ -1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 16)}$$

The design of the reward function is a crucial part of DRL; it serves as the guiding signal for the agent to make progress during the learning process. In the Atari games (Figure 19), the reward / score is easily obtained and well designed, so that an experienced player would easily achieve a higher score than an amateur. In most cases, however, the reward function is hard to design and usually sparse.

Figure 19: Rewards in an Atari game

In this research, we also investigate the detailed reward function design to see how it influences the agent's learning performance and learned behavior, as shown in Eq. 17. Most of the reward function is the same.
Every episode ends when the agent either reaches the goal or hits an obstacle. The agent receives a positive reward of +200 for reaching the goal, and -200 for hitting an obstacle. The only difference lies in the shaping reward: for every step from start to end, the agent receives an additional reward, which is computed from $r_g$ and $r_{dev}$. Thus, for any time step $t$, the reward function is defined as follows:

$$r = \begin{cases} +200 & \text{if the agent reaches the goal position} \\ -200 & \text{if the agent hits any obstacle} \\ \omega_g\, r_g + \omega_{dev}\, r_{dev} & \text{otherwise} \end{cases} \qquad \text{(Eq. 17)}$$

In Eq. 17, $r_g$ is called the goal reward, the reward for moving towards the goal direction, and $r_{dev}$ is called the deviation reward, the (negative) reward for deviating from the goal direction. In the current reward design, $r_g$ and $r_{dev}$ are computed by a linear function (Eq. 18) of the agent's current location coordinates, $y$ and $x_{dev}$, where $x_{dev}$ is the agent's deviation from the horizontal center of the window, calculated by $x_{dev} = \operatorname{abs}(x / width - 1/2)$, and $width$ is the horizontal length of the game window. $k_g$, $b_g$, $k_{dev}$ and $b_{dev}$ are constant coefficients. To make the learning more stable, $r_g$ and $r_{dev}$ are clipped to $[-1, 1]$ and $[0, 0.5]$, respectively. $\omega_g$ and $\omega_{dev}$ are the relative weights on $r_g$ and $r_{dev}$.

$$\begin{cases} r_g = k_g\, y + b_g, & r_g \in [-1, 1] \\ r_{dev} = k_{dev}\, x_{dev} + b_{dev}, & r_{dev} \in [0, 0.5] \end{cases} \qquad \text{(Eq. 18)}$$

Figure 20 illustrates the proposed transfer reinforcement learning process. An expert network $N_e$ is first obtained by training on the source task, which involves a single obstacle. In the target task, the agent follows the $\epsilon_T$-greedy policy to select actions with probabilities $p_1$, $p_2$, and $p_3$ as described in Section 3.3. After receiving a reward $r_t$ from the environment, the agent stores the current experience $e_t$ into the experience replay memory. The currently learned network $N_c$ is then updated by sampling mini-batches from the experience replay, as shown in Figure 20.

Figure 20: Agent learning behavior

3.6 Task complexity and similarity model

In order to design case studies for the proposed TRL framework, task complexity and similarity are first analyzed in detail. The complexity of a collision avoidance task is influenced by many factors, such as the number of obstacles, the size of (each) obstacle, the dynamics of the obstacles (static, moving at a constant speed, or moving at a random speed), the dynamics of the own vehicle / agent (the vehicle may turn quite fast, i.e., the maximum angular velocity is large, or the vehicle may be less maneuverable, i.e., the maximum angular velocity is small), the communication between the own vehicle and other vehicles, etc. Among all these factors, in this research we identify three basic and crucial ones and construct a complexity space accordingly (Figure 21): (a) the number of obstacles, (b) the dynamics of the obstacles, and (c) the dynamics of the own vehicle / agent. Other factors are either considered to play a minor role or left for future in-depth research, and are thus kept constant throughout all the case studies.

The complexity in a TRL setting can be categorized into internal complexity and external complexity (Figure 21). Internal complexity is usually caused by different numbers of actions in the action space (for instance, Task A has 3 actions and Task B has 7 actions) or by different semantic meanings of the action space across tasks (for instance, in Task C, a = 1 means "Turn Left" but in Task D, a = 1 means "Turn Right").
In the collision avoidance task, the vehicle / agent dynamics is considered internal complexity, and the number of obstacles and the obstacle dynamics are external complexity. The external complexity is further divided into individual complexity and aggregate complexity. The individual complexity comes from another individual agent in the environment, which may or may not interact with the own agent, and is caused by the obstacle dynamics (i.e., the moving direction and speed of the other obstacle / agent). On the other hand, the number of obstacles determines the aggregate complexity, which grows as the number of agents in the environment increases.

Figure 21: Types of complexity in collision avoidance

As shown in Figure 22, the complexity space consists of three element axes: x – the obstacle dynamics plane (red), y – the number of obstacles plane (green), and z – the vehicle dynamics plane (blue).

Figure 22: The complexity space

Each task can be mapped into the complexity space. Given two tasks $task_i\,(OD_i, NO_i, VD_i)$ and $task_j\,(OD_j, NO_j, VD_j)$ (note: OD – obstacle dynamics, NO – number of obstacles, VD – vehicle dynamics), the inter-task similarity is defined as the distance between the two points in the complexity space:

$$similarity_{i,j} = \sqrt{ (OD_i - OD_j)^2 + (NO_i - NO_j)^2 + (VD_i - VD_j)^2 } \qquad \text{(Eq. 19)}$$

In general, the three factors may not be equally important in determining the inter-task similarity. Thus, three additional weights are needed:

$$similarity_{i,j} = \sqrt{ \lambda_1 (OD_i - OD_j)^2 + \lambda_2 (NO_i - NO_j)^2 + \lambda_3 (VD_i - VD_j)^2 } \qquad \text{(Eq. 20)}$$

Designing the relative weights $\lambda_1$, $\lambda_2$, $\lambda_3$ requires a deep understanding of the complexity of a collision avoidance task with respect to the detailed factors, which this research aims to investigate and reveal through empirical results. The choices of $\lambda_1$, $\lambda_2$, $\lambda_3$ are important for TRL in collision avoidance tasks. The designer needs to develop different transfer strategies (i.e., how to choose the transfer belief and transfer period) to meet different levels of inter-task similarity. It will be shown later in the case studies that a poorly designed transfer strategy leads to more negative transfer. Transferring across tasks with internal complexity is a crucial process and can be applied in other domains as well. For collision avoidance tasks, if knowledge collected from one vehicle can be transferred to another type of vehicle, it can provide the designer with a deeper understanding of the underlying reasoning of collision avoidance.

For simplicity, we start with the inner-most cube of the complexity space, which is divided into 8 sub-tasks as shown in Figure 23 and Table 2. Designing the weights $\lambda_1$, $\lambda_2$, $\lambda_3$ is equivalent to determining how sensitive the inter-task similarity is to each of the three factors / criteria.

Figure 23: Different collision avoidance tasks

Table 2: Task specifications
Task #   Obstacle dynamics   Number of obstacles   Vehicle dynamics
1        Static              Single                Vehicle type 1
2        Static              Multiple              Vehicle type 1
3        Static              Single                Vehicle type 2
4        Static              Multiple              Vehicle type 2
5        Dynamic             Single                Vehicle type 1
6        Dynamic             Multiple              Vehicle type 1
7        Dynamic             Single                Vehicle type 2
8        Dynamic             Multiple              Vehicle type 2

To investigate the respective importance of each criterion, the following transfer paths are possible:
• OD: T1 → T5, T2 → T6, T3 → T7, T4 → T8
• NO: T1 → T2, T3 → T4, T5 → T6, T7 → T8
• VD: T1 → T3, T2 → T4, T5 → T7, T6 → T8
Notice that the tasks in each of the above transfers differ only in their respective criterion.
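To illustrate how Eq. 20 would score such pairs, the sketch below computes the weighted distance in the (OD, NO, VD) complexity space for a few of the tasks in Table 2. The numeric encodings of the three criteria and the unit weights are hypothetical choices made only for this illustration; a smaller distance corresponds to a higher inter-task similarity.

```python
import math

def task_distance(task_i, task_j, weights=(1.0, 1.0, 1.0)):
    """Weighted Euclidean distance in the (OD, NO, VD) complexity space (Eq. 20)."""
    l1, l2, l3 = weights
    (od_i, no_i, vd_i), (od_j, no_j, vd_j) = task_i, task_j
    return math.sqrt(l1 * (od_i - od_j) ** 2 +
                     l2 * (no_i - no_j) ** 2 +
                     l3 * (vd_i - vd_j) ** 2)

# Hypothetical encodings: OD (0 = static, 1 = dynamic), NO (1 = single, 2 = multiple),
# VD (0 = vehicle type 1, 1 = vehicle type 2)
T1, T2, T6 = (0, 1, 0), (0, 2, 0), (1, 2, 0)
print(task_distance(T1, T2))   # e.g., the high-similarity pair of Case I
print(task_distance(T1, T6))   # e.g., the lower-similarity pair of Case II
```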
But in general, one might want to try other transfer paths, for example from T1 to T8, which is less similar and more difficult to transfer. As a first step, we have conducted three case studies so far, in which the internal complexity (vehicle dynamics) is ignored (i.e., the same type of vehicle is used). Case I (from T1 to T2) has high inter-task similarity; Case II (from T1 to T6) has low inter-task similarity; and Case III (from T5 to T6) has medium inter-task similarity. For the dynamic case – Task 6 – we assume that obstacles move at a constant speed and that the direction only changes when an obstacle reaches the boundary of the game window. This assumption reduces the task complexity; as a result, the previous state has a strong correlation with the later state, which makes it possible to make decisions based on only one static image.

It is worth noting that the complexity and similarity model is different from the concept of expert priority, which was introduced in Subsection 3.4. Expert priority measures the relative importance (weight) of each expert's experience in the transfer process. The importance / priority may be due to safety concerns, in which case the source task and target task may not be similar. Thus, similarity and priority are two different concepts, which might be dependent on each other in certain cases.

For a complex task, the agent might need to integrate knowledge from multiple source tasks. This teacher–student pattern is also observed in the human learning process. Every student has multiple teachers, each specialized in his / her own domain, for example math, physics, chemistry, etc. Some challenging problems might require interdisciplinary knowledge. Even within the same domain of collision avoidance, it is crucial to integrate knowledge from many simple scenarios and expand it to more complex environments. Take the collision avoidance problem at sea as an example: there are many standard scenarios that the navigation / path-planning system needs to pass, and these serve as the basic intelligence for handling more challenging tasks, such as a congested waterway with multiple vessels (Figure 24).

Figure 24: Standard scenarios for collision avoidance at sea: crossing situation (top), head-on situation (bottom)

4 Case studies and results

The collision avoidance game system consists of two modules: a visualization module (Pygame) and a machine learning module (TensorFlow). The visualization module creates the graphical display for the system; it reads the current environment state and simulates the kinematics and dynamics. After taking some action, the agent receives a reward, based on which a replay memory is constructed and sent to the machine learning module. TensorFlow does the heavy lifting of sampling batches of past experience and updating the network weights, and then sends the updated weights back to the visualization module, as shown in Figure 25.

Figure 25: Collision avoidance game system architecture (visualization module in Pygame: create graphical display, initialize environment, read current state, take an action, receive a reward and observe the new state, store into replay memory; machine learning module in TensorFlow: sample mini-batches from replay memory, update network weights)

4.1 Transfer reinforcement learning

The network structure is the same as in the original DQN paper (Mnih et al., 2013), with an 84 x 84-pixel input and an output of 7 actions.
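As a point of reference, a minimal sketch of such a Q-network is given below, written with tf.keras. The convolutional layer sizes follow the architecture reported in Mnih et al. (2013) and are therefore an assumption here, since this thesis only states that the structure matches that paper; likewise, the number of input channels is an assumption that depends on the case study (e.g., a stack of grayscale frames).

```python
import tensorflow as tf

def build_q_network(n_actions=7, channels=4):
    """Q-network in the style of Mnih et al. (2013): two convolutional layers and one
    dense layer, mapping an 84 x 84 input to one Q-value per action."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(84, 84, channels)),
        tf.keras.layers.Conv2D(16, kernel_size=8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(32, kernel_size=4, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(n_actions),   # linear output layer: one Q-value per action
    ])

q_net = build_q_network()          # online network
target_net = build_q_network()     # target network, periodically updated from q_net
```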
The case studies were trained using the Adam optimizer with a learning rate of 0.001. The discount factor γ is 0.99. In the source task the agent follows the ε-greedy policy and in the target task the agent follows the ε_T-greedy policy, with the exploration rate ε annealed from 1.0 to 0.1 over the first 1 million frames (1 frame = 1 state). The replay memory consists of the 50,000 most recent frames, and 50,000 episodes were trained in total (1 episode = from starting position to ending position). The initial transfer belief is chosen to be 0.9, 0.5, 0.3, or 0.1. The transfer period can be the first 150K, 300K, 700K or 1 million frames. The choice of hyper-parameters is summarized in Table 3.

Table 3: Case parameters
Parameter                        Source task   Target task
Replay memory size               50,000        50,000
Mini-batch size                  32            32
Discount factor γ                0.99          0.99
Learning rate                    0.001         0.001
Total training episodes          50,000        50,000
Exploration rate ε               1 → 0.1       1 → 0.1
Annealing frames                 1 million     1 million
Transfer period Γ (frames)       N/A           150K / 300K / 700K / 1 million
Initial transfer belief β0       N/A           0.1 / 0.3 / 0.5 / 0.9

4.1.1 Case I: High inter-task similarity (T1 → T2)

In the source task, at the beginning of each episode (game play), a random obstacle is generated within the dashed rectangle. In the target task, two random obstacles are generated at the beginning. The obstacle and the agent are of the same size (Figure 26).

Figure 26: Source task (left): one static obstacle; Target task (right): two static obstacles

Baseline Cases
For the purpose of comparison, we established two baseline cases. The first baseline case is for an agent to learn the "target task – two obstacles" by "bootstrap", i.e., the neural network is randomly initialized. The dark red lines shown in Figure 27 and Figure 28 indicate the learning performance of this baseline case. As the figures show, starting from scratch requires more time for the agent to learn the task. In particular, it takes much longer training for the agent to become capable of dealing with the two-obstacle collision avoidance.
We believe that the difference in learning speed between these two baseline cases indicates the level of similarity of the source task and target task domain. Transfer reinforcement learning (TRL) cases with varying transfer period: Our primary simulation runs of TRL processes have revealed that the transfer period plays a e e e 72 key role in affecting learning speed. Figure 27 illustrates the learning performance of varying transfer period from 150K, 300K, 700K, to 1M frames with yellow, blue, green and pink colors, respectively. As shown in Figure 27, shorter transfer period Γ means shorter period of expert supervision—i.e., to use expert network Ne to select actions (also see Figure 20). From a learning speed point of view, the results in Figure 27 indicate that longer transfer periods lead to better learning performance, with the effect diminishing as it becomes sufficiently long (after 700K frames). When the transfer period is getting closer to 1 million frames—i.e., the annealing time when decreases to 0.1—the performance decreases. Comparing with the two baseline cases discussed above, the positive impact of expert supervision is considerably large, especially until the 200 (x100) episodes range. Figure 27: Average performance of varying transfer periods e 73 2. Learning Variance In addition to learning speed, we identified the learning variance as an important measure of learning performance since in most intelligent engineering systems, the consistency of learning performance is very much demanded. Figure 28 illustrates the learning variance multiple learning runs with different transfer periods of first 150K, 300K, 700K, 1M frames. Each dashed color represents an independent trial with a random seed. Each transfer period has 10 trials (running 10 random seeds) in total. The red curves are the two baseline cases. The standard deviation of each transfer period case before convergence (from 0 to 200 (x100) episodes) is shown in Figure 29. It can be seen that the variances of different transfer periods share a similar pattern: decreases at beginning, then increases, and finally decreases again as the learning converges. A careful examination of Figure 29 indicates that the overall variances are larger for both short transfer period case (150K frames) and long transfer period case (1M frames), while the 300K-700K transfer period cases appear to have less variance for different learning trials, exhibiting more consistent learning performance of the system. 74 Figure 28: Different student performances under each transfer period Episodes (x100) Episodes (x100) Episodes (x100) Episodes (x100) Γ = 150K Γ = 300K Γ = 700K Γ = 1M 75 Figure 29: Standard deviation plot of varying transfer periods before convergence Episodes (x100) 76 4.1.2 Case II: Low inter-task similarity (T1 à T6) The second case study is considered to have low inter-task similarity, compared with Case I. The source task and target task are shown in Figure 30. Same as Case I, the obstacle is the same size as the agent. In the source task, at the beginning of each episode (game play), a random obstacle is generated within the dashed rectangle. In the target task, the obstacles are moving at a constant speed (20 pixels/time-step), as shown in Figure 30. Figure 30: Source task T1(left): one static obstacle; Target task T6 (right): two moving obstacles Each case result is obtained by running with 8 different random seeds (the shaded area in Figures 31, 32, 33, 34, 35). 
The darker line shows the average performance of these 8 runs. In each of these cases, the network is first initialized with the pre-trained weights from the expert network trained on the source task. The baseline (green) is constructed by the agent exploring the environment using only the ε-greedy policy.

Figure 31 shows the performance of β0 = 0.9 and Γ = 700K (orange) compared with the baseline. The choice of β0 = 0.9 and Γ = 700K comes from the previous study of Case I, from T1 to T2. As can be seen, the jump-start is still obvious due to the initially high transfer belief. But overall the boosting effect of transfer learning is rather small. The average performance almost overlaps with the baseline, which implies that the expert experience does not help much in the new context, where the target task and source task have low similarity. Two ways to reduce the transfer effect are (a) decreasing the transfer period and (b) decreasing the transfer belief.

(a) Decrease the transfer period: The first option is to decrease the transfer period Γ from 700K to 300K, while keeping the initial transfer belief β0 = 0.9 the same (Figure 32). During the early stage, the performance is better than the baseline. However, after the transfer period, the learning variance starts to grow. Though the maximum performance is still better than the baseline, many students perform worse than the baseline. The average performance is slightly higher than the baseline, but should not be considered an improvement.

Figure 31: Performance of β0 = 0.9 and Γ = 700K (x-axis: episodes x100)

Figure 32: Performance of β0 = 0.9 and Γ = 300K (x-axis: episodes x100)

(b) Vary the transfer belief: The second option is to decrease the initial transfer belief β0. Various transfer beliefs have been tested: 0.9 (Figure 31), 0.5 (Figure 33), 0.3 (Figure 34) and 0.1 (Figure 35), which measure the initial probability of the agent picking the transfer action suggested by the expert network (this probability decreases linearly to 0 by the end of the transfer period). As shown in Figures 33, 34, and 35, the jump-start effect is less obvious compared to β0 = 0.9. Additionally, when β0 = 0.5, as shown in Figure 33, many students perform much better than the baseline after the transfer period, and the learning variance is very low. This pattern is unique and cannot be found for the other transfer beliefs (β0 = 0.9, 0.3, 0.1). Another interesting pattern is that a higher transfer belief (β0 = 0.9) helps the student perform much better in the early learning stage by frequently trying transfer actions; however, this does not guarantee better performance in the later stage. As can be seen in Figure 31, the average convergence time for β0 = 0.9 is almost the same as the baseline. On the other hand, though the early-stage performance almost overlaps with the baseline in the cases with smaller transfer beliefs, as time proceeds the learning starts to differentiate from the baseline.

Figure 33: Performance of β0 = 0.5 and Γ = 700K (x-axis: episodes x100)

Figure 34: Performance of β0 = 0.3 and Γ = 700K (x-axis: episodes x100)

Figure 35: Performance of β0 = 0.1 and Γ = 700K (x-axis: episodes x100)

4.1.3 Case III: Medium inter-task similarity (T5 → T6)

In Case I and Case II, it was found that in the high inter-task similarity case the optimal transfer belief is 0.9, whereas in the low similarity case the optimal transfer belief is 0.5. In the third study, a new initial transfer belief β0 = 0.7 has been chosen, between 0.5 and 0.9, as a comparison case.
The baseline (without transfer) is marked by the green shaded area, and its average performance is presented in dark blue. Two transfer cases have been compared: (1) β0 = 0.5, marked by the orange shaded area with the red curve representing the average reward; and (2) β0 = 0.7, marked by the gray shaded area with the black curve representing the average reward. As can be seen from Figure 36, both transfer cases show jump-start effects and converge to the optimal performance much faster than the baseline. In addition, the average rewards of the two transfer cases almost overlap with each other (black and red curves), but are much better than the baseline (blue). However, in the case where the initial belief β0 = 0.7, the learning variance is much lower than in the case of β0 = 0.5. Compared with the previous two cases, high similarity (Section 4.1.1) and low similarity (Section 4.1.2), it can be concluded that as the similarity decreases, it is better to start with a smaller initial belief, which can not only boost learning speed but also reduce learning variance.

Figure 36: Performance in the medium inter-task similarity case (x-axis: episodes x100)

4.2 Transfer path

In this section, two different transfer paths have been tested: (a) a single source task, and (b) multiple source tasks. The target task is the same, containing a goal and two dynamically moving obstacles (Figure 37). In case (a), the single source task contains a goal and one moving obstacle; the only difference from the target task is the number of obstacles. In case (b), the agent has two source tasks: source task S1 contains only a goal, and source task S2 contains only two moving obstacles. None of the cases in this section have boundaries, which can be treated as open water in the real world.

For the single source task, the agent is first trained to solve the source task and then transfers the learned experience to the target task. For the multiple source tasks, S1 refers to the source task containing a goal (no obstacles) and S2 refers to the source task containing only two moving obstacles (no goal). The agent is trained to solve S1 and S2 separately, and two expert networks E1 and E2 are obtained after learning. Each source task is simple enough that the expert network policies can be described as follows. E1: move towards the goal. E2: avoid moving in the obstacle directions; all other actions are valid, and the agent can choose any valid action to randomly move around in the environment. E2 has absolute expert priority (Section 3.4), P_E1 = 0, P_E2 = 1. If E1 and E2 have conflicting actions, the agent prioritizes the enforcing and avoiding actions of E2. In this way, the agent treats the actions that lead towards the obstacle directions as invalid.

Figure 37: Different transfer paths: (top) single source task; (bottom) multiple source tasks

There are other ways to set up the expert priority for E1 and E2. For instance, consider a more complex task which contains many dynamically moving obstacles. The optimal strategy is to avoid the closest obstacle, because it carries the largest collision risk. After clearing the danger of colliding with the closest obstacle, the agent can focus on the second closest obstacle, and so on. Inevitably, the agent's optimal decision is to move towards the
In this research, the maximum number of obstacles is only 2, so is a reasonable setup. Some modifications of the game environment are needed to incorporate the environment dynamics. The goal position plays an important role in the above target task, upon which the agent’s decision making is dependent. Thus, the simulation has been modified to convert the global image into a local image using relative coordinates (See Figure 38). In this new game, the green circle is the own agent, and the red circle is the moving obstacle. The obstacle position is relative to the own agent. To add more temporal information, the past trajectory of the moving obstacle has been indicated in orange color (most recent 4 time-steps). And a white line is drawn from the center of own agent to the goal (relative) position to indicate the direction of the goal position. When the agent is getting closer to the goal, the length of the white line is decreasing. 2 1 E P = 2 1 E P = 89 Figure 38: Game setup using relative positions The learning performance in different transfer paths is shown in Figure 39. Each color is the average performance through 8 independent runs (red: baseline/without transfer; blue: single source task; green: multiple source tasks). Each run uses the same transfer belief and transfer period k frames. In this case study, learning speed and asymptotic performance are used to measure the effects of TRL. Learning speed (Time to threshold) The learning speed in all three cases do not vary that much. At 350 (x100) episodes, the learning curve converges to the optimal. In the multiple_expert (multiple source tasks) transfer, the agent collects a lot more rewards in the early learning stage (green curve). But this jump-start effect fades away as learning proceeds. Between 50 (x100) and 250 (x100) episodes, the three curves overlap with each other. 0 0.7 b = 700 G= 90 Asymptotic performance After 250 (x100) episodes and before convergence, the blue curve ramps up much faster. When the learning converges to the optimal, different cases end up in different levels of ending rewards. The single_expert transfer produces the best performance (ending reward 270), whereas the multiple_expert transfer and the baseline are almost at the same reward level (ending reward 225). Figure 39: Different transfer paths – single source task vs. multiple source tasks » » Episodes (x100) Reward 91 4.3 Reward tuning To test different reward settings, four random static obstacles are generated randomly within the dashed rectangle area at the beginning of every episode. In Pygame, the window size is 400*400. The green agent (30*20 size) is trying to reach the its goal of orange area. Red circles represent static obstacles with a radius of 7 pixels (Figure 40). When the centers of green agent and red obstacle are less than 30 pixels, it is considered as collision. It should be noted that the colors of the agent, obstacle and goal are only for humans to distinguish the object on the game board. The colors do not participate in the training process. So all pixel values are converted to grayscale during the preprocessing prior to the training process. Figure 40: Game setup 92 At every step, the agent stacks four previous frames together. The agent’s state is an array of pixel values of these 4 consecutive frames. The action space consists of different combinations of linear and angular velocity, as indicated in Table 4. 
4.3 Reward tuning

To test different reward settings, four static obstacles are generated at random positions within the dashed rectangle area at the beginning of every episode. In Pygame, the window size is 400 x 400 pixels. The green agent (30 x 20 in size) tries to reach its goal, the orange area, while red circles represent static obstacles with a radius of 7 pixels (Figure 40). When the centers of the green agent and a red obstacle are less than 30 pixels apart, it is considered a collision. It should be noted that the colors of the agent, obstacles and goal are only for humans to distinguish the objects on the game board; the colors do not participate in the training process, so all pixel values are converted to grayscale during the preprocessing prior to training.

Figure 40: Game setup

At every step, the agent stacks the four previous frames together, and the agent's state is the array of pixel values of these 4 consecutive frames. The action space consists of different combinations of linear and angular velocity, as indicated in Table 4. As the environment changes, the agent maintains its constant linear speed but turns into different directions according to the angular velocity associated with the chosen action.

Table 4: Action space in reward tuning case study

The size of the game window is 400 x 400 pixels. During preprocessing, the window is first scaled down to 84 x 84 pixels and then converted to grayscale, since color should not influence the agent's decision making. At each step, the agent stacks the 4 previous frames together, which are then fed into the neural network. The network structure is the same as in the original DQN paper (Mnih et al., 2013), with an 84 x 84 pixel input and an output of 7 actions. The experience replay size is 50,000. At every step, a mini-batch of size 32 is randomly sampled from the experience replay and used to update the network. All the case studies were trained using the Adam optimizer with a learning rate of 0.0001. The discount factor is set to 0.99. The agent follows an ε-greedy policy, with ε annealed from 1 to 0.01 during the first 1 million frames. The weight on the goal reward, ω_g, is 1. The weight on the deviation reward, ω_dev, is set to 0, 0.5, 1.0, 1.5, or 2.0 in this study.

Table 5: Choice of hyperparameters in reward tuning case study
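For concreteness, the preprocessing and training settings listed above can be collected in a short sketch. This is not the dissertation's code: the downsampling routine and the constant and function names are assumptions made for illustration, and only the numeric values (frame size, stack length, action count, replay size, batch size, learning rate, discount factor and ε schedule) are taken from the text.

```python
import numpy as np
from collections import deque

# Hyperparameter values as reported in this case study.
FRAME_SIZE = 84          # input scaled from 400x400 down to 84x84
STACK_LEN = 4            # number of consecutive frames stacked as the state
N_ACTIONS = 7
REPLAY_SIZE = 50_000
BATCH_SIZE = 32
LEARNING_RATE = 1e-4     # Adam
GAMMA = 0.99             # discount factor
EPS_START, EPS_END, EPS_ANNEAL_FRAMES = 1.0, 0.01, 1_000_000

def preprocess(rgb_frame):
    """Convert a 400x400x3 Pygame frame to 84x84 grayscale in [0, 1].

    A simple luminance conversion and stride-based subsampling are used here
    for illustration; the exact resizing routine is not specified in the text.
    """
    gray = rgb_frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114])
    step = gray.shape[0] // FRAME_SIZE
    small = gray[::step, ::step][:FRAME_SIZE, :FRAME_SIZE]
    return small / 255.0

def epsilon(frame_idx):
    """Linearly anneal epsilon from 1.0 to 0.01 over the first 1M frames."""
    frac = min(frame_idx / EPS_ANNEAL_FRAMES, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

# The state fed to the DQN is a stack of the 4 most recent preprocessed frames.
frames = deque(maxlen=STACK_LEN)
for _ in range(STACK_LEN):
    frames.append(preprocess(np.zeros((400, 400, 3), dtype=np.uint8)))
state = np.stack(frames, axis=-1)               # shape (84, 84, 4)
print(state.shape, round(epsilon(500_000), 3))
```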
In this section, the reinforcement learning agent's performance is evaluated by the following measures:
• Learning speed: the time for the reward to reach a given level.
• Success rate: the percentage of runs in which the agent reaches the goal without hitting any obstacle.
• Average time of reaching the goal: the average time of reaching the goal position over all successful simulation runs.

Figure 41 shows the average reward in each case. The x-axis is the number of episodes into the training process and the y-axis is the average reward; the different colored lines correspond to different deviation weights. The case with deviation weight ω_dev = 0, i.e., with Eq. 16 used as the reward function, is treated as the baseline. From Figure 41, it can be seen that adding a small (negative) deviation reward can boost learning speed. The reward plots of ω_dev = 0.5, 1.0, 1.5 (orange, green and red curves) ramp up much faster than the baseline: at 100 (x100) episodes the reward is about three times (3x) higher, and at 150 (x100) episodes about one and a half times (1.5x) higher. Since the shaping rewards in these cases are all different, it is not suitable to compare the agents' average rewards after convergence; in fact, they all converge to a similar range of reward (100~150).

Figure 41: Reward plots for different deviation weights

To further evaluate the agent's learned behavior in the different cases, 500 random tests were carried out for each ω_dev setting, and the success rate and the average time of reaching the goal were recorded. Figure 42 summarizes the average goal-reaching time and the success rate of the different cases with varying deviation weights. The baseline is again ω_dev = 0, and all the average goal-reaching times have been converted to percentages relative to the baseline. As can be seen, after adding the deviation reward (penalty) to the reward function, the agent is encouraged to find a shorter path to the goal. Compared with the baseline, it saves 5%, 14%, 11% and 11% of the time, and hence energy, in the cases where ω_dev = 0.5, 1.0, 1.5, 2.0, respectively. In terms of the average goal-reaching time, ω_dev = 1.0 turns out to be the best, spending only 86% of the baseline time. However, in all these cases, the success rate drops slightly. Moreover, if ω_dev keeps increasing to 1.5 or 2.0, the success rate drops to 0.91 and 0.89, and the average time degrades compared to the smaller-weight cases.

Figure 42: Average goal-reaching time under different deviation weights
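The two behavioral statistics summarized in Figure 42 can be estimated with a small helper. The sketch below assumes a hypothetical run_episode() rollout that reports whether the goal was reached and how many steps the episode took; as described above, the average goal-reaching time is computed over successful runs only.

```python
import random

def evaluate(run_episode, n_tests=500):
    """Estimate success rate and average goal-reaching time over random tests.

    run_episode() is assumed to return (reached_goal: bool, steps: int) for one
    randomly initialized test episode.
    """
    successes, times = 0, []
    for _ in range(n_tests):
        reached_goal, steps = run_episode()
        if reached_goal:
            successes += 1
            times.append(steps)
    success_rate = successes / n_tests
    avg_time = sum(times) / len(times) if times else float("nan")
    return success_rate, avg_time

# Hypothetical usage with a stub rollout standing in for the trained agent.
random.seed(0)
print(evaluate(lambda: (random.random() < 0.9, random.randint(60, 120))))
```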
The simulation results of the baseline case (using the reward function of Eq. 16) and the various tuning cases (using the reward function of Eq. 17 with varying ω_dev) have demonstrated the impact of the reward function on the agent's learning performance and working behavior. They also show that both the learning process and the learned behavior are sensitive to small changes in the reward function's parameter values within certain value ranges. The following insights are drawn from the results.

• Ending reward and interim reward: The baseline reward function (Eq. 16) is a typical, widely used reward function. Its major feature is two important step-function-like "ending rewards" (or terminal rewards): +200 for reaching the goal and -200 for hitting an obstacle, signifying the two opposite endings of an episode. The "-1" in Eq. 16 and the added ω_g·r_g + ω_dev·r_dev in Eq. 17 are "interim rewards" devised to guide the agent towards its ending point. A close look at the working behavior of the agent trained with Eq. 16 reveals that the agent always tries to avoid the whole "congested" water area and takes the side ways to reach the goal, no matter how the obstacles in the area are positioned, as shown in Figure 43. After adding the penalty for horizontal deviation, ω_dev·r_dev, the agent's behavior changed: it seeks both the middle opportunities and the safe side ways. The added negative reward encouraged the agent to explore the middle way more, despite the collision potential. The result is that the average goal-reaching time is shortened, but the success rate suffers a little as well. The interim reward provides an effective means to "design" the agent's behavior and to embed rules, regulations and heuristic knowledge. However, there is often a trade-off in reward design that must be addressed.

Figure 43: Different learned strategies

• Sparse reward vs. shaped reward: Reward functions with only "ending reward" terms are sparse. Although the baseline function (Eq. 16) has "-1" as its interim reward, it is sparser than function (Eq. 17), which has more interim terms. Sparse rewards slow down learning because the agent needs to take many actions before getting any reward. To avoid sparse rewards, interim reward terms are used to shape the reward function, hence the name shaping rewards. As shown in Figure 41, the shaped reward function (Eq. 17), with more shaping terms, leads to better learning speed, thanks to the reward gradient created by the shaping terms. When the gradient's effect expands to the point where the reward field is distorted, in this case when the success rate falls significantly through the 0.9 threshold as shown in Figure 42, the learning speed drops, as can be seen for ω_dev = 2.0 in Figure 41. There seems to be an upper limit for shaping the reward function, determined by the 0.9 threshold of the success rate.

• Positive reward vs. negative reward: It is relatively straightforward to set positive and negative ending rewards because they signify the wanted and unwanted ending results of an episode. The effect of interim rewards (i.e., shaping rewards), however, can be highly different depending on whether the reward is positive or negative. In addition to encouraging the agent to behave in a certain way, a positive interim reward invites the agent to stay longer in the game (i.e., the episode) to accumulate rewards, leading to a slower learning process. On the other hand, the added effect of negative interim rewards is that the agent tries to reach the end as quickly as possible to avoid accumulating penalties, promoting a faster learning process. Figure 41 shows the effect of the negative horizontal deviation reward on the learning speed. Again, when the success rate falls significantly with ω_dev = 2.0, as shown in Figure 42, the learning speed advantage of negative rewards diminishes, and the opposite occurs.
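The contrast drawn in these insights can be made concrete with a sketch of the two reward structures. Since Eq. 16 and Eq. 17 are not reproduced in this transcript, the interim terms below (goal_progress, lateral_deviation) are illustrative stand-ins for r_g and r_dev rather than the dissertation's exact definitions; only the terminal rewards, the -1 step cost and the weights ω_g and ω_dev follow the text.

```python
def reward_baseline(reached_goal, collided):
    """Eq. 16 style baseline: terminal rewards of +/-200 plus a -1 step cost."""
    if reached_goal:
        return 200.0
    if collided:
        return -200.0
    return -1.0

def reward_shaped(reached_goal, collided, goal_progress, lateral_deviation,
                  w_g=1.0, w_dev=1.0):
    """Eq. 17 style shaped reward (a sketch).

    goal_progress stands in for r_g (e.g., the decrease in distance to the goal
    this step) and lateral_deviation stands in for r_dev (the horizontal offset
    from the direct route, penalized via the minus sign); applying the interim
    terms only on non-terminal steps is also an assumption.
    """
    r = reward_baseline(reached_goal, collided)
    if not (reached_goal or collided):
        r += w_g * goal_progress - w_dev * lateral_deviation
    return r

# Example: an interim step that gains 2 px towards the goal while drifting 1 px sideways.
print(reward_shaped(False, False, goal_progress=2.0, lateral_deviation=1.0, w_dev=0.5))
```

In this form, raising w_dev strengthens the gradient that pulls the agent back towards the direct route, which mirrors the behavior change observed in Figure 43.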
4.4 Summary of findings

By considering machine intelligence as the capabilities of exploitation and exploration together with adaptation, we developed a transfer reinforcement learning approach that can be tuned to exploit the past experience of human experts and other robots and to explore the new domain through deep reinforcement learning. Through the simulation-based case studies, we have obtained some useful insights for machine learning based collision avoidance and for our future work in developing integrated intelligent and learning machines.

In this section, the proposed transfer reinforcement learning (TRL) approach has been tested in a game environment and proved useful for solving similar complex collision avoidance tasks. The TRL framework was first tested under different levels of inter-task similarity: high, medium, and low. Different transfer paths and reward tuning were also investigated. The following is a brief summary of our findings.

1. High inter-task similarity
• In our case studies, the transfer period is a crucial component that needs to be adjusted. The proposed transfer learning scheme has two effects, on learning speed and on variance. Compared to the bootstrap baseline case, the copy expert strategy performed better. The agent learns to solve the target task faster than both bootstrap and copy expert in most transfer learning instances. As the transfer period increases, the learning speed increases. It is worth noting that a transfer period that is too long will slow down the learning, though it remains faster than the baselines.
• The standard deviation plot shows that the variance first decreases, then increases, and finally decreases as learning converges. The longer the transfer period, the earlier the variance starts to increase. As learning proceeds, either a short or a long transfer period leads to high variance, whereas a medium transfer period has low variance.
• There exists an optimal length of the transfer period at which the variance is low and learning is fast. This optimal transfer period is believed to be task-dependent, related to the inter-task similarity of the source and target tasks.

2. Low inter-task similarity
• As designers, we need to carefully choose the transfer belief and transfer period. Most of the time transfer learning will boost learning in a new target task; however, some bad choices of transfer belief and transfer period can bring negative transfer.
• A set of transfer belief and transfer period that works well in a certain target task might not work as well in another target task that has a different similarity. In fact, blindly sticking with a fixed transfer belief and transfer period is more likely to cause negative transfer: it either slows down the learning process or increases the learning variance.
• If two tasks have low similarity, it is better to decrease the initial transfer belief and keep a relatively longer transfer period, for two reasons: first, expert experience from the source task does not help the agent as much in the target task; second, a transfer period that is too short makes the learning vary a lot among different student networks, and many students perform even worse than the baseline without transfer.
• Compared with a lower initial transfer belief (β0 = 0.5), a higher initial transfer belief (β0 = 0.9) has a stronger jump-start effect and keeps the variance rather small during the early stage.

3. Medium inter-task similarity
• Transfer belief β0 = 0.7 gives the best performance. The learning speed is much faster and the learning variance is much smaller than the baseline.
• Compared with the high and low similarity cases, the optimal transfer belief is β0 = 0.7, indicating that as the inter-task similarity increases, the optimal value of the transfer belief decreases, which means that the agent needs to trust the expert experience less and focus more on exploring the environment by itself.

4. Transfer paths
• In the multiple_expert transfer case, it is crucial to create a conflict resolution mechanism among multiple experts, i.e., to determine an expert priority for each expert. In this research, given the limited complexity of the collision avoidance tasks considered, absolute expert priority is given to the source task that contains only two moving obstacles.
• The multiple_expert transfer has a huge jump-start effect that fades away as learning proceeds. By giving absolute expert priority to the obstacle_only task, the agent is able to reach a relatively good performance in the early stage, much better than the baseline (without transfer), but not good enough to deal with all kinds of different scenarios in the target task. As a consequence, the ending performance is the same as the baseline case.
• Even though the single_expert transfer does not differ much from the baseline in the early learning stage, a huge asymptotic improvement is observed in the long term. By following the suggestions from a single related source task, the agent is able to explore the environment more effectively and accumulate more valuable knowledge to deal with the complexity of the target task.
• Many insights can be drawn from this case study from an education perspective. Compared with other machine learning algorithms, reinforcement learning algorithms most resemble the human learning process, by learning from past experience, whether success or failure.
• Learning variance is a common phenomenon in RL, i.e., running the same RL algorithm multiple times can result in different levels of agent performance. It has been revealed that single_expert transfer tends to produce better learning performance in the long term. But in the real world, most people start with multiple_expert transfer, by learning from teachers in different subjects (math, physics, chemistry, biology, etc.). And it is quite common that, when given an inter-disciplinary task, people (students) struggle to integrate different sources of expert knowledge (teachers).
Another path is to follow a single supervisor who is specialized in a specific domain related to the complex task the student is trying to solve; in that case, the student could end up with much better performance.

5. Reward tuning
• Reinforcement learning can be a feasible way to solve autonomous ship collision avoidance problems and to capture useful knowledge, provided that two key requirements are satisfied: 1) the simulation environment reflects the real world closely, and 2) the behavior design can be carried out.
• The reward function is a useful instrument for designing agent behaviors through reinforcement learning. It is important to understand the effect of ending rewards (aka terminal rewards) and interim rewards (aka shaping rewards). The former is decided by the goal of the task at hand, and the latter by careful tuning for desired agent behaviors.
• Designing interim rewards is an effective way to tune the agent's behavior. However, the designer must understand the state and action space, recognize the sparsity and shape of the reward function, and be aware of the pros and cons of positive and negative rewards.
• There is often a valid range for tuning an interim reward parameter, from 0.0 to its maximum. The "maximum" is usually determined by a certain threshold of the failure rate of the agent's task.

5 Contributions and Future Work

5.1 Contributions

This research is inspired by self-organizing systems design, with the aim of adding learning capabilities to the agent, a crucial component in an unknown and dynamic environment. Specifically, this research is set within the collision avoidance context and is built upon one category of machine learning algorithms, deep reinforcement learning (DRL), which has achieved tremendous success in training computer agents to play video games. The DRL algorithm is able to capture the knowledge in a neural network that outputs a value for each possible action. The input to the neural network is directly the image (an array of pixel values), which is the same as what a human sees when making decisions.

The main focus and contribution of this research is to develop a transfer reinforcement learning (TRL) approach that is able to utilize previously learned knowledge. The TRL approach is tested in a customized game for collision avoidance tasks, where the sensory data can be easily converted to an image, and the agent dynamics and environment dynamics can be easily modified. It is shown that TRL can boost learning speed in a complex task. Two concepts are introduced in the TRL approach: transfer belief and transfer period. Transfer belief measures the probability of the agent picking actions that have high values calculated by the expert network. Transfer period is the length of time during which the agent's decision making is influenced by the expert network. In general, the actions favored by the expert network may or may not help the agent perform the target task better, due to the difference between the source task and the target task. But when the source task and target task are similar, the learned features and policy from the source task can help the agent make better choices so that it does not have to extensively try every action. By altering the length of the transfer period, the designer can control how long the agent's decision policy is influenced by the expert network. With a small transfer period, the agent only receives guidance from the expert in the beginning and then explores the environment by itself after the transfer period, which does not differ much from direct learning without transfer. Given a larger transfer period, the agent will continuously extract knowledge from the expert network and pick actions based on it.

The complexity of collision avoidance tasks is categorized into internal complexity and external complexity. The internal complexity is caused by the dynamics of the agent, or vehicle dynamics, which may include the linear/angular velocity, the agent's mass, shape, etc. The external complexity is further divided into individual complexity and aggregate complexity. The individual complexity refers to the obstacle dynamics (such as moving direction, constant or random speed, whether there is communication between the agent and the obstacle, etc.), and the aggregate complexity arises when the number of obstacles increases.
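Subsection 3.6 defines inter-task similarity over these complexity dimensions. The sketch below shows one plausible weighted formulation, using the relative weights λ1, λ2 and λ3 whose calibration is left to future work in Section 5.2; the linear combination and the per-dimension scores are assumptions made for illustration, not the dissertation's actual measure.

```python
def inter_task_similarity(s_obstacle_dynamics, s_num_obstacles, s_vehicle_dynamics,
                          l1=1/3, l2=1/3, l3=1/3):
    """One plausible weighted inter-task similarity (illustrative only).

    Each s_* is a per-dimension similarity score in [0, 1] for obstacle dynamics,
    number of obstacles, and vehicle dynamics; l1, l2, l3 are the relative weights
    discussed in Section 5.2.
    """
    total = l1 + l2 + l3
    return (l1 * s_obstacle_dynamics + l2 * s_num_obstacles + l3 * s_vehicle_dynamics) / total

# Example: same vehicle, same obstacle dynamics, different number of obstacles.
print(round(inter_task_similarity(1.0, 0.5, 1.0), 3))
```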
The case studies in this research reveal the effects of transfer reinforcement learning, specifically of the two parameters, transfer belief and transfer period. If these are not carefully designed, negative transfer is likely to occur. In this research, negative transfer is identified as large learning variance, i.e., some students perform well while other students perform much worse. It is found that there exist an optimal transfer belief and transfer period; under their optimal values, the learning variance is small and the learning speed is faster than without transfer, whereas other values of the transfer belief or transfer period result in relatively larger learning variance. As the inter-task similarity increases, the optimal value of the transfer belief decreases.

Different transfer paths have also been investigated. In particular, if the agent is given experience from multiple experts, expert priority has been proposed to resolve the conflicts among them. The results show that single_expert transfer can boost the agent's asymptotic performance in the long term, whereas the multiple_expert transfer only has a jump-start effect that disappears as learning proceeds.

In this research, the reward function is also investigated in detail for ship agent collision avoidance behavior design. In general, an autonomous agent aims to reach a goal or waypoint in the shortest time in order to save energy, while keeping a certain safe distance from other obstacles to avoid collision. A deviation reward is added to shape the reward function and guide the agent's behaviors. Various case studies have been conducted with different weights on the deviation reward to explore the effect of reward functions.

5.2 Future work

The first line of future work is to determine the relative weights of the different complexity dimensions. As discussed in Subsection 3.6, more case studies are needed to determine the sensitivity of inter-task similarity to each of the three criteria: obstacle dynamics, number of obstacles, and vehicle dynamics. By varying the values of the transfer belief and transfer period, the optimal choices of these parameters can be found through empirical results, which will in turn help calibrate the calculation of inter-task similarity, i.e., the relative weights λ1, λ2, λ3.

Another line of research is to focus on different transfer paths. Preliminary results have shown that single_expert transfer outperforms multiple_expert transfer, given that the single source task and the target task are quite similar.
However, more research needs to be done to conclude whether this can be generalized to other transfers where the source and target tasks share different levels of inter-task similarity. Additionally, designing a conflict resolution mechanism is always a crucial part of multiple_expert transfer. There are many other ways of determining expert priority, as discussed in Subsection 3.4, and it remains unknown how inter-task similarity and expert priority influence each other. Last but not least, the reward function can be designed more carefully. In a more complex task, other factors might need to be included, and more experiments are needed to find the optimal weight for each factor.

5.3 Final remarks

The results of this dissertation have been presented at the DCC 2018 conference and the ASME IDETC/CIE 2018 conference, and have been submitted to the AIEDAM Special Issue on Design Computing and Cognition. This shows that this line of research on developing transfer reinforcement learning mechanisms has caught the interest of researchers in various fields, and the results of this research could potentially have a large impact on the machine learning and design communities.

Works cited
1. Alonso-Mora, J., Breitenmoser, A., Rufli, M., Beardsley, P., & Siegwart, R. (2013). Optimal reciprocal collision avoidance for multiple non-holonomic robots. In Distributed Autonomous Robotic Systems (pp. 203-216). Springer, Berlin, Heidelberg.
2. Andrychowicz, M. et al. (2017). Hindsight experience replay. arXiv:1707.01495 [cs.LG].
3. Arnold, A., Nallapati, R., & Cohen, W. W. (2007, October). A comparative study of methods for transductive transfer learning. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on (pp. 77-82). IEEE.
4. Bahadori, M. T., Liu, Y., & Zhang, D. (2014). A general framework for scalable transductive transfer learning. Knowledge and Information Systems, 38(1), 61-83.
5. Bojarski, M. et al. (2016). End to end learning for self-driving cars. arXiv:1604.07316 [cs.CV].
6. Brunn, P. (1996). Robot collision avoidance. Industrial Robot: An International Journal, 23(1), 27-33.
7. Casanova, D., Tardioli, C., & Lemaître, A. (2014). Space debris collision avoidance using a three-filter sequence. Monthly Notices of the Royal Astronomical Society, 442(4), 3235-3242.
8. Chen, C. (2012). Building cellular self-organizing system (CSO): a behavior regulation based approach. University of Southern California.
9. Chen, C., & Jin, Y. (2011, January). A behavior based approach to cellular self-organizing systems design. In ASME 2011 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (pp. 95-107). American Society of Mechanical Engineers.
10. Chiang, W. (2012). A meta-interaction model for designing cellular self-organizing systems. University of Southern California.
11. Chiang, W., & Jin, Y. (2011, January). Toward a meta-model of behavioral interaction for designing complex adaptive systems. In ASME 2011 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (pp. 1077-1088). American Society of Mechanical Engineers.
12. Chiang, W., & Jin, Y. (2012, August). Design of cellular self-organizing systems. In ASME 2012 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (pp. 511-521). American Society of Mechanical Engineers.
13. Chen, J. X. (2016). The evolution of computing: AlphaGo. Computing in Science & Engineering, 18(4), 4-7.
14. Churchland, P. S., & Sejnowski, T. J. (2016). The computational brain. MIT Press.
15. Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., & Andrew, N. (2013, February). Deep learning with COTS HPC systems. In International Conference on Machine Learning (pp. 1337-1345).
16. Chen, Y. F., Liu, M., Everett, M., & How, J. P. (2017, May). Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on (pp. 285-292). IEEE.
17. Daniel, K., Nash, A., Koenig, S., & Felner, A. (2010). Theta*: Any-angle path planning on grids. Journal of Artificial Intelligence Research, 39, 533-579.
18. Dean, J. et al. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems (pp. 1223-1231).
19. Dieleman, S., & Schrauwen, B. (2014, May). End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 6964-6968). IEEE.
20. Ding, Z., Nasrabadi, N. M., & Fu, Y. (2016, March). Task-driven deep transfer learning for image classification. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on (pp. 2414-2418). IEEE.
21. Eleftheria, E., Apostolos, P., & Markos, V. (2016). Statistical analysis of ship accidents and review of safety level. Safety Science, 85, 282-292.
22. Fahimi, F., Nataraj, C., & Ashrafiuon, H. (2009). Real-time obstacle avoidance for multiple mobile robots. Robotica, 27(2), 189-198.
23. Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (pp. 720-727). ACM.
24. Frommberger, L. (2008). Learning to behave in space: A qualitative spatial representation for robot navigation with reinforcement learning. International Journal on Artificial Intelligence Tools, 17(03), 465-482.
25. Fujii, T., Arai, Y., Asama, H., & Endo, I. (1998, May). Multilayered reinforcement learning for complicated collision avoidance problems. In Robotics and Automation, 1998. Proceedings. 1998 IEEE International Conference on (Vol. 3, pp. 2186-2191). IEEE.
26. Zouein, G., Chen, C., & Jin, Y. (2011). Create adaptive systems through "DNA" guided cellular formation. In Design Creativity 2010 (pp. 149-156). Springer, London.
27. Goerlandt, F., & Kujala, P. (2014). On the reliability and validity of ship–ship collision risk analysis in light of different perspectives on risk. Safety Science, 62, 348-365.
28. Hameed, S., & Hasan, O. (2016, May). Towards autonomous collision avoidance in surgical robots using image segmentation and genetic algorithms. In Region 10 Symposium (TENSYMP), 2016 IEEE (pp. 266-270). IEEE.
29. Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100-107.
30. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
31. Hourtash, A. M., Hingwe, P., Schena, B. M., & Devengenzo, R. L. (2016). U.S. Patent No. 9,492,235. Washington, DC: U.S. Patent and Trademark Office.
32. Jin, Y., Liu, X., & Williams, E. (2018). Intelligent situation assessment and collision avoidance – Report 1. Report submitted to MTI.
33. Karaman, S., & Frazzoli, E. (2010). Incremental sampling-based algorithms for optimal motion planning. Robotics Science and Systems VI, 104, 2.
34. Keller, J., Thakur, D., Gallier, J., & Kumar, V. (2016, June). Obstacle avoidance and path intersection validation for UAS: A B-spline approach. In Unmanned Aircraft Systems (ICUAS), 2016 International Conference on (pp. 420-429). IEEE.
35. Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. In Autonomous Robot Vehicles (pp. 396-404). Springer, New York, NY.
36. Koenig, S., & Likhachev, M. (2005). Fast replanning for navigation in unknown terrain. IEEE Transactions on Robotics, 21(3), 354-363.
37. Koyama, T., & Yan, J. (1987). On the design of the marine traffic control system. Journal of the Society of Naval Architects of Japan, 1987(162), 183-192.
38. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
39. Kuderer, M., Gulati, S., & Burgard, W. (2015, May). Learning driving styles for autonomous vehicles from demonstration. IEEE International Conference on Robotics and Automation (pp. 2641-2646).
40. Le, Q. V. (2013, May). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 8595-8598). IEEE.
41. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541-551.
42. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
43. Liu, X., & Jin, Y. (2018). Design of transfer reinforcement learning mechanisms for autonomous collision avoidance. In Design Computing and Cognition '18.
44. Liu, X., & Jin, Y. (2018). Design of transfer reinforcement learning under low task similarity. In ASME 2018 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. American Society of Mechanical Engineers.
45. Machado, T., Malheiro, T., Monteiro, S., Erlhagen, W., & Bicho, E. (2016, May). Multi-constrained joint transportation tasks by teams of autonomous mobile robots using a dynamical systems approach. In Robotics and Automation (ICRA), 2016 IEEE International Conference on (pp. 3111-3117). IEEE.
46. March, J. G. (1991). Exploration and exploitation in organizational learning. Organization Science, 2(1), 71-87.
47. Mastellone, S., Stipanović, D. M., Graunke, C. R., Intlekofer, K. A., & Spong, M. W. (2008). Formation control and collision avoidance for multi-agent non-holonomic systems: Theory and experiments. The International Journal of Robotics Research, 27(1), 107-126.
48. Matarić, M. J. (1997). Reinforcement learning in the multi-robot domain. In Robot Colonies (pp. 73-83). Springer, Boston, MA.
49. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
50. Mukhtar, A., Xia, L., & Tang, T. B. (2015). Vehicle detection techniques for collision avoidance systems: A review. IEEE Transactions on Intelligent Transportation Systems, 16(5), 2318-2338.
51. Ohn-Bar, E., & Trivedi, M. M. (2016). Looking at humans in the age of self-driving and highly automated vehicles. IEEE Transactions on Intelligent Vehicles, 1(1), 90-104.
52. Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
53. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
54. Shiomi, M., Zanlungo, F., Hayashi, K., & Kanda, T. (2014). Towards a socially acceptable collision avoidance for a mobile robot navigating among pedestrians using a pedestrian model. International Journal of Social Robotics, 6(3), 443-455.
55. Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
56. Stentz, A. (1994, May). Optimal and efficient path planning for partially-known environments. In Robotics and Automation, 1994. Proceedings., 1994 IEEE International Conference on (pp. 3310-3317). IEEE.
57. Stentz, A. (1995, August). The focussed D* algorithm for real-time replanning. In IJCAI (Vol. 95, pp. 1652-1659).
58. Taylor, M. E., & Stone, P. (2007, June). Cross-domain transfer for reinforcement learning. In Proceedings of the 24th International Conference on Machine Learning (pp. 879-886). ACM.
59. Torrey, L., Shavlik, J., Walker, T., & Maclin, R. (2006, September). Skill acquisition via transfer learning and advice taking. In European Conference on Machine Learning (pp. 425-436). Springer, Berlin, Heidelberg.
60. Van Hasselt, H., Guez, A., & Silver, D. (2016, February). Deep reinforcement learning with double Q-learning. In AAAI (Vol. 16, pp. 2094-2100).
61. Wang, F. Y., Zhang, J. J., Zheng, X., Wang, X., Yuan, Y., Dai, X., ... & Yang, L. (2016). Where does AlphaGo go: from Church-Turing thesis to AlphaGo thesis and beyond. IEEE/CAA Journal of Automatica Sinica, 3(2), 113-120.
62. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2015). Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.
63. Watkins, C. (1989). Learning from delayed rewards (Doctoral dissertation, King's College, Cambridge).
64. Yang, I. B., Na, S. G., & Heo, H. (2017). Intelligent algorithm based on support vector data description for automotive collision avoidance system. International Journal of Automotive Technology, 18(1), 69-77.
65. Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (pp. 3320-3328).
66. Zou, X., Alexander, R., & McDermid, J. (2016, June). On the validation of a UAV collision avoidance system developed by model-based optimization: Challenges and a tentative partial solution. In Dependable Systems and Networks Workshop, 2016 46th Annual IEEE/IFIP International Conference on (pp. 192-199). IEEE.