ROBOT LIFE-LONG TASK LEARNING FROM HUMAN DEMONSTRATIONS: A BAYESIAN APPROACH

by Nathan Koenig

Ph.D. Dissertation, May 2013

Guidance Committee:
Maja Matarić (Chairperson)
Gaurav Sukhatme
Rahul Jain (External Member)

Table of Contents

List of Figures
List of Algorithms
Abstract

Chapter 1: Introduction
  1.1 Knowledge Transfer
  1.2 Human-Robot Communication
  1.3 Life-Long Robot Learning
  1.4 Dissertation Contributions
  1.5 Outline

Chapter 2: Background and Related Work
  2.1 Manual Programming
  2.2 Task-Level Robot Control
  2.3 Learning from Demonstration
    2.3.1 Demonstration Approaches
    2.3.2 Policy Generation
  2.4 Life-Long Learning
  2.5 Summary

Chapter 3: Task Representation
  3.1 Task Definition
    3.1.1 Features and Actions
    3.1.2 Range of Tasks
  3.2 Influence Diagram Representation
    3.2.1 Learning
    3.2.2 Influence Diagram Updating
    3.2.3 Hierarchical Influence Diagrams
  3.3 Complexity
  3.4 Limitations of the Approach
  3.5 Summary

Chapter 4: Life-long Robot Learning
  4.1 L3D Approach
    4.1.1 Parametrized Tasks
    4.1.2 Hierarchical Parametrized Tasks
  4.2 Error Handling
    4.2.1 During Teaching
    4.2.2 During Task Execution
  4.3 Summary

Chapter 5: Teaching Interfaces
  5.1 Components
  5.2 Role of the Instructor
  5.3 Role of the Student
  5.4 Human to Robot Knowledge Transfer
    5.4.1 Human-Robot Communication
  5.5 Graphical Interface: First Generation
  5.6 Graphical Interface: Second Generation
  5.7 Summary

Chapter 6: Experimental Validation
  6.1 Study 1: Visibility During Teaching
    6.1.1 Experimental Design
    6.1.2 Methods
    6.1.3 Results
    6.1.4 Solving Towers of Hanoi
    6.1.5 Discussion
  6.2 Study 2: Improved Communication Using Sounds
    6.2.1 Experimental Design
    6.2.2 Methods
    6.2.3 Results
    6.2.4 Discussion
  6.3 Study 3: Simulation-Based Teaching
    6.3.1 Experimental Design
    6.3.2 Results
    6.3.3 Discussion
  6.4 Study 4: Skill Transfer
    6.4.1 Experimental Design
    6.4.2 Results
    6.4.3 Discussion
  6.5 Summary

Chapter 7: Summary
  7.1 Life-Long Learning
  7.2 Influence Diagrams
  7.3 Simulation-based Teaching
  7.4 Limitations
  7.5 Future Work
    7.5.1 Task Classification
    7.5.2 On-line Learning

Bibliography

List of Figures

1.1 A selection of mobile manipulators: (a) PR2 from Willow Garage, (b) HERB from Intel Research, and (c) Rosie from TUM.
1.2 The flow of information in four methods of robot programming. The manual programming approach (a) relies solely on the designer; the expert system approach (b) uses an external expert as the source of knowledge; the imitation learning approach (c) generates a control policy strictly from observations; and the learning from demonstration approach (d) utilizes both knowledge from an expert and observations made by the robot.
3.1 Venn diagram of the components of Bayesian and Decision networks.
3.2 Range of tasks suitable for representation by influence diagrams (IDs), indicated by the green section. The two red sections highlight other areas of research focused on learning from demonstration. While IDs do not cover the entire task space, they do offer a method to represent tasks that have received little attention from the LfD community to date.
3.3 A simple influence diagram converted to a Bayesian network. The value and decision nodes of an influence diagram are combined to form a special chance node depicted by a rectangle with rounded corners.
3.4 Hierarchical structuring of influence diagrams, where tasks reference other self-contained tasks. An influence diagram representation of a task may be referenced multiple times from different diagrams.
4.1 Collaboration diagram that shows how a single command is sent to the robot.
4.2 Parametrization of a fetch task, including the four learning steps that produce the final influence diagram (ID). Step (a) is the initial ID, (b) the ID converted to a Bayes network, (c) the contents of the ID: the structure and parameters following structural EM, and (d) the final ID converted back from the Bayes network format.
4.3 Instances of action series of various length from a demonstration set involving 31 instructors who each provided on average 77 instructions.
5.1 Learning from demonstration components. In general, multiple demonstrations, possibly from many different instructors, are required before a robot can reproduce a task.
5.2 Complete framework for learning from demonstrations. Feedback from the robot to the instructor is an optional component. Instructors can provide more demonstrations to fix any ambiguities or errors in the learned task model.
5.3 The language used by instructors to pass teaching commands to a robot student.
5.4 The web-based survey interface used by participants in the Mechanical Turk audio cue validation study. An audio file is played, and the participant must select what they believe the sound expresses.
5.5 Results from the Mechanical Turk sound survey. Each plot shows the expression histogram for the (a) success, (b) error, and (c) acknowledge sound.
5.6 The web-based GUI designed to facilitate LfD. Each box is a self-contained widget that can be customized. Additional widgets are easily added, which allows for greater flexibility.
5.7 Simulated teaching interface. 1. Robot camera view, 2. Contents of saucepan, 3. Contents of bowl, 4. Left gripper, 5. Right gripper, 6. Robot feedback, 7. Instruction interface, 8. Change view left/right, 9. Demonstration complete.
6.1 The (a) physical and (b) simulator visualization versions of the Willow Garage PR2 robot and the Towers of Hanoi puzzle. The colored balls in (b) indicate the positions of the three disks and the pegs.
6.2 Nokia N810 Internet tablet displaying the Towers of Hanoi GUI.
6.3 Box plots of the duration times and command counts between collocated and separated instructors.
6.4 Enumeration of the possible states for the three-disk version of Towers of Hanoi. The optimal solution path is shown in green. Two alternative solutions, given during demonstrations, are shown in teal and orange dashed lines.
6.5 The (a) Bayesian network and (b) influence diagram learned from demonstration data for the Towers of Hanoi puzzle.
6.6 PR2 in the simulated box sorting environment. The PR2 is shown at a distance from the table in order to make the bins clearly visible.
6.7 A kit sheet that describes what boxes belong in each bin.
6.8 Interaction plots of sounds and robot visibility on (a) the number of bad commands, and (b) the total number of commands.
6.9 Simulated kitchen environment with ingredients and utensils necessary to cook risotto.
6.10 Mushroom risotto recipe used in the study.
6.11 The first twelve instructions from all the teachers. Each edge is labeled with the number of occurrences. The highlighted path is the set of instructions most instructors provided to the robot.
6.12 Abbreviated hierarchical influence diagrams for the risotto task. Each bubble indicates the action taken by the robot. The conditional probability and value tables have been removed for clarity.
6.13 The kitchen environment and PR2 executing the risotto task. Cups were used as ingredients to reduce grasping errors and simplify object detection. The stove and sauce pan were also replaced with simpler objects. Additionally, the knife was protected with wood to prevent damage to the robot and people.
6.14 Simulated environment that contains a typical kitchen and dining room.

List of Algorithms

3.1 Structural learning algorithm for influence diagrams
4.1 Influence diagram parameter updating
4.2 Influence diagram matching
4.3 Influence Diagram Identification

Abstract

Programming a robot to act intelligently is a challenging endeavor that is beyond the skill level of most people. Trained roboticists generally program robots for a single purpose. Enabling robots to be programmed by non-experts and to perform multiple tasks are both open challenges in robotics. The contributions of this work include a framework that allows a robot to learn tasks from demonstrations over the course of its functional lifetime, a task representation that uses Bayesian decision networks, and a method to transfer knowledge between similar tasks. The demonstration framework allows non-experts to demonstrate tasks to the robot in an intuitive manner.

In this work, tasks are complex time-extended decision processes that make use of a set of predefined basis behaviors for actuator control. Demonstrations from an instructor provide the necessary information for the robot to learn a control policy. An instructor guides the robot through a demonstration using a graphical interface that displays information from the robot and provides an intuitive action-object pairing mechanism to issue commands to the robot.

Each task is represented by an influence diagram, a generalization of Bayesian networks. The networks are human readable, compact, and have a simple refinement process. They are not subject to an exponential growth in states or in branches, and can be combined hierarchically, allowing for complex task models.
Data from task demonstrations are used to learn the structure and utility functions of an influence diagram. A score-based learning algorithm is used to search through potential networks in order to find an optimal structure.

Both the means by which demonstrations are provided to the robot and the learned tasks are validated. Different communication modalities and environmental factors are analyzed in a set of user studies. The studies feature both engineer and non-engineer users instructing the Willow Garage PR2 on four tasks: Towers of Hanoi, box sorting, cooking risotto, and table setting. The results validate that the approach enables the robot to learn complex tasks from a variety of teachers, refine those tasks during on-line performance, successfully complete the tasks in different environments, and transfer knowledge from one task to another.

Chapter 1: Introduction

This chapter provides an overview of life-long learning of tasks from human demonstrations, and motivates our approach to user-friendly human-robot knowledge transfer. We introduce our approach to task representation, life-long learning, and human-robot communication. The contributions and outline of the dissertation are given at the end of this chapter.

People are an extremely intelligent and adaptable species. Every day we gain new experiences and knowledge through interactions in various contexts. Our sophisticated brains allow us to filter out noise, focus on relevant information, and function in a complex society. Some of these skills we are born with [Bremner, 1994], some are acquired through formal education, and a large portion is gained through informal learning [Cross, 2006].

Artificial systems start out with the knowledge and skills given to them by human designers. Although progress has been made toward achieving greater machine autonomy, we have not yet achieved the prediction that "machines will be capable, within twenty years, of doing any work a man can do," made by Herbert Simon in 1965. Forty-seven years later, most robots can perform relatively few tasks, and those only in constrained environments.

Figure 1.1: A selection of mobile manipulators: (a) PR2 from Willow Garage, (b) HERB from Intel Research, and (c) Rosie from TUM.

There are many factors that contribute to robots not living up to Simon's prediction. One of the primary factors is the difficulty associated with programming a robot to perform a task. Related is the problem of reusing knowledge in a manner that is conducive to generating ever more complex tasks without starting from square one. We tackle these issues using learning from demonstration (LfD) as a method to intuitively program a robot, and a hierarchical probabilistic task representation that lends itself to knowledge reuse.

Both LfD and the ability to reuse prior knowledge are important features in a life-long learning robot. The tasks that fit within the scope of our approach are time-extended series of behaviors, such as setting a table. We use a graphical interface that acts as the medium through which communication between a person and robot is conducted during a task demonstration. The information gathered by the robot during the demonstration is represented as an influence diagram (ID), a probabilistic representation of a decision process. Additional demonstrations utilize stored IDs to facilitate learning of complex tasks. This approach is termed life-long learning from demonstration, or L3D. Our approach is validated using four tasks: the Towers of Hanoi puzzle, box sorting, cooking risotto, and table setting.

This dissertation focuses on mobile manipulators, one or more robotic arms mounted on a mobile base, such as those in Figure 1.1, situated in home and office environments. Tasks in such environments include object retrieval, office mail delivery, house-sitting, and security monitoring. For such mobile manipulator robots to eventually become part of our everyday lives, they must be capable of meeting our needs and interacting with us in natural ways in various types of home and office environments. For that to be possible, we believe that robots must be easy to customize and able to acquire new information over the course of their lifetimes.

1.1 Knowledge Transfer

Almost all commercially produced robots today, and many research robots as well, rely on manual programming to produce useful behavior. This method, while proven through years of use, is difficult and time consuming. Trained professionals with years of experience are required to generate robot control programs.

Machine learning offers an alternative to robot programming. While implementing a learning algorithm requires significant experience, the application of the algorithm ideally requires minimal to no experience. As multi-purpose robots become commercially available, an approach to customizing robots that requires little experience is needed. Consumers will have different needs and environments in which robots should operate. Manually programming to handle all of these situations is prohibitively complex. Furthermore, end users should be able to instruct robots in a convenient, intuitive, and non-technical manner. Robots capable of being programmed and interacted with in an intuitive fashion will be highly customizable and more likely to be adopted by non-technical users. Therefore our approach combines these two notions into a life-long learning approach that uses learning by demonstration.

Figure 1.2: The flow of information in four methods of robot programming. The manual programming approach (a) relies solely on the designer; the expert system approach (b) uses an external expert as the source of knowledge; the imitation learning approach (c) generates a control policy strictly from observations; and the learning from demonstration approach (d) utilizes both knowledge from an expert and observations made by the robot.

Programming a robot can be seen as transferring knowledge from a person to a machine. Prior work on automatically generating robot control policies from human input has included expert systems [Visinsky et al., 1994, Zixing, 1988], imitation learning [Ramesh and Matarić, 2002a], and learning from demonstration [Billard et al., 2008]. These approaches have been based on insights into how the human mind works, at various levels, including mirror neurons [Rizzolatti and Sinigaglia, 2008], social cognition [Breazeal et al., 2005], and empathy [Preston and de Waal, 2002].

Imbuing our natural ability to learn from others into a robot is still well beyond the reach of current technology. To make the problem more tractable, we use an approach that takes advantage of the available complementary strengths of both humans and machines. Specifically, we utilize learning from demonstration (LfD) to facilitate human-robot knowledge transfer. An instructor provides information in an intuitive manner through demonstrations, and the robot generates a task representation based on observations of its internal state and the state of the world.

LfD relies on a human teacher to actively guide a robot student through a task, as shown in Figure 1.2d. Imitation learning and LfD are commonly used interchangeably; however, there is an important difference between the two. LfD uses a deliberate and active teacher that provides task demonstrations to a robot, while imitation learning relies on the robot's remote observations of a person who may not be purposefully acting as a teacher.

LfD is a social activity that relies on communication between the instructor and the student. To maximize learning, the instructor should understand what is going on in the mind of the student, and the student should convey this information through verbal and non-verbal communication.

Throughout our lives we give and receive demonstrations. Parents provide their children with numerous demonstrations, such as how to swing a bat, ride a bike, and tie a neck-tie. We continue to learn skills in this fashion throughout our lives. Since this practice is common, we can assume that the majority of people are capable of demonstrating tasks to another person or a robot. The robot, in turn, must perceive and understand the information, do so in a social setting, and relay information back to the teacher in a meaningful manner. In this way, by using LfD, the complexity associated with programming a robot, and the training required, are reduced. Robot programming and teaching thus become accessible to a much larger population, if robots can be made into capable students.

1.2 Human-Robot Communication

When people interact with one another, multiple forms of communication are used, including speech, gestures, and touch. In the LfD context, the choice of communication modality is especially important. The robot should be able to receive instructions in a reliable manner, communicate back necessary information, and expect the instructor to provide an accurate demonstration. Common methods of teaching a robot involve the use of tele-operation or physical manipulation. Both feature advantages and disadvantages with respect to the type and amount of communication used and the ability of an instructor to provide accurate demonstrations.

We developed an approach that uses a graphical user interface (GUI) as the primary mode of communication between an instructor and a robot. A GUI allows for straightforward and intuitive display of sensor data from the robot, which in turn allows an instructor to understand what a robot is "seeing." In addition, the GUI provides an intuitive mechanism for issuing instructions to a robot. Finally, GUIs can be customized for individual users, though such methods are beyond the scope of this dissertation.

1.3 Life-Long Robot Learning

Enabling a robot to learn a single task is the first step toward our goal of a general purpose and customizable robot discussed in Section 1.1, above. In addition, a robot must be capable of acquiring new knowledge over the course of its functional lifetime. This ability radically changes the typical role of a robot, from a single-purpose machine to a device capable of accomplishing tasks well beyond the designers' original implementation.

In order to facilitate life-long learning, a robot requires memory and the ability to efficiently and effectively recall past experiences and store new knowledge. A task learned by a robot one year ago can be just as useful today as it was then. Memory is comparatively inexpensive, so in this work we assume that forgetting is not required, if the learned information can be stored in some sparse and readily retrievable fashion. A compact and high-level representation capable of task reuse is used to store learned tasks. This allows a robot to store and access numerous tasks.

The task representation must be amenable to the LfD paradigm, capable of representing long-term tasks, easily adaptable, and hierarchical so as to facilitate combining multiple tasks. Based on a review of the existing techniques, we chose to use a Bayesian model called influence diagrams [Shachter, 1988]. Whereas a Bayesian network encodes conditional independencies between random variables, an influence diagram incorporates additional components to model decisions and their utilities. With this approach we gain the benefits of probabilistic reasoning from Bayesian networks and the ability to make decisions in one unified model. Using information from demonstrations and an off-line score-based learning algorithm, we can effectively generate an abstract representation of a task that meets all of our above-stated requirements for a life-long learning robot.

To evaluate the approach we developed, we executed four studies, each involving a different task to be learned through GUI input to the robot: the Towers of Hanoi puzzle, box sorting, cooking risotto, and table setting. Each was conducted as a controlled user study. The Towers of Hanoi study analyzed the effect visual obstruction of the robot has on LfD teaching performance, as described in Section 6.1. Box sorting incorporated audio cues to improve communication between the instructor and robot, as described in Section 6.2. The risotto task integrated the graphical teaching interface with the use of a robot simulation, to reduce instruction time and simplify error correction, as described in Section 6.3. The IDs learned from risotto were executed autonomously on a Willow Garage PR2 robot. Finally, the table setting task demonstrated the ability to reuse prior knowledge while learning a new task, as described in Section 6.4.

1.4 Dissertation Contributions

The work in this dissertation addresses a number of challenges involved in life-long learning through human demonstrations. We present a robot-independent approach for human-robot communication; a method to represent, store, and share task information; and an analysis of environmental factors that affect the teaching process.

The following are the main contributions of this dissertation:

1. Application of a Bayesian approach to task representation. Influence diagrams are used to compactly store and probabilistically reason about task information. These networks provide a convenient mechanism to share knowledge between robots, and allow for hierarchical task structuring.

2. Implementation of a robot-independent teaching tool. A graphical user interface, designed to be robot-agnostic, provides an intuitive mechanism to guide a robot through demonstrations.

3. Development of a life-long learning framework that allows a robot to continually acquire and use new information. Memory management and knowledge reuse rely on influence diagrams, and knowledge acquisition depends strongly on the communication mechanisms.

The following are secondary contributions:

1. Human-robot interaction studies, which offer insights into how people teach a robot and the types of communication necessary for knowledge transfer.

2. Development of an open-source library for Bayesian networks and influence diagrams. This library, called Engine9 (available at https://launchpad.net/engine9), contains algorithms for Bayesian network and influence diagram inference and learning.

3. Development of Gazebo, an open-source robot simulator capable of simulating a wide variety of robots and environments. Gazebo supplied a rich, interactive, fast, and predictable environment in which to teach robots.

1.5 Outline

The remainder of this document is organized as follows: Chapter 2 presents prior work related to the dissertation research. Chapter 3 describes our influence diagram task representation. Chapter 4 discusses our L3D approach to life-long learning from demonstration. Chapter 5 presents our teaching interfaces. Chapter 6 contains the evaluation of L3D and results from four user studies. Chapter 7 summarizes the dissertation.

Chapter 2: Background and Related Work

This chapter surveys prior work related to the major areas of this dissertation. The chapter is accordingly divided into four sections: Manual Programming, Task-Level Robot Control, Learning from Demonstration, and Life-Long Learning.

2.1 Manual Programming

Programming robots manually is typically used to implement custom control policies. Programming languages specifically designed for robotic systems have been developed over the years. Of particular importance is the ability to handle dynamic environments, gracefully recover from errors, and monitor devices. Languages in this category include RAPs [Firby, 1989], TCA [Simmons, 1994], PRS [Georgeff and Lansky, 1987], ESL [Gat, 1996], and Maestro [Coste-Manière and Simmons, 2000].

In addition to programming languages, control architectures have been designed to organize and synchronize multiple robot behaviors into a unified framework. Some of the common architectures include Subsumption [Brooks, 1986], SSS [Connell, 1992], and AuRA [Arkin and Balch, 1997]. These frameworks allow robots to act intelligently by, for example, navigating environments while interacting with dynamic objects.

For the most part, custom programming languages and control architectures have been surpassed by robot software development kits (SDKs). One of the earliest SDKs was Saphira [Konolige and Myers, 1998], developed for use on the Flakey and Pioneer robots for behavior-based control [Arkin, 1998]. While Saphira used PRS, most current SDKs use general-purpose programming languages such as C, Python, or Java. Some of the more common SDKs include Player [Gerkey et al., 2003], ROS [Quigley et al., 2009], Aria [Whitbrook, 2010], and Orocos [Soetens, 2006].

These manual approaches to robot programming form the basis on which robot software and applications have been developed. SDKs have made it straightforward to share code and develop very complex applications. However, they are still tailored for people who have a significant amount of training and experience with robots. For the general public, the intended end-users of service robots of various kinds, these approaches to robot programming are infeasible.

2.2 Task-Level Robot Control

The previous section discussed manual methods of programming a robot that require the developer to have intimate knowledge of the robot's capabilities. Task-level robot control abstracts away these details through the use of symbolic states and actions. Instead of directly reading from sensors or controlling motors, a developer can, for example, specify objects for the robot to pick up or a location to which the robot should move.

Most work in task-level programming has focused on behavior sequences to achieve complex tasks [Lozano-Pérez et al., 1989], [Lozano-Pérez, 1982], [Schrott, 1991], [Narasimhan, 1994], [Segre, 1988], [Chen, 2005]. The decision process that the robot must solve falls into the space of EXP-hard and PSPACE-complete problems due to the large state space. However, abstract states, such as on top of or next to, bypass this complexity problem by alleviating uncertainty in the system. Abstract states encapsulate a wide range of possible concrete states, such as the absolute pose of an object. As a result, the size of the state space is reduced and greater uncertainty about the true concrete state is permitted.

Most task-level robot control approaches in the literature attempt to be platform-independent, i.e., robot-agnostic. However, this goal is generally not fully realized. In some cases, the architecture is too tightly coupled to a specific robot. For example, in Lozano-Pérez et al. [1989] it is assumed the robot utilizes a jaw-like gripper for pick-and-place tasks. In earlier work, the authors' system generated a series of commands in a specific robot programming language, which again precludes robot agnosticism [Lozano-Pérez, 1982]. Gerhard Schrott's [Schrott, 1991] task-level control of a PUMA arm required the use of the VAL-II language [Shimano et al., 1984]; however, these requirements were self-contained to a small portion of the architecture and could potentially be replaced without significant effort.

The work in this dissertation utilizes similar task-level specifications for robot control in order to achieve complex tasks. We rely on symbolic states and actions that allow a robot to perform real-world tasks while avoiding planning complexity. The primary difference between the work in this dissertation and previous work is that we do not require specific robot hardware and software capabilities. State and action symbols are generated by the system designer. We directly use the symbols in the learning process, and represent tasks as a graphical model.

2.3 Learning from Demonstration

As discussed above, manual programming is challenging and requires robotics expertise. Learning techniques can be applied to automate the generation of control policies; however, the state and action space for a robot can be extremely large [Schaal, 1999]. Learning from demonstration (LfD) aims to alleviate this problem by leveraging human knowledge to program a robot through the use of guided examples [Billard et al., 2008].

LfD techniques can be organized along two dimensions: how demonstrations are provided, and how policies are generated from the demonstrations. Guiding a robot through a demonstration can be accomplished using a variety of methods, including manual manipulation [Hersch et al., 2008], teleoperation via joysticks [Chernova and Veloso, 2007, Grollman and Jenkins, 2007b], graphical interfaces [Chernova and Veloso, 2008], external observations [Pollard and Hodgins, 2002, Schaal et al., 2004], or sensors placed on the instructor [Ijspeert et al., 2002, Nakanishi et al., 2004]. Each of these approaches is discussed in the following section, followed by a discussion of policy generation in the LfD context.

2.3.1 Demonstration Approaches

Joysticks provide a convenient and intuitive method for controlling simple mobile robots. In these situations the recorded actions map directly to the robot's capabilities, resulting in a reduction of correspondence errors. However, joysticks become intractable for humanoid robots, or other robots with many degrees of freedom (DOF). Manual manipulation allows the instructor to physically move the robot through a demonstration. This approach offers the same correspondence benefits as joysticks, but allows an instructor to easily interact with robots that have many DOFs. The tedious nature of manually moving a robot is compounded with confusion associated with the mismatch between human DOF and robot DOF [Weiss et al., 2009]. Both joystick control and manual manipulation also provide little feedback to the instructor.

The work presented in this dissertation relies on a graphical interface to provide actions to the robot and relay state information between the robot and instructor [Koenig et al., 2010]. A graphical interface makes it possible to teach a wide variety of tasks to both simple and complex robots. An additional benefit of a customizable robot interface is a greater degree of information sharing that can take place between the robot and the human instructor. This comes at the cost of designing a graphical interface that is user friendly [Myers, 1990] and simple to extend and adapt to various robot and task conditions.

External observation of an instructor relies on sensors, typically cameras, that are not located on the robot. In most cases visual markers are placed in the environment, and have been used to learn tasks ranging from pole balancing [Atkeson and Schaal, 1997] to air hockey [Bentivegna et al., 2002] and human motion [Billard and Matarić, 2001, Ramesh and Matarić, 2002b]. These markers are tracked in biologically derived frameworks [Chella et al., 2006] that generally rely on a manually engineered mapping from instructor states and actions to those usable by the robot.

When working with humanoid or anthropomorphic robots, sensors placed directly on the human provide accurate measurements. This approach has been used to teach drumming [Ijspeert et al., 2002] and walking patterns [Nakanishi et al., 2004] to highly articulated robots. Typically a sensor suit is worn by the instructor, which is expensive, difficult to set up, and cumbersome to wear.

Both marker-based systems and sensors on the instructor require manipulation of the environment in a non-trivial manner. These techniques are therefore relegated to lab settings where the environment can be tightly controlled by experts. A goal of this dissertation is to significantly reduce the constraints on the robot learning system. Robots are rapidly becoming available to the general public, and we postulate that there will be a need to customize such robots on a person-by-person basis. Learning from demonstration provides a compelling mechanism for robot customization. Most people have little to no robotics experience, and methods that require significant experience or a complex setup for teaching will be impractical.

2.3.2 Policy Generation

Given a data set of one or more demonstrations, the robot must learn a policy that maps states to actions. There are three general approaches to policy generation: 1) learn a function that maps observed states to actions, 2) use a transition model from which a system model is derived, and 3) generate a plan that represents a sequence of actions leading from a start state to a goal state.

State-action mapping functions are further decomposed into classification or regression techniques. Classification techniques generate discrete outcomes by grouping similar input together, while regression techniques produce continuous output. In the domain of classification techniques, Gaussian mixture models have been used in driving and soccer domains [Chernova and Veloso, 2007], decision trees have been used to teach flying in simulation [Sammut et al., 1992], hidden Markov models (HMMs) have been used in manipulation tasks [Hovland et al., 1996], and Bayesian networks, generalizations of HMMs, have been used for navigation tasks [Inamura et al., 1999].

Regression techniques generate a model that takes as input robot state and produces continuous robot actions. Locally weighted regression has been widely used to learn motion strategies for tasks such as walking [Nakanishi et al., 2004] and rhythmic drumming [Ijspeert et al., 2002]. This approach suffers from the necessity of retaining all data, which is overcome through the use of locally weighted projection regression (LWPR) [Vijayakumar and Schaal, 2000]. LWPR has been successfully applied to learning soccer skills [Grollman and Jenkins, 2007a], air hockey [Bentivegna, 2004], and adaptive motion trajectories [Pastor et al., 2009]. At the opposite end of the spectrum are methods that generate a complete functional model prior to execution. Neural networks have been used for autonomous driving [Pomerleau, 1991] and for peg-in-hole tasks [Dillmann et al., 1995], among many others. Sparse online Gaussian processes (SOGP) have also been used to learn Aibo soccer skills [Grollman and Jenkins, 2008]. These regression techniques are best used for motor control and skill learning; they cannot readily be applied to complex and long-term decision processes.

System models are often produced through reinforcement learning (RL) [Sutton, 1998]. In most situations, the reward function is manually generated by the user. This method has been successfully applied to manipulation and other tasks [Kaelbling et al., 1996, Peters et al., 2003, Rosenstein and Barto, 2004]. The primary drawback is the need for a reward function that must be specified by an experienced user. This limitation is addressed through inverse reinforcement learning, where the reward function is learned [Russell, 1998]. Atkeson and Schaal have applied inverse RL to learn a pendulum swing-up task [Atkeson and Schaal, 1997]. An alternative interactive method generates a reward function based on user input [Thomaz and Breazeal, 2006b]. RL requires the system to explore the space of states in order to learn a policy. This requirement is especially difficult and time consuming for a robot situated in a social or assistive setting with people.

The final category of policy generation produces a plan from state and action pairings. Actions in this formulation typically have pre- and post-conditions. This is combined with additional information in the form of intentions or other semantic information provided by the instructor. The technique has been used to learn object manipulation based on observations of a human hand [Kuniyoshi et al., 1994]. Teacher intentions have been incorporated to draw attention to particular objects [Nicolescu and Matarić, 2003]. Taking this approach one step further, high-level annotations have been incorporated that provide information about action ordering during hierarchical task learning [Garland and Lesh, 2003].

Recent work has utilized text analysis to generate robot control policies [Bollini et al., 2012]. The authors' current approach is limited to recipes; however, future improvements could extend the approach to other domains. While the method does not facilitate learning, it could be extended to be complementary to learning from demonstration, such as presented in this dissertation.

2.4 Life-Long Learning

Most learning approaches take place within a pre-determined environment, for a specific task, and for a short timespan. The idea that a robot can, and should, continually learn is not addressed by the majority of the learning research. This is mostly due to the complexities associated with life-long learning, which include how to encode, store, retrieve, reuse, and adapt learned tasks.

Thrun and Mitchell were the first to formally address the challenge of life-long learning by stating "Humans typically encounter a multitude of control learning problems over their entire lifetime, and so do robots" [Thrun and Mitchell, 1995]. They approached life-long learning using explanation based neural networks (EBNN) that allowed prior knowledge to bootstrap the learning process. The EBNN framework is explicitly tied to neural networks, which have been shown to work well for simple control tasks but do not scale well to large decision problems. In the LfD context, a Piagetian approach has been used to learn task plans and incorporate task demonstrations into a knowledge base [Pardowitz and Dillmann, 2007].

Other more recent work has focused on life-long learning in the domain of navigation. Autonomous vehicles face many challenges, including an ever-changing environment due to the changes in season, lighting conditions, and weather. By storing and correlating sensor data over time, a robot is able to build a better representation of the world and improve tasks such as localization and mapping [Churchill and Newman, 2012, Hentschel and Wagner, 2011]. In a similar vein, large multi-modal sensor datasets gathered in controlled environments have been used to identify and mitigate perceptual failures for autonomous ground vehicles [Brunner et al., 2009]. Both of these techniques rely on access to large datasets gathered over long periods of time to improve mobile robot navigation performance.

Only recently has life-long learning started to become an active topic of research. This is partially due to the development of reliable hardware and software systems that allow a robot to operate over long periods of time [Meeussen et al., 2011], to the surge in research into human-robot interaction (HRI), and to reusable software becoming available to enable applications such as localization, SLAM, perception algorithms, and simple motion planning.

There are potentially many types of life-long learning, including learning a specific task in diverse environments and learning a wide variety of tasks. Additionally, there can be many algorithmic approaches to realize life-long learning. This work defines life-long learning as the ability to learn a wide variety of tasks over time. We make the following assumptions: 1) accurate training data are provided to the robot via LfD in the form of state and action pairs; 2) learning takes place at explicit times; and 3) the robot has time to encode and store this knowledge into a database. This knowledge is later used during both future learning and task execution. A sketch of how one demonstration could be recorded under these assumptions is given below.
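To make the first assumption concrete, here is a minimal sketch of how a single demonstration might be recorded as symbolic state-action pairs. The `Step` and `Demonstration` classes, the feature names, and the action strings are hypothetical illustrations; the dissertation does not prescribe this particular schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    features: dict[str, str]  # symbolic feature values perceived by the robot
    action: str               # instructor-issued action for that state

@dataclass
class Demonstration:
    task_name: str
    steps: list[Step] = field(default_factory=list)

# One short "fetch" demonstration, stored as state-action pairs
# and ready to be written to the robot's task database.
demo = Demonstration("fetch_cup")
demo.steps.append(Step({"cup_visible": "yes", "holding": "nothing"}, "move_to(cup)"))
demo.steps.append(Step({"cup_visible": "yes", "holding": "nothing"}, "pick_up(cup)"))
demo.steps.append(Step({"cup_visible": "no", "holding": "cup"}, "move_to(table)"))
```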
2.5 Summary

This chapter has discussed robot learning methods most related to the work in this dissertation. We believe that customizable robots capable of learning numerous and diverse tasks will be required by the general public in the near future. The approach we developed to address this challenge uses a combination of learning from demonstration and life-long learning. These two broad categories of learning complement each other by providing a convenient mechanism to acquire new knowledge directly from the end user and store this knowledge for future learning instances and task executions.

A challenging component of LfD is the method used by instructors to guide a robot through a task. This work aims to make the process of learning accessible to as many non-technical people as possible. Chapter 5 reviews existing teaching interfaces and describes our graphical interface, developed to simplify training and reduce the number of teaching demonstrations.

Life-long learning has typically been applied to similar or identical tasks, more akin to life-long adaptation. While life-long adaptation is very useful, it does not directly address how a robot may acquire new tasks over time and reuse task knowledge to facilitate future task learning. Our approach to life-long learning utilizes influence diagrams, a compact hierarchical task representation that facilitates learning different tasks and sharing knowledge between tasks. The following chapter describes influence diagrams and how they can be used for life-long learning.

Chapter 3: Task Representation

The task encoding scheme determines the range of tasks a robot can represent and the type of learning techniques it can consequently employ. This chapter describes influence diagrams and their ability to model decision processes. Algorithms designed to learn the parameters and structure of these graphical models are presented along with examples.

The representation used by a robot to internally store knowledge about a task is a key component that determines the range and types of tasks the robot can accomplish. When framed in the life-long learning context, the representation must facilitate the encoding, storing, and referencing of a potentially large number of tasks. It must therefore be easily accessible, adaptable, and "light weight" in terms of its memory requirements. These requirements greatly reduce the range of available task representation options. This range is further reduced by considering only representations that handle goal-oriented decision problems that can be taught to a robot through demonstrations.

Based on these requirements, we chose influence diagrams (IDs) [Shachter, 1988], known also as decision networks, to represent tasks. An ID is a generalization of a Bayesian network [Pearl, 1988]. The standard structural elements of Bayesian networks are included in an ID, with the addition of decision nodes and value nodes. Decision nodes define a set of actions that are available for execution, and structurally enforce a time ordering in the network. Value nodes define a utility function that is associated with a single decision node. This utility function determines what actions are most appropriate for execution by following a maximum expected utility criterion. In Figure 3.1, each node is pictorially defined by a unique shape. The sketch below illustrates the maximum expected utility criterion on a toy diagram.
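As a minimal sketch of the criterion, consider a toy diagram with one chance node (the state), one decision node, and one value node: each action's utility entries are weighted by the belief over the state, and the action with the highest expected utility is selected. All names and numbers here are invented for illustration.

```python
# Belief over the single chance node after observing evidence.
p_state = {"cup_on_table": 0.8, "cup_in_hand": 0.2}

# Utility table of the value node: U(action, state).
utility = {
    ("pick_up", "cup_on_table"): 1.0,
    ("pick_up", "cup_in_hand"): -1.0,
    ("move_to_table", "cup_on_table"): 0.2,
    ("move_to_table", "cup_in_hand"): 0.0,
}

def expected_utility(action):
    # EU(a) = sum over states of P(state) * U(a, state)
    return sum(p * utility[(action, s)] for s, p in p_state.items())

actions = {a for a, _ in utility}
best = max(actions, key=expected_utility)  # maximum expected utility criterion
print(best)  # -> "pick_up" (EU 0.6 vs. 0.16)
```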
An influence diagram must be generated from data provided by one or more demonstrations of a task. The learning process includes the determination of the network structure and parameters. One of the contributions of this dissertation is a new structure and parameter learning algorithm for influence diagrams that is an adaptation of an algorithm designed for Bayesian network learning.

The rest of this chapter is organized as follows. The next section provides definitions of tasks, their components, and the range of tasks captured by our approach. Next, the approach to generating influence diagrams from demonstration data is described. The chapter ends with an overview of the limitations of this ID-based approach, and a summary.

3.1 Task Definition

In this work, a task is defined as a time-extended decision process composed of a series of actions that act on a set of features. Tasks represent a concrete, goal-oriented activity, such as setting a table or fetching objects. People spend a significant amount of time teaching and learning tasks, which makes learning from demonstration (LfD) well suited to task learning.

Figure 3.1: Venn diagram of the components of Bayesian and Decision networks.

3.1.1 Features and Actions

In our approach, prior to learning, the robot is endowed with a set of core features that it can recognize in the environment and a set of actions it can perform. Through the learning process, the robot combines these features and actions into a larger set of complex tasks. The following two sections define features and actions, how the robot acquires each, and how they are used.

3.1.1.1 Features

Features are generated from perception algorithms that analyze raw sensor data to produce contextual information about the environment. For instance, one perception algorithm could identify tables, another cups. This type of information greatly reduces the complexity of the learning algorithm by providing only relevant information about the environment. Additionally, features are readily understood by human instructors, while raw sensor data are much more difficult for people to parse. This work relies on the existence of such perception algorithms [Ciocarlie, 2012]; we believe that this reliance is reasonable, as such algorithms are continually being produced and refined by robotics and machine perception experts. The greater the number of features available, the greater the range of tasks the robot can perform. Even with relatively few features, the range of tasks can be quite large, as the sequence of actions using the features can be very complex.

We utilize the perception pipeline available in ROS to identify features in the environment. This pipeline allows the robot to identify many common objects; for example, for the kitchen (the domain in one of the evaluation studies in this dissertation), the objects include drink bottles, soup cans, plates, and cups.

3.1.1.2 Actions

Actions consist of short-term predefined behaviors that control the effectors of a robot. Ideally, each action is self-contained and produces well defined effects that are applicable to a wide range of tasks. Example actions include moving to a feature and picking up a manipulable feature. Section 4.2 discusses cases where actions do not perform as expected. Actions contain pre- and post-conditions that determine when they may be used and how to determine their success, as in the sketch below.
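The following is a hedged sketch of what such an action could look like. The `PickUp` class and the symbolic state dictionary are illustrative stand-ins, not the system's actual API, and the motor behavior is reduced to a state update so the example runs end to end.

```python
class PickUp:
    """Short-term predefined behavior guarded by pre- and post-conditions."""

    def __init__(self, obj):
        self.obj = obj

    def precondition(self, state):
        # Usable only when the object is perceived and the gripper is empty.
        return state.get(self.obj + "_visible") and state.get("holding") is None

    def execute(self, state):
        # Stand-in for the underlying motor behavior; here we only update
        # the symbolic state.
        state["holding"] = self.obj

    def postcondition(self, state):
        # Success is judged by the expected effect on the symbolic state.
        return state.get("holding") == self.obj

state = {"cup_visible": True, "holding": None}
act = PickUp("cup")
if act.precondition(state):
    act.execute(state)
assert act.postcondition(state)
```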
Typically, only a minimal set of basic actions is needed in order to perform a larger set of tasks. Basic actions can be used in many instances and linked together to achieve complex tasks. Additionally, the set of actions may be expanded by implementing new actions as needed. Extensive prior work on basis behaviors [Matarić, 1997] and motion primitives [Schaal et al., 2003] forms the foundation for defining actions of the type used in this work. Over the years, the research conducted on fundamental robot navigation and some manipulation behaviors has transitioned from the research setting into readily available software packages, such as ROS [Quigley et al., 2009]. ROS provides pre-defined behaviors used as input to our learning approach, including behaviors for navigation, picking up and putting down objects, opening and closing doors, and stacking objects.

3.1.2 Range of Tasks

The range of tasks our approach can learn is limited by (a) the physical constraints of the robot's embodiment, including height, degrees of freedom, sensors, effector shape, etc., (b) the defined actions and features, and (c) the task representation. Physical constraints limit the type of tasks a particular robot can accomplish, but do not place a hard limit on the overall capabilities. Similarly, actions and features can be expanded as necessary to achieve a greater range of tasks. Task representation, however, is an overarching constraint that defines the true scope of tasks achievable by the approach.

The use of influence diagrams constrains the range of learnable tasks in a number of ways. The first constraint is the time scale on which influence diagrams operate. The time necessary to solve an influence diagram, and thereby generate a control policy, depends on the complexity of the influence diagram. Our empirical experiments show that solution times around 0.1 seconds are expected for influence diagrams that represent time-extended decision problems, such as setting a table or following a recipe.

The second limitation of influence diagrams is their discrete nature. Decision nodes mark specific points in time when an action must be chosen based on the current state. Furthermore, influence diagrams are not well suited for encoding continuous functions. This attribute, in conjunction with the solution time scale, places a bound on the types of tasks our approach can learn, since we are also constrained by the real-time and continuous requirements of the given task.

A final key constraint on the learnable tasks comes from the nature of the learning approach. Since the approach relies on predefined features, which take the form of object recognition and sensor processing algorithms, we rely on the existence of those algorithms. Thus our approach does not include the ability to learn new feature detectors.

Figure 3.2: Range of tasks suitable for representation by influence diagrams (IDs), indicated by the green section. The two red sections highlight other areas of research focused on learning from demonstration. While IDs do not cover the entire task space, they do offer a method to represent tasks that have received little attention from the LfD community to date.

Figure 3.2 categorizes the range of tasks along two axes: manipulation and locomotion. The green area indicates the focus area of this work. The shape of the green polygon reflects the types of tasks our approach can learn from demonstrations. The two large red areas highlight where a majority of prior work in LfD has been focused. To a lesser extent, others have applied LfD to generate tasks that overlap with the green section [Nicolescu and Matarić, 2003, Thomaz and Breazeal, 2006b].

The lower left quadrant contains tasks that do not necessarily require the robot to act in the environment, such as object detection and recognition. Most LfD work assumes that these capabilities are provided to the robot. The upper left quadrant contains navigation and locomotion dominant tasks. Some prior work in LfD has focused on movement to specific objects in the environment, or motion strategies. This work overlaps by allowing a teacher to move the robot to places that contain identifiable features.

The upper right quadrant contains tasks with significant locomotion and manipulation. These are difficult tasks that typically require finely tuned controllers and/or a significant amount of learning and training data. A limited set of these tasks may be achieved through hierarchical learning. The lower right quadrant contains manipulation-heavy tasks such as object grasping and collision-free arm navigation. This is also an area that has received significant attention from the LfD community. Most of the tasks in this quadrant require low-level joint trajectory learning. Our proposed algorithm is not well suited to learning in this space. Instead, it is designed for learning time-extended decision processes.

3.2 Influence Diagram Representation

A Bayesian network can be used by a robot to decide what task action to perform given information about the world. The drawback of using Bayesian networks for representing tasks is that the decision making process is separate from the Bayesian network. This approach would be cumbersome since it limits Bayesian networks to one-shot decision problems, where information about a decision does not affect future observations. Our approach utilizes influence diagrams, which incorporate both the probabilistic model and the structure of the decision problem. We have adapted a technique designed to learn the structure of Bayesian networks for use with influence diagrams. This approach allows a robot to automatically generate influence diagrams from demonstration data.

3.2.1 Learning

3.2.1.1 Parameter Learning

Methods used for parameter learning in Bayesian networks can be extended to influence diagrams by applying a few modifications. Let us assume that the structure of an ID is provided along with a set of data that represent samples from the network. We further assume that the data set contains all the actions taken, and therefore provides complete knowledge about the decision nodes.

The semantics of IDs allow us to make a few novel observations that simplify the process of parameter learning. First, a decision node is only a list of possible actions. A given data set will include those actions, and therefore provides complete information about decision nodes. A value node contains a function that maps parent nodes to utility. This function can be chosen arbitrarily, as long as the resulting network faithfully matches the data. For most cases, a table of weighted frequency counts is sufficient to represent the value of each decision.

We impose two constraints on value nodes: (1) a value node must have a decision node as one of its parents, and (2) there can be at most one value node per decision node. These two rules narrow the space of possible influence diagrams, which helps during structure learning, and simplify the generation of a utility function by enforcing a pairing between actions and utilities.

Based on these rules and assumptions, the expectation maximization (EM) algorithm used in Bayesian network parameter learning can be applied to our influence diagrams. The values for each decision node are directly known from the data, and the utility functions can be determined through frequency counting of actions. This leaves the probability distributions for each chance node, which can be resolved using direct application of the EM algorithm. A sketch of the frequency-count estimate is given below.
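A minimal sketch of the frequency-count estimate, assuming demonstrations have already been reduced to (parent state, action) pairs. The states, actions, and normalization are illustrative, and the EM step for the chance nodes is omitted.

```python
from collections import Counter, defaultdict

# (parent_state, action) pairs extracted from demonstrations.
data = [
    ("cup_on_table", "pick_up"),
    ("cup_on_table", "pick_up"),
    ("cup_on_table", "move_to_table"),
    ("cup_in_hand", "move_to_table"),
]

# Count how often each action was chosen in each parent state.
counts = defaultdict(Counter)
for state, action in data:
    counts[state][action] += 1

# Normalized counts serve as the utility of each action in each state.
utility = {
    state: {a: n / sum(c.values()) for a, n in c.items()}
    for state, c in counts.items()
}
print(utility["cup_on_table"])  # {'pick_up': 0.666..., 'move_to_table': 0.333...}
```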
3.2.1.2 Structure Learning

The addition of the two new node types, decision and value, greatly increases the space of possible networks beyond the standard Bayesian network representations. To further complicate matters, there is temporal information that must be properly represented in an ID.

In the previous section we introduced a few constraints for IDs. Value nodes must have a decision node as one of their parents, and there can be only one value node per decision node. These two constraints limit the range of possible networks without detracting from their ability to represent task policies. The placement of value nodes in an ID has no causal effects, since value nodes must be leaf nodes, and can only be evaluated as a utility function.

By directly tying value nodes to decision nodes, an ID can be converted to a Bayesian network. Every decision and value node pair can be replaced by a chance node. The new chance node has a conditional probability table that consists of actions from the decision node, where the probabilities are the utilities from the value node. Graphically, this new node is shown as a rectangle with rounded corners, as in Figure 3.3.

Figure 3.3: A simple influence diagram converted to a Bayesian network. The value and decision nodes of an influence diagram are combined to form a special chance node depicted by a rectangle with rounded corners.

The ability to structurally convert between an ID and a Bayesian network allows us to apply the Bayesian network structure and parameter learning algorithms to IDs. This approach reduces the search space to that of a Bayesian network, and allows us to implement just one structural learning algorithm. A sketch of the conversion is given below.
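The sketch below mirrors the Figure 3.3 example: a paired decision and value node is folded into a single chance node whose conditional probability table is filled from the utility table, with zeros elsewhere. The dictionary-based node encoding is a simplification for illustration; a full implementation would operate on richer network objects.

```python
def merge_decision_value(decision, value):
    """Replace a paired decision and value node with one chance node whose
    conditional probability table is built from the utility table."""
    cpt = {}
    for state in value["states"]:
        for action in decision["actions"]:
            # Utility of taking `action` in `state`; unspecified entries
            # default to zero, as in the Figure 3.3 example.
            cpt[(action, state)] = value["utility"].get((action, state), 0.0)
    return {"type": "chance", "domain": decision["actions"], "cpt": cpt}

decision = {"actions": ["Action A", "Action B", "Action C"]}
value = {
    "states": ["State 1", "State 2", "State 3"],
    "utility": {("Action A", "State 1"): 0.8,
                ("Action B", "State 2"): 0.1,
                ("Action C", "State 3"): 0.1},
}
node = merge_decision_value(decision, value)
print(node["cpt"][("Action A", "State 1")])  # 0.8
```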
We impose two constraints on value nodes: (1) a value node must have a decision node as one of its parents, and (2) there can be at most one value node per decision node. These two rules narrow the space of possible influence diagrams, which helps during structure learning, and simplify the generation of a utility function by enforcing a pairing between actions and utilities.

Based on these rules and assumptions, the expectation maximization (EM) algorithm used in Bayesian network parameter learning can be applied to our influence diagrams. The values for each decision node are directly known from the data, and the utility functions can be determined through frequency counting of actions. This leaves the probability distributions for each chance node, which can be resolved using direct application of the EM algorithm.

3.2.1.2 Structure Learning

The addition of the two new node types, decision and value, greatly increases the space of possible networks beyond the standard Bayesian network representations. To further complicate matters, there is temporal information that must be properly represented in an ID.

In the previous section we introduced a few constraints for IDs. Value nodes must have a decision node as one of their parents, and there can be only one value node per decision node. These two constraints limit the range of possible networks without detracting from their ability to represent task policies. Placement of value nodes in an ID has no causal effects, since value nodes must be leaf nodes, and can only be evaluated as a utility function.

By directly tying value nodes to decision nodes, an ID can be converted to a Bayesian network. Every decision and value node pair can be replaced by a chance node. The new chance node has a conditional probability table that consists of actions from the decision node, where the probabilities are the utilities from the value node. Graphically, this new node is shown as a rectangle with rounded corners, as depicted in Figure 3.3.

Figure 3.3: A simple influence diagram converted to a Bayesian network. The value and decision nodes of an influence diagram are combined to form a special chance node, depicted by a rectangle with rounded corners.

The ability to structurally convert between an ID and a Bayesian network allows us to apply the Bayesian network structure and parameter learning algorithms to IDs. This approach reduces the search space to that of a Bayesian network, and allows us to implement just one structural learning algorithm. Given a data set consisting of features and actions, learning the structure of the corresponding ID proceeds according to Algorithm 3.1, given below.

Algorithm 3.1 Structural learning algorithm for influence diagrams
Require: Data set D = {d_0, ..., d_n} of n training examples, where d_i = {<f_0, a_0>, ..., <f_m, a_m>} is a series of m feature vector and action pairs.
  Let S be an initial random structure that consists of chance nodes C for the features f and actions a.
  for all d_i in D do
    repeat
      ΔBIC* = max over candidate operations A of ΔBIC(A), and A* = argmax_A ΔBIC(A)
      if ΔBIC* > 0 then
        Set S = op(S, A*)
      end if
    until ΔBIC* ≤ 0
  end for
  for all chance nodes c_j in C do
    if c_j represents an action then
      d_new ← domain names of c_j
      v_new ← values of P(c_j | pa(c_j))
      replace c_j with d_new and v_new
    end if
  end for
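As an illustration of the search loop in Algorithm 3.1, the sketch below performs a greedy BIC hill-climb over single-edge additions and removals on the Bayesian network form of the diagram. It is a simplified stand-in for the structural EM machinery: the data format (a list of feature/action dictionaries), the scoring details, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import permutations

def family_score(data, child, parents):
    """BIC score of one node given a parent set, using frequency-count
    parameter estimates. `data` is a list of dicts: node name -> value."""
    pa_counts = Counter(tuple(row[p] for p in parents) for row in data)
    fam_counts = Counter(
        (tuple(row[p] for p in parents), row[child]) for row in data)
    log_lik = sum(c * math.log(c / pa_counts[pa])
                  for (pa, _), c in fam_counts.items())
    states = len({row[child] for row in data})
    # Penalize by the number of free parameters actually observed.
    penalty = 0.5 * math.log(len(data)) * (states - 1) * max(len(pa_counts), 1)
    return log_lik - penalty

def creates_cycle(edges, new_edge):
    """True if adding new_edge = (u, v) would introduce a directed cycle."""
    graph = {}
    for u, v in edges | {new_edge}:
        graph.setdefault(u, set()).add(v)
    seen, stack = set(), [new_edge[1]]
    while stack:
        node = stack.pop()
        if node == new_edge[0]:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def learn_structure(data, nodes):
    """Greedy hill-climb: apply the single-edge addition or removal that
    most improves the total BIC score, until no operation improves it."""
    edges = set()
    def parents(v):
        return [u for u, w in edges if w == v]
    def score():
        return sum(family_score(data, v, parents(v)) for v in nodes)
    improved = True
    while improved:
        improved, best, best_score = False, None, score()
        for u, v in permutations(nodes, 2):
            if (u, v) in edges:
                candidate = edges - {(u, v)}
            elif not creates_cycle(edges, (u, v)):
                candidate = edges | {(u, v)}
            else:
                continue
            edges, saved = candidate, edges
            s = score()
            edges = saved
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if improved:
            edges = best
    return edges

# A tiny data set in which the action depends on the gripper state.
rows = [{"gripper": g, "action": a} for g, a in
        [("empty", "pick up"), ("empty", "pick up"), ("full", "done")] * 5]
print(learn_structure(rows, ["gripper", "action"]))  # {('gripper', 'action')}
```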
3.2.2 Influence Diagram Updating

Once an influence diagram has been generated from demonstration data, it may be desirable to update and modify its parameters. The original demonstrations may have been incomplete, not covering all the situations the robot could experience, or the original demonstrations may have been incorrect.

Incorrect demonstrations include instances where the instructor wishes to alter the way a robot solves a task. For example, the instructor may have initially taught a robot to measure out all the ingredients for a recipe first, but now would like the robot to cook in a more incremental fashion. Incorrect demonstrations also include instances where multiple different solutions to the same task have been shown to the robot, and the learned influence diagram is unable to determine the correct solution. This occurrence becomes evident when a robot attempts to use an influence diagram, and computes two or more decisions with the same probability. When this event happens, the robot generates a notification that an instructor may respond to by supplying the robot with the correct decision.

An influence diagram is updated by directly incorporating new data into the existing diagram. Samples from the current network are generated, then compared with the updated data set. All samples that are inconsistent are replaced with the updated data. The resulting data are then used to retrain the network.

The process of incorporating new data into an existing network requires little time to complete. It is therefore possible to update an influence diagram in an on-line fashion. Following each action executed by the robot, a human observer can choose a correct action or let the robot continue. If an alternative action is selected, the robot backtracks one step and updates its influence diagram. Execution of the task then continues. The ability to backtrack and undo an action is limited with a real robot due to the irreversible nature of some actions, such as mixing ingredients or cracking eggs. An alternative approach is to use simulation as a tool to train and correct a robot. This technique allows full use of backtracking, and is described in Section 5.4.1.

3.2.3 Hierarchical Influence Diagrams

In addition to their compact representation, influence diagrams can be structured hierarchically. A complete influence diagram can be encapsulated in a node within a larger influence diagram. Multiple influence diagrams can be combined, which effectively allows a robot to reuse knowledge. Figure 3.4, given below, depicts the hierarchical structuring of tasks in our approach.

Figure 3.4: Hierarchical structuring of influence diagrams, where tasks reference other self-contained tasks. An influence diagram representation of a task may be referenced multiple times from different diagrams.
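The reuse idea behind the hierarchy in Figure 3.4 can be sketched by letting a task node reference another task network, which is expanded recursively at execution time. For clarity, this toy example reduces each diagram to an ordered action list; the class names and structure are illustrative, not the dissertation's representation.

```python
class TaskNode:
    """A node in a hierarchical task: either a primitive action or a
    reference to another (sub)task network. Names are illustrative."""
    def __init__(self, name, subtask=None):
        self.name = name
        self.subtask = subtask  # None for a primitive action

class TaskNetwork:
    """A task as an ordered series of nodes; a node that references a
    subtask is expanded recursively when the task is executed."""
    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes

    def flatten(self):
        """Expand subtask references into a flat list of primitive actions."""
        actions = []
        for node in self.nodes:
            if node.subtask is None:
                actions.append(node.name)
            else:
                actions.extend(node.subtask.flatten())
        return actions

# The same fetch network is referenced twice by the table-setting task.
fetch = TaskNetwork("fetch", [TaskNode("move to"), TaskNode("pick up")])
set_table = TaskNetwork("set table", [
    TaskNode("fetch plate", subtask=fetch),
    TaskNode("put down"),
    TaskNode("fetch cup", subtask=fetch),
    TaskNode("put down"),
])
print(set_table.flatten())
# ['move to', 'pick up', 'put down', 'move to', 'pick up', 'put down']
```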
3.3 Complexity

Task learning requires one or more data sets from which the structure and parameters of an influence diagram may be learned. In this work, the data used during learning come from demonstrations provided by human instructors. Ideally, the number of demonstrations is kept to a minimum, in order to prevent instructor fatigue and reduce training time.

In the worst case, a randomly generated data set will generate an ID that is unable to produce a correct solution for the task. The probability of a correct randomly generated data set is inversely proportional to the size of the state space and the number of available actions, formally expressed by:

p = 1.0 / (S · A) (3.1)

where S is the state space size and A is the number of actions for each state. We have introduced simplifications to this equation, including the assumptions that all states are accessible from each other and that each action may be executed in each state. Using this equation, the likelihood of a correct random data set for a task with S = 27 and A = 6 would be 1 in 162. Due to the low probability of such an occurrence, we discount the possibility that a random data set encodes the correct solution. Additionally, application of a random data set will be detrimental to the learning process, as the ID will have a positive example of an incorrect solution. To overcome this situation, at least twice as many correct data sets, for each state of the random data set, are needed in order to correct the ID, given that an ID selects actions based on frequency of occurrence.

In practical applications of influence diagrams, random data are highly unlikely to be used, as they do not aid the learning process. A single task demonstration is sufficient to generate an ID that is capable of producing a correct solution. The demonstration itself must be correct, and a robot that makes use of the ID must have perfect sensing of features and operate in a static environment, relative to the features. This means that a single training example only provides one sample of the state space, and any deviation from that particular path through state space will likely produce incorrect results.

When a deterministic environment cannot be guaranteed, the minimum number of demonstrations can be calculated based on the desired degree of error recovery. Assuming that the first demonstration of a task is correct, a first-order ID would require demonstrations for all the states adjacent to the first demonstration. These additional demonstrations will allow the ID to accomplish the task in the presence of non-consecutive errors.

Additional demonstrations can improve an ID's ability to recover from error. Each additional order of recovery increases the number of states that must be demonstrated, up to the maximum of the complete state space. Based on this, the number of demonstrations required to train an ID is bounded by:

1 ≤ D ≤ S · A (3.2)

where D is the number of demonstrations, S is the size of the state space, and A is the number of possible actions in each state.

The bounds described above assume that demonstrations are provided in a well-structured manner. However, human-generated demonstrations may not be well-structured, and can be inconsistent. The increase in the number of required demonstrations can be approximated using a consistency measure among instructors. With a measure of consistency C in the range 0 < C ≤ 1, Equation 3.2 becomes:

1 ≤ D ≤ (S · A) / C (3.3)

Determining a measure of instruction consistency is a challenging problem beyond the scope of this dissertation. A formal approach to that problem would help to determine demonstration bounds, and would also offer insight into how to structure the task demonstration procedure.

3.4 Limitations of the Approach

There are several limitations inherent to the influence diagram (ID) representation, as follows.
Influence diagrams are designed to evaluate decision processes, and therefore limit how LfD can be applied to robotics. Learning at the motor control level, including learning actions that require fast control loops, such as motion control and trajectory planning, cannot be accomplished using IDs.

Action timing in decision trees may also be easier to identify than in influence diagrams, even though the same information is represented. The size and complexity of the conditional probabilities also increase rapidly with model complexity. This is further compounded if chance nodes have multiple outcomes.

The structural learning algorithm for influence diagrams is a search through the space of possible networks. The search process can be time consuming, and depends on the number of nodes and the complexity of the conditional probability tables in the chance nodes. On-line learning of network structure is therefore not viable. However, parameter updating can be accomplished in an on-line fashion, since a search process is not needed.

In spite of the above inherent limitations, the chosen ID representation is highly effective for life-long learning from demonstration, as summarized next.

3.5 Summary

This chapter introduced our approach to using and adapting influence diagrams as a means to encode and solve decision problems. Our approach relies on demonstration data of a task from one or more instructors. Decisions and values are encapsulated into a special chance node, which facilitates the use of a structural learning algorithm designed for Bayesian networks. Once a network has been generated, the special chance nodes are recast back to decision and value node pairs.

Influence diagrams provide a convenient mechanism to learn goal-oriented decision processes from demonstrations. Their compact representation reduces storage requirements. Furthermore, influence diagrams are human-readable, and can be structured hierarchically, making it possible to combine multiple diagrams in order to represent complex decision processes.

The three key contributions described in this chapter are:

Influence diagram parameter learning: The EM algorithm has been adapted for use with influence diagrams by constraining the structural layout of decision and value nodes. Decision nodes must be paired with a value node in a one-to-one relationship.

Influence diagram structure learning: Influence diagrams can be cast to a Bayesian network by leveraging the decision and value node pairing constraint. The structure of a Bayesian network can then be learned from data using the standard structural EM algorithm, and cast back to an influence diagram.

Influence diagram updating: An influence diagram can be updated by sampling from the diagram, replacing incorrect samples, and retraining the diagram. This procedure can be performed either on-line or off-line.

The next chapter integrates learning from demonstration with influence diagrams. This combination is achieved using a few novel algorithms that produce a life-long learning from demonstration approach to task learning.

Chapter 4
Life-long Robot Learning

Life-long learning requires an entity to gain new information over its lifetime, and to use this information in a meaningful manner. In this chapter we focus on how task information is collected, used, and revised. People are able to rapidly understand new tasks thanks to their ability to reuse and adapt prior knowledge. Over the course of life, we participate in numerous experiences that form the basis for learning.
This large and expanding reservoir of knowledge allows us to function and adapt quickly in a complex world.

Robots, however, do not have an innate ability to continually learn and adapt. Instead of acquiring knowledge over time, robots are generally manually programmed. Learning algorithms can provide a solution to the knowledge generation problem. However, learning control policies is often time consuming and requires significant data. These two issues can be alleviated if a robot is able to access and effectively use prior knowledge.

Life-long learning, in conjunction with learning from demonstration (LfD), allows a robot to gain and reuse knowledge over the course of its functional lifetime. A key element is the ability to reuse prior knowledge, and to integrate this knowledge with LfD. The result is a Life-Long Learning from Demonstration (L3D) approach that simplifies the demonstration process by utilizing prior knowledge, and facilitates the composition of complex tasks from subtasks.

4.1 L3D Approach

When learning a new task, a person generally does not rely on one or more discrete and well defined demonstrations. Instead, one typically performs a task after relatively few demonstrations and then integrates corrections from an instructor or from self-evaluation. The ability to provide corrective information to a robot executing a task is rarely provided to an instructor, and for this work we assume a robot is incapable of self-evaluation. Incorporation of corrective information relies on the ability to store and recall knowledge. Our approach provides a robot with the necessary means to reuse prior knowledge, as follows.

First, the robot acquires new knowledge through one or more demonstrations from an instructor. The data from demonstrations are used to learn both the structure and parameters of a task represented as an influence diagram, as detailed in Section 3.2.1.2. An influence diagram generated for a demonstrated task may be used during subsequent demonstrations. During a new demonstration, the L3D process monitors incoming instructions and compares those instructions to the outputs its existing influence diagrams would produce. A matching algorithm produces the list of matching influence diagrams, which are then listed on the instructor's graphical interface in order of descending likelihood. The instructor uses those options to select an appropriate influence diagram for the current task. Algorithm 4.2 details the process of matching observed instructions to known influence diagrams. If no influence diagram matches, then nothing is presented to the instructor.

This technique allows demonstrations to proceed at a rapid pace, and facilitates the construction of complex tasks from a combination of simple tasks learned during prior demonstrations and information from new demonstrations. For example, an influence diagram that encodes object fetching (move to an object, pick up the object, and move, with the object, to a new location) may be used during a demonstration of a garbage clean-up task. In that situation, the fetch influence diagram behaves much like a predefined action. This allows an instructor to reuse tasks, thereby simplifying demonstrations.

The second scenario involves task modification. LfD approaches generally attempt to reduce the number of demonstrations required to learn a task, as discussed in Chapter 2. A reduction in demonstrations reduces fatigue on the part of the instructor, and helps to promote more consistent and accurate demonstrations.
A consequence is that sparse data can produce incorrect learned behavior. We mitigate this problem through task modification. While a robot is executing a learned task autonomously, an observing instructor may intervene and offer more guidance to correct erroneous robot behavior. The instructor only needs to provide corrective information rather than a complete demonstration. This results in fewer complete demonstrations, as the robot is able to continually adapt by incorporating new information at any point in time.

On-line updating is a convenient property of influence diagrams that facilitates task modification. Once an influence diagram has been generated, its parameters may be adjusted at any point in time. New information about a task may be incorporated by directly inserting instructions from an instructor into a data set sampled from the appropriate influence diagram. The incorrect instructions in the sampled data set, determined based on the instructor's input, are replaced by the new instructions. The resulting data are then used to update the parameters of the influence diagram, as shown in Algorithm 4.1, below.

Sampling a network operates in linear time, proportional to the number of nodes in the network and the number of samples generated. The process of generating a single sample from a network involves sampling from a uniform distribution for each node and propagating the evidence. The whole process typically consumes less than 1 ms, and can be run on-line. Future work will allow the instructor to provide multiple instructions to replace an incorrect instruction.

Algorithm 4.1 Influence diagram parameter updating
while the robot is executing a task do
  if the instructor interrupts task execution then
    ID ← current influence diagram
    I_incorrect ← the incorrect instruction
    I_new ← the new instruction
    D ← Sample(ID) − I_incorrect + I_new
    EM(ID, D)
  end if
end while

Algorithm 4.2 Influence diagram matching
while the instructor is providing a demonstration do
  inst ← current instruction
  D ← current state
  for all id_i in ID do
    inst_i ← Solve(id_i, D)
    if inst_i == inst then
      Increase likelihood of id_i
    else
      Decrease likelihood of id_i
    end if
  end for
end while

Figure 4.1: Collaboration diagram that shows how a single command is sent to the robot.

4.1.0.1 L3D System Interaction

A complete approach to LfD includes a teacher, a GUI, and a robot. The teacher sends commands to the robot and receives feedback from the robot through the GUI. An instance of a single command is depicted in Figure 4.1, above.

The teacher starts by constructing an instruction and sending it to the robot through the GUI. Upon receipt of an instruction, the robot executes the requested action. During action execution, the robot relays useful sensor and state information back to the GUI for display. The details of how the robot actually accomplishes the action and performs object detection are left to the robot designer. This process of sending commands and observing the results is continued by the teacher until they feel the demonstration is complete. During the demonstration, the robot records all the commands and sensor data. Using this information, the robot can learn a policy that matches the demonstration.
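Returning to Algorithm 4.2, a minimal sketch of the matching loop follows. Each known diagram is treated as a callable that maps the current state to its preferred action; the likelihood update step size and all interfaces are illustrative assumptions, not the dissertation's implementation.

```python
def update_matches(likelihoods, diagrams, state, instruction, step=0.1):
    """One round of Algorithm 4.2-style matching: each known diagram is
    solved for the current state, and its likelihood is raised or lowered
    depending on whether its output matches the instructor's instruction.
    `diagrams` maps a name to a solve(state) -> action callable."""
    for name, solve in diagrams.items():
        if solve(state) == instruction:
            likelihoods[name] = min(1.0, likelihoods[name] + step)
        else:
            likelihoods[name] = max(0.0, likelihoods[name] - step)
    # Candidates are presented in order of descending likelihood.
    return sorted(likelihoods, key=likelihoods.get, reverse=True)

# Hypothetical diagrams: each one maps a state to its next action.
diagrams = {
    "fetch": lambda s: "pick up" if s == "at object" else "move to",
    "stir":  lambda s: "stir",
}
likelihoods = {"fetch": 0.5, "stir": 0.5}
ranking = update_matches(likelihoods, diagrams, "at object", "pick up")
print(ranking)  # ['fetch', 'stir'] -- fetch rises, stir falls
```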
4.1.1 Parametrized Tasks

A distinguishing characteristic of L3D is the ability to store knowledge and reuse the stored knowledge at the appropriate time. When a task is learned, a set of influence diagrams is generated. These encode the actions a robot should take in order to achieve a goal, and can be reused in many different environments, limited by a dependence on the set of features originally used during the demonstrations. For example, a robot shown how to fetch a cup will not be able to fetch a bowl, even though the same set of actions and requirements are needed in both instances.

Figure 4.2: Parametrization of a fetch task, including the four learning steps that produce the final influence diagram (ID). Step (a) is the initial ID, (b) the ID converted to a Bayes network, (c) the contents of the ID: the structure and parameters following structural EM, and (d) the final ID converted back from the Bayes network format.

Features used during demonstrations provide instructors with an intuitive teaching mechanism, and ground the demonstration in real-world sensory data. The instructors build human-readable sentences composed of verbs (actions) and nouns (features). However, the features used during a demonstration are not mandatory, and they may be replaced with abstract concepts that only become grounded during task execution. The features encoded in an influence diagram may then be variables that are set at execution time by passing in appropriate values.

Returning to the fetch example, suppose that an instructor demonstrates the process of moving to a cup, picking up the cup, and then moving to a table. The demonstration makes use of two features (cup and table), which we will call Feature A and Feature B. Three actions are also used (move to, pick up, and done, signifying the task is complete). There is no need to abstract the two actions, since they are both required to complete the task.

The influence diagrams generated for the fetch task now contain abstract features that act as parameters, namely the feature to pick up and the place to put the feature. Any future tasks that require fetch can simply specify these two features without the need to provide additional demonstrations. Figure 4.2, located above, graphically depicts the fetch network and the stages it transitions through during learning.

4.1.2 Hierarchical Parametrized Tasks

Three steps are necessary in order to make learned influence diagrams usable across different tasks, as follows:

1. The first step finds the sequences of instructed actions that have the greatest potential for reuse.

2. The second step involves generating influence diagrams for those sequences of actions, and abstracting the diagrams' features.

3. The third step involves forming a hierarchical structure that describes a demonstrated task, and facilitates knowledge reuse.

A complete demonstration of a task may contain many useful subtasks that are applicable for knowledge reuse.
For example, a demonstration of table setting contains multiple examples of fetching objects. The fetch subtask is applicable to a wide range of tasks, while the table setting task itself has a more narrow range of usefulness.

Figure 4.3: Instances of action series of various lengths from a demonstration set involving 31 instructors, each of whom provided on average 77 instructions.

Subtasks generally occur repeatedly during a single demonstration and across demonstrations from various instructors. High frequency of occurrence is used to identify subtasks, and also to construct the final hierarchical structure of the complete task. In order for these subtasks to be used by an instructor at a later point in time, the subtasks need human-readable labels. Currently, these labels are provided by the robot operator. However, it is possible to replay a subtask to an instructor and ask them to provide an appropriate label.

4.1.2.1 Step 1: Identify Action Sequences

The approach used to identify influence diagrams is defined in Algorithm 4.3, below. This algorithm identifies series of identical actions that operate over a similar pattern of features using a sliding window approach. We bound the size of the sliding window to series of two to four actions. The lower bound is chosen based on the need for at least two actions to form a subtask. The upper bound is chosen empirically; see Figure 4.3, above.

Algorithm 4.3 Influence diagram identification
Let D be a series of N instructions for an entire task.
Let C be a set of occurrence counts for a specific series of actions and associated features.
for l = 2 to 4 do
  for d = 0 to N do
    al ← the action list of length l starting at instruction d
    fp ← the feature pattern of length l, where fp is the pattern of similar features used for each action in al
    C[al, fp] ← C[al, fp] + 1
  end for
end for
for each c in C do
  if c > N · 0.1 then
    Generate a reusable influence diagram.
  end if
end for

4.1.2.2 Step 2: Generate Abstract Features

The second step in generating a reusable subtask is to generate abstract features for the subtask from the list of actions identified in Step 1. Features with the same name in the subtask are renamed with the same abstract feature name. The new set of abstract features defines the parameters for a subtask. When a subtask is executed, the abstract features are set to the concrete features relevant to the current environment. As an example, the fetch task makes use of three features, of which two are identical: the feature to move to and the feature to pick up.

4.1.2.3 Step 3: Construct Task Hierarchy

A task hierarchy is generated from the demonstration data from the bottom up. Action series that match the shortest subtasks are replaced first. These are extended with longer subtasks until the complete task has been processed. A hierarchical structure reduces redundancy and facilitates learning of new tasks. For example, a demonstration of a new task may contain two actions in a sequence that the L3D process has previously observed. When this is detected, the L3D process can offer to execute the detected subtask.

Detection of a previously learned subtask is accomplished by comparing all known subtasks to the commands received from an instructor during a demonstration. The L3D process offers possible actions through the graphical interface. The instructor can select either one of the offered actions, or generate a completely new command.
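The sliding-window identification of Algorithm 4.3 can be sketched as follows. Instructions are modeled as (action, feature) pairs, and identical features within a window are abstracted to the same variable index; this representation and the helper names are assumptions made for the example.

```python
from collections import Counter

def find_subtasks(instructions, min_len=2, max_len=4, threshold=0.1):
    """Sliding-window sketch of Algorithm 4.3: count recurring action
    sequences with matching feature patterns, and report those occurring
    in more than `threshold` * N windows."""
    n = len(instructions)
    counts = Counter()
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            window = instructions[start:start + length]
            actions = tuple(a for a, _ in window)
            # Abstract the features: identical features share one variable.
            names = {}
            pattern = tuple(names.setdefault(f, len(names)) for _, f in window)
            counts[(actions, pattern)] += 1
    return [key for key, c in counts.items() if c > n * threshold]

demo = [("move to", "cup"), ("pick up", "cup"), ("move to", "table"),
        ("move to", "bowl"), ("pick up", "bowl"), ("move to", "table")]
# A higher threshold keeps only the sequences that actually repeat,
# such as the fetch-like pattern (move to X, pick up X).
print(find_subtasks(demo, threshold=0.25))
```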
4.2 Error Handling

This section describes how our L3D approach handles different types of error conditions.

4.2.1 During Teaching

This subsection is concerned with errors that occur during the process of providing demonstrations to a robot.

Missing or inaccessible feature: This situation occurs when an instructor requires a feature that the robot does not recognize or cannot properly interact with due to physical limitations. An error of this nature cannot be overcome without changing the capabilities of the robot or modifying the environment. Our current implementation of L3D assumes that all necessary features are accessible by the robot. Future work will address this error condition to a greater extent by providing a feedback mechanism for the instructor. This mechanism will allow the instructor to report the error condition, which will then be logged for an engineer to process.

Missing action: This situation occurs when an action necessary to complete a task is not in the robot's repertoire. Similar to the previous error, this condition cannot be handled by the instructor. The current implementation requires all necessary actions to exist within the robot's repertoire. A life-long learning robot would benefit from the ability to learn new actions from an instructor. Future work will address this error condition through integration of an action-learning algorithm based on work from the current literature. This would result in L3D learning both low-level actions and high-level task plans.

4.2.2 During Task Execution

This subsection is concerned with errors that occur when the robot is autonomously executing a learned task.

Incorrect action selection: During autonomous execution of a task, the robot will occasionally choose an incorrect action. The likelihood of this scenario increases when demonstrations of the task used very different series of actions. If an instructor is present while the robot is operating, then the instructor may intervene and correct the action. A simple interface is used by the instructor to stop the robot, select the incorrect instruction, and provide a new instruction. If an instructor is not present, the incorrect action may eventually lead to a failure to reach the task goal state. When this happens, the instructor can replay the robot's actions and offer corrections. The sequence of states observed and actions executed is recorded by the robot when it executes a task. These actions and features are human-readable strings that are easily interpreted by an instructor. The primary difficulty faced by the instructor is identification of when the error occurred. We facilitate this process through a graphical tool that allows the instructor to step through the robot's log of actions and features. A visualization of each step is provided through simulation. When the location of the error is identified by the instructor, they can insert the correct instruction. The new instruction is incorporated into the influence diagram as discussed in Section 4.1.

Unknown state: Dynamic environments, and environments that differ from those used during demonstrations, can cause L3D to enter a state that has not been encountered. The result is that all actions have equal probability. This scenario is detectable, and is reported to the instructor, who can then offer further guidance. The robot cannot make a decision without guidance from an instructor in this scenario, and must therefore wait for help.
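To make the correction flow concrete, here is a minimal sketch of the sample-replace-retrain update of Algorithm 4.1 applied after an instructor's intervention. The policy representation, the sampler, and the frequency-count re-estimation are stand-ins for the actual influence diagram sampling and EM routines.

```python
import random

def sample_dataset(policy, states, n=100, seed=0):
    """Stand-in for Sample(ID): draw (state, action) pairs from the
    current policy, here a dict of state -> {action: probability}."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        state = rng.choice(states)
        actions = list(policy[state])
        weights = [policy[state][a] for a in actions]
        data.append((state, rng.choices(actions, weights=weights)[0]))
    return data

def apply_correction(policy, states, state, wrong, right):
    """Algorithm 4.1 in miniature: sample from the diagram, replace the
    incorrect instruction with the instructor's correction, and
    re-estimate parameters by frequency counting (our stand-in for EM).
    Assumes every state in `states` appears in the sample."""
    data = [(s, right if s == state and a == wrong else a)
            for s, a in sample_dataset(policy, states)]
    for s in states:
        actions = [a for st, a in data if st == s]
        policy[s] = {a: actions.count(a) / len(actions) for a in set(actions)}
    return policy

policy = {"at door": {"open door": 0.5, "move to": 0.5}}
policy = apply_correction(policy, ["at door"], "at door", "move to", "open door")
print(policy["at door"])  # probability mass shifts to "open door"
```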
4.3 Summary

This chapter described the necessary components of our L3D life-long learning approach. L3D's requirements are that it must: 1) learn new tasks without modification of the underlying algorithms, 2) use knowledge gained in the past to aid future task learning, and 3) be amenable to on-line updates from a human observer.

Learning from demonstration is a natural fit for life-long learning that aids in addressing the above requirements. LfD allows instructors to choose what the robot should learn, and bootstraps the task learning process through one or more demonstrations. The influence diagram representation was chosen for its compactness and its ability to integrate new information through on-line updating. Our complete L3D approach can learn new tasks, leverage knowledge from previous demonstrations, and refine tasks during robot execution.

The primary limitations of L3D are the reliance on predefined actions and features, and the inability to learn low-level actions. Advances in perception and object recognition will increase the range of detectable features. L3D may also be extended to support different learning modes for both task and control learning.

The next chapter describes our graphical interface, and the advantages of using simulation for learning from demonstration.

Chapter 5
Teaching Interfaces

There exist many methods to provide guidance to a robot in the learning from demonstration context. In this chapter we describe our method of providing task demonstrations to a robot through the use of an intuitive graphical interface. This chapter is focused on the demonstration interface and the method through which instructors communicate with a robot learner.

We developed two graphical interfaces, using lessons learned from the first to significantly improve the second version. Both interfaces include bidirectional communication between an instructor and robot student. The first interface utilized a mobile device to communicate with a physical robot. While this interface allowed an instructor to operate in the same space as the robot student, it suffered from a small size and poor bandwidth.

A second interface alleviated these problems by conducting training in simulation with a full-screen graphical user interface (GUI). Teaching in simulation offers several advantages over in situ teaching, including a reduction in training time, the ability to easily correct training errors, and no direct reliance on robot hardware.

5.1 Components

Figure 5.1: Learning from demonstration components. In general, multiple demonstrations, possibly from many different instructors, are required before a robot can reproduce a task.

At the most abstract level, LfD consists of two primary components: task demonstrations from a human instructor, and task reproductions from a robot student, as shown above in Figure 5.1. Similar to human-human teaching, multiple demonstrations are required in order to learn a robust and accurate task representation. The number of demonstrations and their length are ideally kept to a minimum to prevent instructor fatigue and allow learning to take place in a reasonable time frame.

Figure 5.2: Complete framework for learning from demonstrations. Feedback from the robot to the instructor is an optional component. Instructors can provide more demonstrations to fix any ambiguities or errors in the learned task model.

When teaching via demonstrations, the flow of information between the instructor and robot is not one-directional.
Feedback from the robot is vital to the instructor. This information allows the instructor to determine the current state of the robot, and helps the instructor decide how the demonstration should proceed. Feedback can take many forms, including direct verbal communication, gestures, and non-verbal cues [Chernova and Veloso, 2007, Thomaz and Breazeal, 2006a]. Figure 5.2, shown above, details all the components in an LfD framework.

5.2 Role of the Instructor

In a teaching scenario, an instructor acts as an expert on a particular subject. We assume the instructor is knowledgeable on the subject matter, and is capable of generating a demonstration in a non-malicious manner. In the context of people teaching robots, it is possible that the instructor does not understand a robot's physical and expressive capabilities. It is unrealistic to assume an instructor will have sufficient time and expertise to learn about the robot prior to giving demonstrations. Instead, we place the burden on the robot to express itself in a manner that is easily understood by an instructor. Using this approach, instructors can behave naturally and provide better demonstrations to the robot. We address this issue by providing the instructor with textual feedback on the graphical interface.

5.3 Role of the Student

As a student, the job of a robot is to observe its environment and the actions given by an instructor during a demonstration. Based on the information gathered, and all similar demonstrations, the robot should construct or adapt a representation of the task that is suitable for autonomous execution. An instructor relies on information from the student to judge whether or not the student understands what is being taught, and how to proceed with the instructions. It is therefore a vital role of a robot student to accurately convey its state in a timely manner for learning to proceed smoothly.

5.4 Human to Robot Knowledge Transfer

The first step in LfD is to provide one or more demonstrations to a robot. This step aims to accurately transfer skills from an instructor to a robot. The method through which skill transfer takes place is largely determined by the type of learning. In this work the robot is designed to learn tasks, and we have chosen a graphical user interface (GUI) as the method to transfer skills from instructor to student.

Our GUI provides a rich and customizable medium for an instructor. Due to the ubiquity of personal computers in our daily lives, most people are familiar with graphical interfaces. This reduces the learning curve for an instructor and improves their comfort level when interacting with a robot.

5.4.1 Human-Robot Communication

The ability to communicate efficiently and naturally is a requirement for any teaching scenario. In most situations it is assumed that both parties share a common communication medium. However, this assumption does not hold when interacting with a robot. With little day-to-day robot interaction, most people have ill-defined notions of how to interact with a robot. Furthermore, any information provided to a person about a robot prior to interaction will skew their beliefs about the robot's capabilities [S. Paepke, 2010].

Given that an instructor most likely has little or no knowledge about how a robot communicates, the robot should communicate through a medium that is natural and easily understood by the instructor. Following this rule, the instructor can act more comfortably and ideally provide better demonstrations.
Both input to the robot and output from the robot should be expressive enough to capture relevant information efficiently and intuitively for an instructor with minimal to no prior training.

During a demonstration, the primary job of the instructor is to pass information to the robot. Numerous options are available to the robot, including speech recognition, visual observation, joysticks, direct manipulation, and a graphical interface. As previously discussed, there are advantages and disadvantages to each option; we chose to use a graphical interface that utilizes an instructional language composed of actions and the objects on which the actions should be applied.

5.4.1.1 Instructional Language

The two graphical interfaces we developed for LfD each used a human-readable instructional language that took the form of simple imperative sentences. The purpose of the language is to provide a method of passing instructions to a robot, while placing minimal burden on the instructor.

Figure 5.3: The language used by instructors to pass teaching commands to a robot student. An instruction is composed of a verb (action), a noun (feature), and an optional preposition (modifier). Examples: "move to table", "pick up newspaper", "put newspaper down in trash".

The first graphical interface displayed two lists of words. Actions were contained in the first list and features in the second. By selecting an action and then a feature, the instructor could instruct the robot how to act. This interface was sufficient for simple tasks, but was limited in the range of tasks it could represent. For example, it was not possible to string together instructions that contained multiple features, as in "put down the cup on the table".

The second graphical interface used a dynamic list of words in order to create an instruction. In addition to verbs and nouns, prepositions were added, as shown in Figure 5.3 above. The first word an instructor selects is always a verb. Depending on the verb selected, a different set of list boxes would appear. The set of list boxes presented to an instructor is based on the requirements of the selected verb. For example, in order to put a feature down, both the feature to put down and the final location of the feature must be specified. This approach gave the instructor much more freedom to choose instructions, and increased the range of tasks the language could represent.
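The instructional language can be summarized in a few lines of code. The sketch below assembles verb-noun(-preposition) commands in the style of Figure 5.3; the per-verb slot requirements and function names are illustrative guesses, not the dissertation's actual grammar specification.

```python
# Each verb declares the argument slots it requires; these slot sets are
# illustrative assumptions, not the dissertation's actual specification.
VERB_SLOTS = {
    "move to": ("feature",),
    "pick up": ("feature",),
    "put down": ("feature", "preposition", "reference"),
}

def build_instruction(verb, **slots):
    """Assemble a human-readable imperative sentence, mirroring how the
    second-generation GUI presents list boxes based on the chosen verb."""
    required = VERB_SLOTS[verb]
    missing = [s for s in required if s not in slots]
    if missing:
        raise ValueError(f"verb '{verb}' still needs: {missing}")
    if verb == "put down":
        return (f"put {slots['feature']} down "
                f"{slots['preposition']} {slots['reference']}")
    return f"{verb} {slots['feature']}"

print(build_instruction("move to", feature="table"))
print(build_instruction("put down", feature="newspaper",
                        preposition="in", reference="trash"))
# move to table
# put newspaper down in trash
```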
5.4.1.2 Audio Cues

Robots can also make use of natural communication methods in addition to graphical interfaces, as validated in Section 6.2. Audio cues are one medium that is readily accessible to robots, and can typically be straightforward for a person to understand.

Figure 5.4: The web-based survey interface used by participants in the Mechanical Turk audio cue validation study. An audio file is played, and the participant must select what they believe the sound expresses.

Figure 5.5: Results from the Mechanical Turk sound survey. Each plot shows the expression histogram for the (a) success, (b) error, and (c) acknowledge sound.

Utilizing the speakers on a robot to emit sounds is straightforward to implement. However, it is important to choose sounds that are meaningful and understandable to the user. Sounds that may cause confusion or misunderstanding will only hinder the learning process and degrade the demonstrations provided by the instructor. In an attempt to limit any misunderstandings in communication, we validated a set of robot sounds using a survey method described below. We enforced an additional constraint of using only non-verbal robot communication. This decision allowed us to avoid genderizing the robot through speech (male or female, regardless of whether natural or synthetic), minimized the implied intelligence of speech, and eliminated language barriers.

Working with sound designers from Pixar Animation Studios, we developed a set of ten non-verbal audio cues. In order to create a valid mapping from sound to meaning, we conducted an on-line survey. Each participant completing this survey listened to the ten sounds in random order. For each sound, the participant was asked what verbal expression best matched the sound. The participant was informed that the sounds are used by a robot, and a picture of a robot accompanied the sound, as shown in Figure 5.4 above.

This study was posted on Mechanical Turk [Amazon, 2010]; we collected surveys from 100 people. The results indicate which sounds in the tested set are less ambiguous than others. This information helped to guide our choice of sounds for the robot, with the goal of reducing misinterpretation by a teacher. Figure 5.5, above, contains histogram plots for the three least ambiguous sounds in the set.

5.5 Graphical Interface: First Generation

Figure 5.6: The web-based GUI designed to facilitate LfD. Each box is a self-contained widget that can be customized. Additional widgets are easily added, which allows for greater flexibility.

The first graphical interface we developed consisted of five Web-based widgets, as seen in Figure 5.6, above. Each widget was a self-contained entity with clear borders and a descriptive title. The amount of text and buttons was kept to a minimum in order to reduce confusion.

Starting with the top right widget and moving clockwise, the GUI contained an Action Selector widget that allowed an instructor to issue commands to the robot. Inside this widget are two text lists. The list on the left, labeled Actions, contains all the available actions the robot can perform without any additional guidance from the instructor. To the right of the action list is the Objects list, which contains all the features in the environment that the robot can detect and interact with. We assume object recognition and primitive actions are provided to the robot, as described in Section 3.1.1. Using the available actions and objects, the instructor can construct a command for the robot. For example, selecting Move-to and Red Disk tells the robot to move to the red disk. These verb-noun pairs serve to help the instructor understand the capabilities of the robot by phrasing the robot's capabilities in human-readable form.

Below the Action Selector is the Robot Status widget. Displayed within this widget are status messages from the robot. These include messages pertaining to the progress the robot is making toward completing a given command, and error messages. The position of this widget was chosen to lie closest to the Action Selector widget, in order to promote two-way communication and encourage the instructors to use the status messages during the teaching process.

The next widget is titled Task Complete, and contains a single button. This widget exists to provide a mechanism that indicates when a demonstration is complete. The separation of task completion from action selection reduces instructor confusion, and provides a clear ending to a demonstration.

Moving clockwise, the Instructions widget displays a concise list of instructions for the teacher.
While not entirely necessary, providing some measure of help on the GUI prevents the instructor from looking elsewhere for technical help.

The final widget, located in the upper left, is the Camera View. The images displayed in this widget are a video stream from the robot's camera. This widget's primary purpose is to provide the instructor with context about what the robot observes. It also allows an instructor to teach the robot remotely. Sections 6.1 and 6.2 contain information regarding the validation of the GUI.

5.6 Graphical Interface: Second Generation

Figure 5.7: Simulated teaching interface. 1. Robot camera view, 2. Contents of saucepan, 3. Contents of bowl, 4. Left gripper, 5. Right gripper, 6. Robot feedback, 7. Instruction interface, 8. Change view left/right, 9. Demonstration complete.

Following evaluation of the first GUI, we determined that an alternate teaching interface was needed. The first GUI was limited in size and functionality due to the constraints imposed by the mobile device. Additionally, the robot's execution of actions was slow and distracting for the instructor, as described in Section ??. A second GUI was developed that utilized a standard computer in conjunction with a simulated robot student. This design choice alleviated the size and functionality constraints, and allowed the robot to operate faster than real time.

The simulation-based GUI, shown in Figure 5.7 above, contains many of the same components as the first, mobile GUI. A window in the bottom left of the GUI contains the view from the robot's camera. A Finished button in the top right allows the instructor to indicate when a demonstration is complete. Finally, a set of lists in the bottom right allows the instructor to pass commands to the robot.

In the new GUI, the instructor can change the view of the environment by selecting either the Left or Right buttons, which cycle through a set of viewpoints. This allows the instructor to move around in the environment without the complexity of moving a camera in 3D space. Feedback and state information are displayed in a set of text boxes along the bottom of the screen. Textual feedback from the robot is displayed in the bottom right, and the contents of the robot's grippers are displayed to the left.

Located above the textual robot feedback is an undo button that allows an instructor to withdraw the previous command. Undo removes the instruction and state information from the demonstration data and transports the robot back to the state it was in prior to the instruction. The undo button may be used to remove all instructions up to the start of a demonstration, and is only applicable in a simulated environment, due to the complexity associated with moving a physical robot back to a previous state. The undo function may be used to correct an error, for example commanding the robot to pick up an object when both grippers are occupied, or to choose a different sequence of actions. No information about an undo operation is kept, so undone commands are not used during learning.

The final feature of the second GUI is a guided tutorial. The tutorial describes each of the GUI components, and allows the instructor to practice using the GUI.

Two additional user studies were used to evaluate this new graphical interface. The first study performed to evaluate the new GUI is discussed in Section 6.3.
Following this study, two new components were added to the GUI. The first allowed the PR2 to identify when the instructor was demonstrating a previously learned skill and offer the instructor the option to let the robot complete the skill autonomously. The second was the ability to specify where a feature should be placed when it is put down. This is accomplished by selecting a reference feature and a position relative to the reference feature. For example, to put a bowl down to the left of a plate, the instructor first selects the plate and then the position to the left of the plate. These two additional components added more flexibility and control for the instructor. A second study utilized this GUI, and is discussed in Section 6.4.

5.7 Summary

We developed and evaluated two different teaching interfaces for our L3D approach in order to identify the most important properties for facilitating the teaching process. The final, simulation-based interface allows instructors to easily guide a robot through task demonstrations. The interface provides information to the instructor through sensor feedback and robot state information. The instructor may then use this information to guide a robot through a task by constructing imperative sentences.

The first graphical teaching interface was developed for use on mobile devices. The limited screen size of mobile devices made it difficult to demonstrate complex tasks that required a larger instruction set. The second teaching interface moved away from mobile devices and into simulation. This design decision reduced demonstration times, removed ties to robot hardware, facilitated a larger instruction set, and gave the instructor the ability to easily undo commands sent to the robot.

The next chapter utilizes the teaching interfaces with life-long learning from demonstration in a set of user studies involving different tasks.

Chapter 6
Experimental Validation

This chapter presents the four human-robot interaction (HRI) user studies conducted to evaluate life-long learning from demonstration (L3D), the approach described in this dissertation. The first two studies validate the use of the graphical interface and audio cues; the last two validate the process of task learning via influence diagrams. The following summarizes the studies, which are then discussed in detail in the subsequent sections.

Study 1: Visibility During Teaching
The goal of this study was to determine how visibility of a robot affects the performance of an instructor. The participants demonstrated the three-disk Tower of Hanoi puzzle task to a PR2 robot. Half the participants could see the robot while demonstrating/teaching, while the other half could not directly see the robot. The results of the study provide insight into how teaching remotely can affect the demonstrations provided to the robot. In addition, the actions and observations made by the robot during demonstration were collected, and the data were used to learn an influence diagram capable of autonomously solving the Towers of Hanoi puzzle task.

Study 2: Improved Communication Using Sounds
Based on the results from the first study, it was apparent that an additional mode of communication between the robot and instructor is necessary for instructors who can directly observe the robot. Additionally, instructors who cannot observe the robot would benefit from an additional mode of communication, but to a lesser degree.
The purpose of this study was to evaluate the effectiveness of audio cues as a method for the robot to share state information with the instructor. The study task was box sorting, wherein colored boxes were sorted into labeled bins.

Study 3: Simulation-Based Teaching
The first two studies showed how communication modalities between an instructor and student can affect learning from demonstration (LfD), and validated that an influence diagram approach to skill representation fits well within the LfD paradigm. However, the graphical interface used for teaching was limited to simple tasks, and providing demonstrations to a physical robot was shown to be time consuming and prone to errors. This study improved the graphical interface by providing the instructor with greater flexibility in the commands sent to the robot. Additionally, demonstrations were conducted in a simulated environment that operated faster than real time, thereby reducing demonstration time. The demonstration task was accordingly increased in complexity to highlight the ability of our approach to generalize over many divergent demonstrations of the same task. Instructors used a risotto recipe to teach a robot how to make risotto in a kitchen environment. Following a set of demonstrations, the influence diagrams were run on a physical robot.

Study 4: Skill Transfer
In order to demonstrate how skills learned from demonstrations of one task may be used to help learn a new and different task, in this study a robot was taught how to set a table, and reused the skills from the risotto task to reduce both demonstration time and the number of influence diagrams generated.

The rest of this chapter provides the details of each study, including the study designs and analysis of the collected data and results.

6.1 Study 1: Visibility During Teaching

Since demonstrations are a key part of the L3D approach, we are interested in the optimal conditions in which to conduct the demonstration sessions, and the environmental factors that may affect an instructor's actual and perceived performance. As part of that larger challenge, in order to examine the effects of visual obstruction during human-robot interaction, we designed a study which controlled for robot visibility while teaching a task. We postulated that direct visual access to a robot would improve a teacher's ability to understand and estimate the learning robot's state. By having a more complete understanding of the robot's state, the instructor would be able to provide better demonstrations.

We chose the Towers of Hanoi puzzle as the task to be learned by the robot. Towers of Hanoi is a classic puzzle that consists of three pegs and N disks of decreasing size. The starting state for the puzzle has all the disks on the left-most peg, and the goal is to move all the disks to the right peg. Two rules must be followed: (1) only one disk may be moved at a time, and (2) a larger disk may not be placed on top of a smaller disk.

This toy problem was selected for several reasons. For the human, the puzzle is sufficiently challenging to be engaging without being frustratingly complex (given a small N). For the robot, it is a solvable task that is used in artificial intelligence courses because it involves a closed and highly constrained problem space. The puzzle is also well suited for evaluation of an instructor's performance, because there is an optimal solution, allowing for an objective comparison.
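Since the study compares instructor demonstrations against the optimal solution, it is worth noting that the optimum is generated by the classic recursive procedure, which solves the three-disk instance in 2^3 - 1 = 7 moves. The sketch below is purely illustrative and independent of the robot's implementation.

```python
def hanoi(n, source, target, spare, moves=None):
    """Classic recursive Towers of Hanoi solver; produces the optimal
    sequence of 2**n - 1 moves from `source` to `target`."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, source, spare, target, moves)  # clear the way
        moves.append((source, target))              # move largest disk
        hanoi(n - 1, spare, target, source, moves)  # restack on top
    return moves

# The three-disk instance used in the study: 7 moves, left peg to right peg.
print(hanoi(3, "left", "right", "middle"))
```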
To our knowledge, this is the first study to look at the effect of visual obstructions in an HRI context. Prior research in human-human interaction lends some relevant insights. Most work related to interaction between two or more humans when obstacles are present has studied face-to-face communication versus mediated communication. These studies examine the effects of visibility [Clark, 1996, Clark and Brennan, 1991] upon human communication and coordination. Sociological literature has shown that face-to-face contact improves trust among humans and increases cooperation [Rocco, 1998], and voice communication shows marked improvement over text-based alternatives [Jensen et al., 2000]. Voice and face-to-face communication enable people to evaluate each other's state and attitude. Accordingly, we postulated that the visibility of a humanoid robot will benefit a person's ability to understand the state of the robot in a way that would improve the process of teaching from demonstrations.

Past work has shown that people are adept at creating accurate mental models of humanoid robots [Kiesler and Goetz, 2002, Sara et al., 2005]. Participants in this study had the ability to extrapolate their own internal models to fit the characteristics of the robot. These models were used by people to determine the competencies of the robots, based on weak hypotheses. Thus we hypothesized that teachers with an unobstructed view of the robot are more likely to build a more accurate internal model of the robot, which would help them to choose when to send commands to the robot and what those commands should be.

6.1.1 Experimental Design

In a 2-level (robot visibility: directly visible vs. visually occluded) between-participants study design, this controlled study investigated the influence of robot visibility upon human-robot interaction, specifically how human instructors would perform when demonstrating to a robot how to complete the Towers of Hanoi puzzle.

6.1.1.1 Hypothesis

Given our previous assumption that it will be easier for an instructor to develop a mental model of the robot given line-of-sight access, we hypothesized that such a model will be beneficial to the instructor. An intuitive understanding of the robot's state should help the instructor internally formalize a proper teaching and interaction strategy, and make the instructor more comfortable with using the robot.

6.1.1.2 Participants

Twenty volunteers participated in this study, with ages ranging from 22 to 59. Of those, six were female, and the remainder male. The six females were evenly distributed between the visible and non-visible teaching conditions.

Figure 6.1: The (a) physical and (b) simulator visualization versions of the Willow Garage PR2 robot and the Towers of Hanoi puzzle. The colored balls in (b) indicate the positions of the three disks and the pegs.

6.1.2 Methods

6.1.2.1 Manipulation

Each participant was randomly assigned to one of two experimental conditions. Half the participants were allowed direct observation of the robot while conducting a demonstration. These participants are called the visible instructors. The other participants could not see the robot due to a screen behind which they were placed. These instructors could only rely on the first graphical interface (GUI), as described in Section 5.5, for feedback from the robot, and were termed non-visible instructors.
6.1.2.2 Materials

Our version of the Towers of Hanoi puzzle consists of three disks, colored red, green, and blue, with the red disk being the largest and blue the smallest. With only three disks, this puzzle is one of the simplest forms of Towers of Hanoi. The purpose of this study was not to test instructors' performance at solving the puzzle, but to evaluate their ability to use the GUI to instruct the robot under different visibility conditions.

Figure 6.2: Nokia N810 Internet tablet displaying the Towers of Hanoi GUI.

The robot used in these studies is the PR2 mobile manipulator designed by Willow Garage, shown in Figure 6.1a. This robot consists of a wheeled base with two 7-DOF arms, and a pan-tilt head that carries two stereo camera pairs. The PR2 is capable of navigating around typical office environments, and of detecting and interacting with simple objects. Data from the stereo head on the PR2 were used to detect the location of the disks and the pegs. A color blob tracker working in conjunction with the stereo data identified the location of the disks. The pegs were identified based on their height, using the point cloud data from the stereo head. A visualization of what the PR2 was capable of observing is shown in Figure 6.1b. The code used to identify the disks and pegs was written by hand with the aid of OpenCV and ROS.

Communication from the instructor to the robot utilized a handheld mobile device. With the proliferation of smart phones, mp3 players, and other portable multimedia devices, we made the assumption that most people would be familiar, if not comfortable, with using a handheld device. The Nokia N810 Internet tablet, shown in Figure 6.2a, was chosen based on its high-speed wireless, Web-browsing capabilities, touch screen, and open-source operating system. See Section 5.5 for a description of the graphical interface used on the N810.

6.1.2.3 Procedure

Each participant in this study was given a short written instruction sheet that described the Towers of Hanoi puzzle, the robot, and the GUI. Before starting the demonstration, the participant completed a set of general survey questions aimed at assessing the individual's computer experience and current level of frustration. The survey used in this study is described in [Lazar et al., 2004].

6.1.2.4 Measures

The set of measures gathered included the duration of a demonstration, the number of valid commands sent to the robot, and the total number of commands. Demonstration duration was measured from the time the first command was issued by the instructor to the time the instructor selected the Finished button on the GUI. The number of valid commands was a count of all the commands that caused the robot to execute a valid action. The total number of commands was the tally of all commands sent to the robot by the instructor. It was possible to send many extraneous commands to the robot while it was executing a command. These extra commands could not be handled by the robot and were a useful measure of the degree of frustration felt by the instructor, and of how well the instructor understood the state of the robot.

Demographic information, including age, gender, education, and employment, was also recorded. We also collected data concerning computer usage and experience, and willingness to solve computer issues. This information was collected prior to completing the Towers of Hanoi puzzle.
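The three quantitative measures above can be computed directly from a time-ordered command log. A minimal Python sketch follows; the log format is an assumption made for illustration, not the study's actual data format:

    def demonstration_measures(log):
        # log: time-ordered list of (timestamp, command, was_valid) tuples,
        # with datetime timestamps and the Finished press as the last entry.
        duration = (log[-1][0] - log[0][0]).total_seconds()
        total = len(log)
        valid = sum(1 for _, _, ok in log if ok)
        return duration, valid, total

The difference between the total and valid counts then serves as the extraneous-command measure described above.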
Following the completion of a demonstration, the participants were asked about their perceived mental demand, physical demand, temporal demand, performance, effort, and frustration. These measures were drawn from the NASA Task Load Index as a measure of cognitive load [Hart and Staveland, 1988].

6.1.2.5 Data Analysis

All quantitative measures were analyzed using analysis of variance (ANOVA), with the study manipulation of robot visibility as the primary independent variable. The data of one of the participants were outside the range of the others by more than two standard deviations above the mean; we considered them outliers and the data were replaced with the group's average value.

6.1.3 Results

6.1.3.1 Number Of Commands

The number of commands sent to the robot varied depending on the instructor's ability to solve the puzzle efficiently and their understanding of the state of the robot. Figure 6.3b plots the statistics related to the number of commands sent to the robot by each instructor. On the left side of the box plot is the number of valid commands sent, split between separated and collocated instructors; on the right of the figure is the total number of commands. Instructors who could see the robot sent more valid commands to the robot (M=47.1, SD=12.3) than instructors who could not see the robot (M=36.8, SD=7.7), F(1,18) = 6.12, p < .03. Instructors who could see the robot also sent more commands in total (M=36.7, SD=7.7) than instructors who could not see the robot (M=29.9, SD=4.0), F(1,18) = 5.04, p < .04. The total number of commands indicates that instructors who could see the robot may get more frustrated with the robot and/or have a less accurate mental model of the robot than those who could not see the robot.

Figure 6.3: Box plots of the (a) duration times and (b) command counts between collocated and separated instructors.

6.1.3.2 Time On Task

Based on our original hypothesis, we expected those instructors who were behind the visual obstruction and could not see the robot to require more time to complete the Towers of Hanoi demonstration. The results of this study do not support that hypothesis. As seen in Figures 6.3a and 6.3b, both the time to complete the task and the number of commands issued by the instructors actually increased when the instructor had a full view of the robot.

Task duration measured the amount of time it took an instructor to complete the Towers of Hanoi puzzle. Figure 6.3a shows the comparison of visible versus non-visible instructors. The mean duration is longer for visible instructors; however, there is also more variance in that condition. Alternatively, those instructors who were visually blocked from the robot performed more consistently, and were able to complete the task in a slightly shorter timespan. Location as an indicator of duration is not significant, F(1,18) = 2.58, p = .13; however, those data do suggest a trend.

6.1.4 Solving Towers Of Hanoi

Another purpose of the first study was to collect training data in order to enable the L3D system to learn an influence diagram capable of solving the Towers of Hanoi puzzle. The training data used consisted of all the valid commands over the course of the twenty demonstrations. Invalid commands were ignored. Out of the twenty demonstrations, eighteen followed the optimal solution, and two deviated slightly. Figure 6.4 depicts the various states the puzzle can be in.
The optimal solution is shown in green, and the two alternative demonstrations are shown in teal and orange dashed lines. An influence diagram was learned using the structural learning algorithm described in Section 3.2.1.2. The first step in this algorithm learned a Bayesian network from the data, as shown in Figure 6.5a. That network was then converted into an influence diagram, shown in Figure 6.5b. The Bayesian network structure took the system 27 seconds to generate.

Figure 6.4: Enumeration of the possible states for the three-disk version of Towers of Hanoi. The optimal solution path is shown in green. Two alternative solutions, given during demonstrations, are shown in teal and orange dashed lines.

6.1.4.1 Case 1: Optimal Solution

In this case, the puzzle was initialized to each of eight states following the optimal solution: {LLL, LLR, LMR, LMM, RMM, RML, RRL, RRR}. For each state, the influence diagram was used to solve the puzzle. As expected, a solution following the optimal path was generated. This result is rather unsurprising, since most of the demonstrations followed this solution.

Figure 6.5: The (a) Bayesian network and (b) influence diagram learned from demonstration data for the Towers of Hanoi puzzle.

6.1.4.2 Case 2: Alternative Solutions

In this case, the puzzle was initialized to each of the other four states provided during the demonstrations that did not follow the optimal path. These states include: {LLM, LRM, LRL, LML}. The influence diagram solved the puzzle for each of the last three states: {LRM, LRL, LML}. The LLM state happened to be the only state that had conflicting demonstrations. In one demonstration, the instructor taught the robot to move to state LRM, while in another demonstration the robot was instructed to move back to state LLL. As a result, the influence diagram was unable to determine a proper course of action.

6.1.4.3 Case 3: Unseen States

In order to validate that the algorithm would not solve the puzzle in the absence of demonstrations, we initialized the puzzle to states that were never visited during the demonstrations. The influence diagram was unable to solve the puzzle from each of these states. This behavior is both expected and desired. Only data from demonstrations were provided during learning, and the influence diagram cannot extrapolate to unseen states. We make the assumption that teachers do not teach maliciously, and also that they provide demonstrations that the robot should follow. Realistic tasks that are not based on toy problems will have many states, most of which are unlikely to be seen by the robot. We therefore ignore these states until they become important for the robot to understand.

The one ambiguous state and the unseen states together represent a set of conditions in which the learned influence diagram will fail to find a solution. While it is highly unlikely the robot will enter these states, it is desirable that the robot understand how to solve the puzzle from them. The solution to this problem is to provide demonstrations that pass through these bad states and update the influence diagram. Using the interactive update method described in Section 3.2.2, we provided the robot with additional demonstrations from the ambiguous and unseen states. Updating the influence diagram in this manner did not require significant time, since each bad state only required one visit. Once the influence diagram reached a state that it had seen before, it successfully solved the remainder of the puzzle.
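The runtime behavior just described, acting where demonstrations exist and asking for help in ambiguous or unseen states, can be summarized compactly. The sketch below mimics that observable behavior with a simple vote table over demonstrated state-action pairs; it is purely illustrative, since in our system this information is encoded in the influence diagram's conditional probability and value tables:

    from collections import Counter, defaultdict

    class DemonstrationPolicy:
        def __init__(self):
            self.votes = defaultdict(Counter)  # state -> {action: count}

        def record(self, state, action):
            # Accumulate one demonstrated state-action pair.
            self.votes[state][action] += 1

        def select(self, state):
            if state not in self.votes:
                raise LookupError(f"unseen state {state}: demonstration needed")
            ranked = self.votes[state].most_common(2)
            if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
                # Conflicting demonstrations, as in the LLM state.
                raise LookupError(f"ambiguous state {state}: demonstration needed")
            return ranked[0][0]

In both failure cases, the raised request corresponds to the interactive update step: the new demonstration is recorded, and the policy (in our system, the influence diagram) is updated.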
The end result was an influence diagram capable of solving the Towers of Hanoi puzzle from all possible states.

6.1.5 Discussion

The major finding from this study was the effect that the visibility of the robot has on the performance of instructors in the course of LfD. The number of valid commands issued to the robot increased when an instructor was allowed to have a direct line of sight to the robot. While not statistically significant, trends were also identified suggesting that the total number of commands and the overall time duration also tended to increase when the robot was fully visible to the instructor.

While we did not receive permission to video record the participants and code the data, observations of their behavior during the study revealed a few potential causes for the differences in performance. Initial interaction with the robot produced a certain amount of fascination, which distracted the instructor from the demonstration process. The robot's arm movements while manipulating the disks were interesting for the instructors to watch, much more so than the graphical interface. As a result, some visible instructors decided when the robot was done performing an action based on their own visual cues rather than direct information from the robot via the GUI. The act of putting down a disk was a rather slow and deliberate process for the robot, in order to prevent it from missing the peg. Visible instructors became most impatient when watching this action. Frequently, instructors believed the robot had completed the ongoing action and tried to move to the next action, even though the robot's status as still executing the action was clearly indicated through the GUI.

Instructors who were visually blocked from the robot did not have the ability to make direct visual judgments about the state of the robot. Instead, these instructors relied on information provided to them by the graphical interface. As a result, they issued fewer repeated commands. In effect, visible instructors received information from the robot that lacked context. As noted in prior work, merely increasing the amount of information provided does not necessarily increase one's situational awareness [Endsley, 1988]. In our context, and possibly more generally for LfD, it would be more efficient to improve the display and throughput of information from the robot's sensors to the human in order to decrease confusion [Gold, 2009].

Other work has indicated that merely showing a humanoid robot to an observer causes the observer to automatically initiate the process of constructing a mental model of that robot [Gockley et al., 2006, Powers and Kiesler, 2006]. Restricting some instructors from seeing the actual robot limited their ability to generate improper mental models of the robot's states. Non-visible instructors also missed the "wow factor" of the PR2 robot moving its arm. While the graphical interface showed a live video stream from the robot's perspective, the field of view was small and the frame rate was low. This did not convey the same level of interest about the robot's arm movements and was therefore not as much of a distraction.

The final set of qualitative data came from comments recorded at the end of the survey. There was a marked difference between the frequency and content of the comments by the participants in the two groups. Of the visible participants, seven people commented that the robot was slow, two found the experience fun, and three were neutral.
Of the non-visible instructors, only three said the robot was too slow, one found the experience to be fun, and the remaining seven were neutral. These comments support the observation that visible participants experienced greater frustration and impatience by assigning personal state information to the robot based on their visual observations.

Following the demonstrations, an influence diagram was generated that was capable of solving the puzzle from unambiguous states that were visited during the demonstrations. While the influence diagram is not able to extrapolate to unseen states, it is easily updated to handle these states, as explained above. This result validates our ability to successfully generate an influence diagram from relatively little training data. The resulting network can be updated at a future time to handle more states. These results validate that influence diagrams are well suited for use in life-long learning in the LfD context.

6.2 Study 2: Improved Communication Using Sounds

The Towers of Hanoi study indicated that direct observation of a robot does not provide sufficient information to an instructor in an LfD context of the type we explored. Since the robot does not behave like a human, the instructor is unable to properly infer the robot's state. A more distinct and clear form of communication is therefore needed in order to aid the instructor. The use of audio cues or sounds is one option for improving communication; such cues are straightforward to implement on the robot and are quite readily understood by people.

We chose three distinct sounds, to cover three key states to communicate: acknowledgment, error, and success. We believe that these are the key states for the LfD context in particular. As the instructor gives commands to the robot, it is straightforward to convey acknowledgment of a command, an error if one occurs, and success when the command is complete.

The effect of audio cues on instructor performance was evaluated through a study that involved a new task: box sorting. Colored boxes were placed on a table, and two labeled bins were located in front of the table. The goal of the task was to place each box in the correct bin, according to a provided instruction sheet. The robot and environment were simulated; however, the robot simulation operated in real time. Instructors used the graphical interface that was developed for the Towers of Hanoi study, see Section 5.5. The only difference between the interface in the previous study and this one was that, in this study, the objects were labeled, consisting of six colored boxes and two bins.

6.2.1 Experimental Design

In a 2x2 controlled study (audio cues: enabled vs. disabled; robot visibility: directly visible vs. no visualization), audio cues were varied between subjects and visibility was varied within subjects. The goal was to investigate the effects that audio cues from the robot, and robot visibility to the instructor, have on interaction in an LfD teaching context. Performance measures included time on task, the number of commands issued by the instructor, and the number of errors that occurred. The simulated robot was able to perform each command perfectly. As a result, any error was due to an improper command sent by the instructor. The non-verbal audio cues were used to supplement the graphical interface in this study, performed in the context of a box sorting task. The choice of audio cues was validated through an on-line survey of 100 people.
During the course of the box sorting study, we received no indication that the sounds were confusing. The audio cues and on-line survey results are described in Section 5.4.1.2.

6.2.1.1 Hypothesis

Since communication plays a vital role in human-human interactions, it is safe to assume that it will also play a significant role in human-robot interactions. With the robot's ability to express state through audio cues, the instructor should have a more informed understanding of the robot. We thus hypothesize that audio cues will result in a reduction in the number of erroneous commands issued by the instructor, and in a reduction in the time on task.

6.2.1.2 Participants

Twenty volunteers participated in this study, with ages ranging from 22 to 59. Of the twenty instructors, ten were male and ten were female. The men and women were evenly distributed among the test conditions, and the order in which the test conditions were run was randomized.

Figure 6.6: PR2 in the simulated box sorting environment. The PR2 is shown at a distance from the table in order to make the bins clearly visible.

6.2.2 Methods

6.2.2.1 Manipulation

Each participant completed two demonstrations, one in which they could not see the robot, and one in which they could. Half of the participants received audio cues from the robot, while the other half did not. The test cases were randomly assigned and evenly distributed between genders.

6.2.2.2 Materials

The box sorting task was completely simulated in Gazebo [Koenig and Howard, 2004], an open-source 3D robot simulator. Physical interactions, lighting conditions, and materials were all simulated within Gazebo. The result was a well-defined and immersive environment in which a simulated robot could operate in a manner similar to the real world. A simulated version of the PR2 robot was used in this study. Six colored blocks (red, green, blue, yellow, purple, turquoise) sat in a row on a table in front of the robot.

Kit Sheet
Bin A: Red, Green, Blue
Bin B: Yellow, Purple, Turquoise
Instructions: Place each colored block into the bin indicated at the top of each column.
Figure 6.7: A kit sheet that describes which boxes belong in each bin.

To either side of the PR2 were two bins, labeled A and B. The blocks and bins were placed within reach of the PR2's arms, to eliminate the need for movement of the base. The participants were only allowed to control arm and gripper movements. The setup is shown in Figure 6.6.

The graphical interface, described in Section 5.5 and used to communicate with the robot, ran in a Web browser on the desktop computer alongside Gazebo. The only changes between this graphical interface and the one used in the Towers of Hanoi study were the objects with which the robot could interact. A mouse was the only physical device required to use the interface.

6.2.2.3 Procedure

A short instruction sheet was provided to each participant that described the process of box sorting, the instructor's role, and how to use the graphical interface. After reading the instructions, the participant was given a kit sheet, shown in Figure 6.7, and asked to complete a demonstration. Following the demonstration, the participant completed a short questionnaire designed to assess level of frustration. A second kit sheet with different bin assignments for the colored blocks was then completed by the participant, followed by the same questionnaire and a general survey that gathered demographic information.
6.2.2.4 Measures

The set of measures collected was identical to those from the Towers of Hanoi study, see Section 6.1.2.4. Two sets of measures for cognitive load were gathered, one after each demonstration.

6.2.2.5 Data Analysis

All quantitative measures were analyzed using analysis of variance (ANOVA), with the study manipulation of audio cues as the primary independent variable. Three participants in this study were outliers: their number of commands and number of bad commands exceeded two standard deviations above the mean. Those values were replaced with the group's mean.

6.2.3 Results

6.2.3.1 Number Of Bad Commands

The number of bad commands was marginally predicted, assuming a significance cut-off level of p = .05, by the presence of auditory feedback from the robot. Instructors who heard auditory feedback gave fewer bad commands to the robot (M=1.9, SD=1.5) than instructors who did not have any auditory feedback (M=3.3, SD=2.3), F(1,40) = 4.01, p = .056. Figure 6.8a depicts the interaction that sound and visibility had on the number of bad commands. When sounds are enabled, the number of bad commands is reduced in both visibility conditions. When the robot is not directly visible, the number of bad commands is also slightly reduced.

Figure 6.8: Interaction plots of sounds and robot visibility on (a) the number of bad commands, and (b) the total number of commands.

6.2.3.2 Total Number Of Commands

The total number of commands issued to the robot was influenced by visual access to the robot. When instructors had visual access to the robot, they gave slightly more commands to the robot in total (M=27.1, SD=2.1) than when they did not have visual access to the robot (M=26.4, SD=2.2), F(1,40) = 3.99, p = .01. This effect is significant at the p = .05 level, and is identical in nature to that found in the first study, which indicates that the effect holds regardless of the sound condition. Figure 6.8b shows the interaction that sound and visibility have on the total number of commands. Audio cues decreased the total number of commands in both cases, with a slightly more dramatic effect when the robot was visible.

6.2.3.3 Time On Task

Neither auditory nor visual feedback from the robot was found to be a significant predictor of time on task, so we ran a regression analysis of robot experience as a predictor of time on task. The most significant predictor of the time that an instructor spent on the task was the amount of experience that instructor had with robots. The more experience the person had with robots, the less time s/he spent on the task, beta = -6.30, p < .01.

6.2.3.4 Perceptions

After completing each demonstration, the instructor was asked to complete a survey designed to measure cognitive load. Data from this survey indicate that instructors who received auditory feedback from the robot user interface perceived that they exerted less effort on the task (M=2.1, SD=1.5) than people who did not receive auditory feedback from the robot user interface (M=3.4, SD=2.1), F(1,34) = 4.60, p < .05.

Auditory feedback also impacted the instructors' perceived physical effort. Instructors who received auditory feedback experienced a slightly greater physical demand during the task (M=1.7, SD=0.9) than people who did not receive auditory feedback from the robot user interface (M=1.1, SD=0.3), F(1,34) = 6.38, p < .05. No significant differences were found for mental or temporal demand.
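The outlier rule and group comparison used in these analyses can be stated compactly. Below is a minimal Python sketch; the numeric data are placeholders for illustration, not the study's measurements:

    import numpy as np
    from scipy import stats

    def replace_outliers(values):
        # Replace values more than two standard deviations above the
        # group mean with the group mean, per the analysis procedure.
        v = np.asarray(values, dtype=float)
        cutoff = v.mean() + 2 * v.std()
        return np.where(v > cutoff, v.mean(), v)

    # One-way ANOVA on bad-command counts, split by the audio-cue condition.
    with_audio = replace_outliers([2, 1, 3, 2, 0, 2, 1, 4, 2, 2])
    without_audio = replace_outliers([3, 5, 2, 4, 3, 1, 6, 3, 4, 2])
    F, p = stats.f_oneway(with_audio, without_audio)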
6.2.4 Discussion

Based on comments from the instructors, the collected data, and observations from the Towers of Hanoi study, it was clear that human instructors need more useful feedback from the robot than just visual observations. This trend was also noted in [Kim et al., 2009], where instructors did not wait for the robot to complete an action before providing more input.

Results of this study show that sounds helped to improve the performance of instructors. The added channel of communication provided instructors with vital information about the state of the robot. Comments from the instructors indicated that the audio cues properly informed them as to when the robot was ready to accept an instruction, experienced an error, and received a new instruction. Visibility of the robot still impacted instructor performance; however, the audio cues did temper this effect.

6.3 Study 3: Simulation-Based Teaching

The purpose of the third evaluation study was three-fold: (1) to demonstrate the usefulness of teaching in a simulated environment; (2) to increase the freedom instructors had while teaching a robot; and (3) to demonstrate the bootstrapping of life-long learning by generating a set of skills that may be reused during future demonstrations of different tasks.

Figure 6.9: Simulated kitchen environment with ingredients and utensils necessary to cook risotto.

As in the second study, a simulated environment was constructed using Gazebo. In this study, the environment was much more complex, and consisted of a kitchen setting with utensils, ingredients, and a PR2 robot, as shown in Figure 6.9. Gazebo also provided the graphical interface through which instructors communicated with the PR2 robot. This provided a consistent teaching interface, and eliminated the need to shift focus between the robot and the teaching interface, which was necessary in the previous two studies.

The task used in this study was cooking mushroom risotto. All instructors were provided with the same recipe, and were free to choose how to follow the recipe in order to make the mushroom risotto. Once all the demonstrations were complete, the learned influence diagrams were executed on a real PR2 in a real kitchen environment that contained the same objects, but placed in different relative locations. The ingredients and utensils used in the real environment were simplified so that the PR2 could observe and grasp them. For example, large and/or heavy objects cannot be manipulated by the PR2, and clear objects cannot be seen. Replacement objects included plastic cups, a lightweight saucepan, and a knife protected in a sheath to prevent damage to the robot and people.

The results from this study demonstrate the feasibility of teaching solely in simulation and later transferring the learned skills to a real robot. The freedom provided to instructors resulted in a wide variety of demonstrations. This produced positive and negative results. On the positive side, the robot experienced a large portion of the state space. On the negative side, it is more difficult to generate a task policy from demonstrations that do not agree, i.e., from inconsistent training data.

6.3.1 Experimental Design

The graphical interface and L3D learning system were used to teach a robot the process of cooking mushroom risotto. The demonstration environment consisted of a simulated kitchen, PR2 robot, ingredients, and utensils.
Simulation was used to reduce teaching time, provide a stable environment, and facilitate error correction through an undo feature that reverses instructions. The primary hypothesis tested in this study was that the instructors could teach the robot in simulation, and that later the learned skills could be transferred to a physical robot. The simulated environment was carefully designed to be realistic, but not to replicate the environment in which the real robot later executed the learned task.

The simulated environment contained all the ingredients required to make mushroom risotto and a few additional items, including salt, pepper, and red wine. None of the ingredients were pre-measured; however, we made the assumption that the robot could pour exact amounts. A few utensils were also placed within the environment, including a bowl, saucepan, cutting board, knife, and spoon.

Mushroom Risotto Recipe
Ingredients:
2 cups chicken broth
3 tablespoons olive oil
Portobello mushrooms, chopped
Shallots, chopped
2 cups rice
1 cup dry white wine
Chives, chopped
1/2 cup Parmesan cheese
Directions:
1. Warm 2 tablespoons olive oil in a large saucepan.
2. Add in the chopped mushrooms, and cook for 5 minutes. Remove mushrooms from saucepan and set aside.
3. Add 1 tablespoon olive oil to saucepan, and add in the chopped shallots.
4. Add rice to saucepan and stir. After 2 minutes pour in wine.
5. Add 1/2 cup chicken broth to the rice, and stir. Continue adding broth 1/2 cup at a time until 2 cups have been added, stirring rice between additions of the chicken broth.
6. Remove saucepan from heat, and add in mushrooms, chives, and Parmesan cheese. Stir.
Figure 6.10: Mushroom risotto recipe used in the study.

A fifteen-step tutorial guided participants through each component of the graphical interface. The tutorial was built directly into the graphical interface, allowing the participants to experiment with the interface prior to teaching the robot. Participants spent on average four minutes following the tutorial. Upon completion of the tutorial, participants received a printed mushroom risotto recipe, shown in Figure 6.10. At that point, each participant was free to instruct the robot until they deemed the recipe complete. Following a demonstration, each participant completed a survey that collected demographic information.

6.3.2 Results

Thirty-one participants completed the demonstration, 24 male and 7 female, with an age range of 23 to 67. The average time to complete the demonstration was 27 minutes, with a standard deviation of 10 minutes. In contrast, a PR2 robot took on average 92.1 minutes to complete the task. On average, each participant sent the robot 77 instructions, with a standard deviation of 27.8 instructions. On average, three errors were generated during a demonstration and the undo feature was used four times, with standard deviations of 2.6 and 7.6, respectively. Errors occurred when an instructor requested that the robot perform an impossible action, such as picking up an object when both grippers already contained objects.

As expected, the instructions sent to the robot were similar across participants in the beginning of the demonstration. The instructions diverged over time as participants chose to complete the recipe using a different order of instructions, see Figure 6.11. Influence diagrams generated from the demonstration data are shown in Figure 6.12.
The diagrams are depicted in an abbreviated notation for clarity and to highlight the hierarchical structure.

Figure 6.11: The first twelve instructions from all the teachers. Each edge is labeled with the number of occurrences. The highlighted path is the set of instructions most instructors provided to the robot.

Figure 6.12: Abbreviated hierarchical influence diagrams for the risotto task. Each bubble indicates the action taken by the robot. The conditional probability and value tables have been removed for clarity.

Within the risotto task, the fetch skill was commonly used, along with skills that put or pour the features that were fetched. Labels were manually assigned to these skills. Tasks other than risotto may also use these skills, as shown in the following study.
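To make the hierarchy concrete, the following sketch shows how composite skills of the kind labeled in Figure 6.12 might expand into primitive robot commands. The expansion is an illustrative flattening written for this discussion; the actual system executes hierarchical influence diagrams rather than fixed command lists:

    def fetch(obj, dest):
        # Fetch skill: acquire an object and carry it to a destination.
        return [("move to", obj), ("pick up", obj), ("move to", dest)]

    def fetch_put(obj, dest, relation):
        # Composite skill: fetch, then place relative to the destination.
        return fetch(obj, dest) + [("put", obj, dest, relation)]

    def fetch_pour(obj, dest, amount):
        # Composite skill: fetch, then pour a given amount.
        return fetch(obj, dest) + [("pour", obj, amount)]

    # Opening steps of the risotto task, expanded to primitive commands.
    plan = (fetch_put("mushrooms", "cutting board", "on")
            + fetch_pour("olive oil", "sauce pan", "2tbs"))

Because fetch appears inside both composite skills, a single learned fetch diagram can be reused wherever either composite occurs, which is the essence of the skill transfer demonstrated in the final study.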
Following the demonstration phase, the simulated PR2 executed the learned influence diagrams autonomously. The first thirteen commands produced by the influence diagrams correctly controlled the robot through the initial steps of the risotto recipe. After the thirteen-step horizon, the influence diagrams began to produce incorrect results, as evaluated by a human observer.

Incorrect results included execution of incorrect actions, and inability to determine an action to execute. The first type of error occurred when an influence diagram encoded an inappropriate value function or conditional probability table. The second type of error occurred when value functions contained multiple actions with equal value, and the robot could not determine which action to use. In both cases, a human may intervene and provide the system with the correct behavior. This new information is incorporated directly into the decision network through the update process described in Section 3.2.2. For this study, a simple command line interface was used to correct the robot during the autonomous execution phase. When the PR2 executed an incorrect action, I paused the robot and provided correct information. The entire process took 11.4 minutes, and required twenty-one corrections. After one round of corrections, the simulated PR2 was able to correctly execute the risotto task.

Figure 6.13: The kitchen environment and PR2 executing the risotto task. (a) Front view of the kitchen environment; ingredients were on the table and shelf behind the PR2. (b) Rear view of the kitchen environment with the end of the table that held the knife and cutting board. Cups were used as ingredients to reduce grasping errors and simplify object detection. The stove and saucepan were also replaced with simpler objects. Additionally, the knife was protected with wood to prevent damage to the robot and people.

The learned influence diagrams were then executed on a real PR2 robot in a kitchen environment that mimicked the simulated version, see Figure 6.13. The environment contained the same set of ingredients and utensils as found in the simulated version, with different physical forms. Cups that the PR2 was capable of grasping were used to represent ingredients. A knife holder kept the blade safe and in a position that allowed the PR2 to pick it up, and a lightweight saucepan constructed of foam core prevented it from slipping out of the PR2's gripper.

Both the initial demonstrations and the task corrections conducted in simulation allowed the influence diagrams to produce a control policy that accurately guided the PR2 through the risotto task. However, the PR2 did not execute the task perfectly, and required assistance in many instances. Object recognition is imperfect, which occasionally led the PR2 to incorrectly detect the state of the environment. This situation led the PR2 to incorrectly choose an action. The problem was alleviated by a human observer, who had to stop and correct the robot.

The grasping and manipulation system on the PR2 periodically failed to grasp objects. If the failure resulted in the object remaining upright, then the PR2 tried again, since the post-condition of the grasping action required the object to be in the gripper. However, if the object was knocked over or onto the floor, human operator intervention was required.
The final type of error occurred during navigation, when the robot either became stuck on an uneven section of the floor, which required manual operator intervention, or knocked an ingredient over with its arm while moving about the environment. When an object was knocked off the table, the robot continued with the task and eventually reached a point at which it was unable to proceed due to an invalid state. At that point the robot asked for operator help through a text message.

6.3.3 Discussion

This study gathered demonstration data from a large group of participants using a simulated environment, aggregated the information into a set of hierarchically structured influence diagrams, and executed the results on a robot in a similar real-world environment. The simulated demonstration environment reduced teaching time, allowed instructors to quickly undo mistakes, and integrated the graphical teaching tools with the robot student.

The initial set of influence diagrams generated from demonstrations was not capable of successfully executing the risotto task. This was due to the wide variety of demonstrations provided to the robot. Corrective information was given to the robot by the operator while it executed the task in simulation. The corrective information was directly integrated with the robot's influence diagrams, which resulted in an accurate control policy.

When used in a real-world kitchen-like environment, the influence diagrams produced results that matched those found in simulation. A highly dynamic environment caused the robot to fail. Recovery from unexpected changes to the world state was difficult, especially since the influence diagrams were not shown those types of errors. Thus, the robot continued executing the task until it reached a point at which the state of the world did not match its expected state. Human intervention was then required to enable the robot to continue task execution.

Figure 6.14: Simulated environment that contains a typical kitchen and dining room.

The ability of influence diagrams to continually integrate new skills could offer a partial solution to unexpected state changes. Instead of a person correcting the world state, demonstrations could be used to show the robot how to correct the world state itself. These demonstrations could in turn be used to generate new influence diagrams which the robot could use to self-recover.

6.4 Study 4: Skill Transfer

Our definition of life-long learning requires a robot to both continually learn and integrate known skills with the current learning task. This final study demonstrated how our L3D approach to life-long learning transfers skills between tasks. Specifically, the study demonstrated how the system can make use of skills learned from the risotto task while learning an entirely different and new task, table setting.

As in the previous studies, the instructors used a graphical interface to teach a simulated PR2 robot how to set a table in an environment consisting of a kitchen and dining room, see Figure 6.14. Two additional components were added to the graphical interface, as described in Section 5.6, which allowed the instructor to select previously learned tasks and to place a feature relative to another feature.

6.4.1 Experimental Design

The simulation-based graphical interface was used to teach a PR2 how to set a table. The demonstration environment consisted of a kitchen and a dining room.
Plates, bowls, cups, and silverware rolls were located in the kitchen, and the dining room consisted of a table with six placemats. Objects not directly relevant to the table setting task were also available to the instructors, including a stove, sink, and refrigerator. The same actions used in the risotto task were available to the instructor, and an additional, superfluous open action was incorporated.

Instructors were presented with the same GUI tutorial used during the risotto task, with the addition of a description of the automatic skill recognition element and placement features. Upon completion of the tutorial, each instructor was free to guide the robot through the table setting task. Influence diagrams were then generated from the demonstration data, and executed autonomously by the robot in simulation.

6.4.1.1 Hypothesis

We hypothesize that access to previously learned tasks will reduce the time required to demonstrate a task to a robot. The number of instructional errors should also decrease, because the robot will be executing more actions autonomously. Both of these hypotheses are based on the assumption that instructors will utilize the previously learned tasks instead of only passing individual commands to the robot. Using the L3D approach, influence diagrams learned in the previous study should be transferred to the table setting task. Instructors who use previously learned skills during the demonstration process inform the learning algorithm which influence diagrams to use.

6.4.2 Results

Ten volunteers participated in this study, with ages ranging from 24 to 33. Of the ten participants, seven were female and three male. An additional five participants, three male and two female, also took part in the study. These five participants did not have access to previously learned tasks; instead, they had to send all instructions to the robot. The results from these five participants were used only as a time comparison against participants who had access to previously learned tasks.

The average duration of a demonstration for the first ten participants was 21 minutes, with a standard deviation of 3.1 minutes. The last five participants achieved an average duration of 29.4 minutes, with a standard deviation of 4.3 minutes. Access to previously learned tasks successfully reduced the time required to complete a demonstration, assuming that relevant previously learned tasks exist. Each instructor used 54 instructions on average, with a standard deviation of 12.3. The undo feature was used only three times, by two different instructors. Instructor errors occurred five times, produced by two instructors. The infrequent use of the undo feature and the relatively few errors also indicate the advantage of using previously learned tasks: the greater the portion of a task executed autonomously, the fewer the demonstration errors.

The primary skill used during demonstrations consisted of fetching tableware and placing it on the dining room table. As a result, the fetch skill learned from the risotto task was heavily used by the instructors, an average of eight times per instructor. Every instructor assumed that each place at the table should be set with every piece of tableware, even though this was not explicitly stated. Placement order of the tableware and relative locations were the primary differences between instructors. Three instructors placed all tableware of the same type first, while the remainder fully set each place setting before moving to the next one.
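The automatic skill recognition element mentioned above must decide when a stream of primitive instructions matches a previously learned skill. Our system performs this matching against stored influence diagrams (Section 4.1); the sketch below illustrates the idea with a much simpler greedy, longest-pattern-first matcher over command verbs, and is not the dissertation's actual algorithm:

    SKILLS = {
        "fetch": ["move to", "pick up", "move to"],
        "fetch_put": ["move to", "pick up", "move to", "put"],
    }

    def recognize(commands):
        # commands: list of (verb, arguments...) tuples from the instructor.
        # Greedily label the stream with the longest matching known skill;
        # unmatched commands pass through unchanged.
        i, labeled = 0, []
        patterns = sorted(SKILLS.items(), key=lambda kv: -len(kv[1]))
        while i < len(commands):
            for name, pattern in patterns:
                window = [c[0] for c in commands[i:i + len(pattern)]]
                if window == pattern:
                    labeled.append(name)
                    i += len(pattern)
                    break
            else:
                labeled.append(commands[i][0])
                i += 1
        return labeled

Recognized skill labels can then be offered back to the instructor, who confirms which learned influence diagram the robot should execute autonomously.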
6.4.3 Discussion

The drastic differences between instructors highlight a difficulty associated with LfD when the order of operations is not important. Our approach, along with most others, favors the solution most often demonstrated. However, this is not the ideal solution, since individual preferences are important. We address this problem by associating influence diagrams with instructors. Demonstrations provided by an instructor are weighted heavily, and are used to produce influence diagrams that are tailored to the specific instructor. These diagrams are then used when the same instructor demonstrates a new task or has the robot execute a task autonomously. A simple login system can identify an instructor and load the relevant influence diagrams. Our study did not store instructor identities, as this requirement only emerged as a result of the study; instead, learned diagrams were assigned unique integers.

Individual influence diagrams still allow for task reuse, even between instructors. The fetch skill was used by every instructor, and was in most cases followed by a put down instruction. As a result, the fetch and put influence diagram learned in the previous study was transferred to the table setting task using the process described in Section 4.1.2.

The ability to utilize previously learned tasks showed a reduction in demonstration duration and the number of errors, which matched our hypothesis. This result is contingent on the availability of relevant previously learned tasks. The graphical interface performed a key role in this result by providing a tutorial exercise that described how to use learned tasks. Tied to the demonstration time reduction is a reduction in commands sent to the robot, which in turn reduces the number of instructional errors.

6.5 Summary

This chapter described the four studies conducted to evaluate the presented life-long learning from demonstration (L3D) approach. The first study showed that instructors are negatively affected by directly watching a robot perform a task. The average time to complete a demonstration and the number of commands sent to the robot both increased for visible teachers. Based on observations and comments, this result appears to stem from the participants' fascination with watching the robot move. This indicated that an additional mode of communication is necessary in order to keep the instructors focused on the process of providing a demonstration.

Audio cues were added to the robot in the second study. These cues were triggered whenever the internal state of the robot changed. As a result, the instructors were better informed about the robot student. This caused the participants to use fewer commands when instructing the robot.

The next study used the actions and observations collected by the robot during the Towers of Hanoi demonstrations. These data were used to generate an influence diagram capable of solving the Towers of Hanoi puzzle. Without any modifications, the learned influence diagram was capable of finding a solution only in the states that were shown during the demonstrations. With additional training, the same influence diagram was improved to solve the puzzle from any valid state.

The previous studies were conducted in constrained environments, where instructors were limited by the range of commands the robot could execute and the features with which the robot could interact. A more complex scenario was designed that involved teaching a PR2 robot how to cook mushroom risotto.
The results show that significant differences exist in the way instructors completed the task. An additional step of correcting the PR2 while it attempted to execute the risotto task autonomously was required to generate valid results. This additional step may be used at any time to provide a robot with corrective information, without the need to conduct a complete task demonstration.

The final study utilized skill transfer, reusing skills generated from the risotto task in a table setting task. Instructors were given the ability to choose skills the robot could execute autonomously. This ability reduced demonstration time, and allowed the robot to concretely reuse prior knowledge in a new task. The table setting task also highlighted how individual preferences among instructors play a role in how tasks are demonstrated. Rather than generalize over all demonstrations and eliminate individual preferences, we created unique influence diagrams for each instructor.

The results from each study informed the next, and collectively they demonstrate the complexities associated with task learning from demonstration. There is significant variability in the way people provide demonstrations, and this variability should be used to provide individuals with a robot tailored to their needs. This still leaves room for knowledge reuse across tasks, and for the ability to continually learn new and more complex tasks as the robot's skill set grows.

Chapter 7
Summary

The central motivation of this work has been the idea that robots will become an integral part of our lives and, as such, that they need to both continually learn and adapt to individual users' needs. Current technology, both hardware and software, allows a robot to achieve a wide range of tasks. However, skilled experts are required to program the robot. This currently limits commercial robots to a small set of pre-programmed tasks. We have detailed an approach that integrates learning from demonstration (LfD) and life-long learning into life-long learning from demonstration (L3D), designed to allow a robot to learn tasks from people who have no experience using or interacting with robots. Our approach utilizes an intuitive graphical interface overlaid on a simulated environment as a tool through which people can teach a real-world robot household and other tasks.

7.1 Life-Long Learning

Life-long learning requires a robot to acquire new information over its lifetime, and to use this information to facilitate future learning and improve performance. The approach to life-long learning described in this work utilizes influence diagrams, a flexible and hierarchical task representation that allows a robot to use task information across multiple tasks. Life-long learning is complementary to LfD: previously learned tasks are incorporated into new training demonstrations, which in turn reduces demonstration duration. An important aspect of life-long learning is facilitating knowledge reuse. We accomplish this by allowing instructors to directly use previously learned tasks via the graphical teaching interface. Additionally, the system itself can identify influence diagrams that match training data, and incorporate the appropriate influence diagrams during the learning phase.

7.2 Influence Diagrams

In our approach, the demonstration data collected by the robot in the course of the teaching session(s) are combined into influence diagrams, a compact and hierarchical representation of tasks.
Our approach to learning influence diagrams utilizes a few key constraints, including a one-to-one link between decision nodes and value nodes, and knowledge of the available robot actions. These constraints allow us to convert influence diagrams to Bayesian networks, generate the structure and parameters of the network through a structural EM algorithm, and convert back to an influence diagram for later execution.

7.3 Simulation-Based Teaching

Simulation has proved to be a useful tool for LfD. A simulated robot can operate faster than real time, reducing demonstration time and instructor fatigue. Additionally, errors can be easily corrected by the instructor during a demonstration by undoing commands. With the tie between the demonstration process and the physical robot removed, demonstrations can be provided by more instructors in less time. The primary drawback of this approach is inherent to simulation: simulation technology does not accurately model all the dynamic properties of the real world. This limitation may lead to demonstrations that work in simulation but not on a physical robot.

7.4 Limitations

Our approach does not lend itself to learning and executing policies that require fast control loops. However, it is our belief that multiple different algorithms should be integrated in order to realize a robot capable of learning everything from motion strategies to task plans. Well-designed interfaces that facilitate integration of multiple robot control and learning approaches are within reach. We hope that this work identifies a compelling approach to task-level learning, one that can mesh well with other current and future learning and human-robot interaction techniques.

7.5 Future Work

7.5.1 Task Classification

Influence diagrams used to represent tasks, as described in Chapter 3, provide avenues for future work in task classification. As an example, consider a person who is watching a video of a chef mixing ingredients for making a cake. The viewer will use prior experience, based on observations or direct experience, to classify the chef's actions as "baking". A robot with a similar classification ability could identify observed behavior, organize learned task models in a logical manner, and potentially integrate observed actions into stored influence diagrams.

An analytical approach to task classification, based on influence diagrams, could identify many similar tasks. Two or more tasks may be compared using a degree of similarity between the expected actions and features, in much the same way our approach compares influence diagrams to actions provided by an instructor during demonstrations, as described in Section 4.1. However, this approach could produce incorrect results for tasks that have different actions but fall into the same category. For example, sweeping and mopping have different sets of actions, but both can be classified as cleaning.

Keywords, in the form of task meta-data, could be assigned to an influence diagram to help provide the additional information necessary to correctly classify tasks. For example, influence diagrams for sweeping and mopping could have the overlapping keywords of cleaning and floor. Instructors could provide these keywords at the end of a demonstration, in much the same way videos and pictures are now tagged on web sites. Finally, forward simulation of a stored task may be used as a comparison tool against an observed task. A classifier may use the flow of features over time as a metric.
For example, the repetitive back and forth motion of a duster may match the motion of a known sweeping task. Analytical influence diagram comparison, access to human-generated meta-data, and forward simulation may provide the fundamental components of a task classification algorithm. Such an algorithm would help a robot to categorize learned tasks, identify where knowledge reuse is applicable, and allow it to make intelligent assumptions about the tasks being performed by observed people or robots. This approach will be further enhanced by the ability to learn tasks in an on-line manner.

7.5.2 On-line Learning

A drawback of our approach is the required off-line learning phase, during which demonstration data are processed to generate influence diagrams. This off-line requirement prevents the robot from actively incorporating new knowledge into the teaching process, and thereby from showing the instructor that the teaching process is producing a meaningful outcome. A method to efficiently generate influence diagrams in an on-line manner would benefit the teaching process, and reduce the down-time between demonstrations. A starting point for an on-line learning algorithm would be to run the learning algorithm as a background process that continually integrates new demonstration data and feeds the resulting information back into the graphical interface.

Bibliography

Amazon. Mechanical Turk, 2010. URL https://www.mturk.com.

R. C. Arkin. Behavior-Based Robotics. MIT Press, May 1998.

R. C. Arkin and T. R. Balch. AuRA: Principles and practice in review. Journal of Experimental and Theoretical Artificial Intelligence, 9(2):175–188, April 1997.

C. Atkeson and S. Schaal. Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.

D. Bentivegna. Learning from Observation Using Primitives. PhD thesis, College of Computing, Georgia Institute of Technology, Atlanta, GA, July 2004.

D. C. Bentivegna, A. Ude, C. G. Atkeson, and G. Cheng. Humanoid robot learning and game playing using PC-based vision. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2002.

A. Billard and M. J. Matarić. Learning human arm movements by imitation: Evaluation of a biologically inspired connectionist architecture. Robotics and Autonomous Systems, 37(2):145–160, November 2001.

A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Robot Programming by Demonstration, chapter 59. MIT Press, 2008.

M. Bollini, S. Tellex, T. Thompson, N. Roy, and D. Rus. Interpreting and executing recipes with a cooking robot. In International Symposium on Experimental Robotics (ISER), Quebec City, Canada, 2012.

C. Breazeal, D. Buchsbaum, J. Gray, and B. Blumberg. Learning from and about others: Towards using imitation to bootstrap the social competence of robots. Artificial Life, 11(1-2):31–62, 2005.

J. G. Bremner. Infancy. Wiley-Blackwell, Malden, MA, 1994.

R. A. Brooks. A robust layered control system for a mobile robot. IEEE Transactions on Robotics and Automation, 2(1):14–23, April 1986.

C. Brunner, T. Peynot, and J. Underwood. Towards discrimination of challenging conditions for UGVs with visual and infrared sensors. In Proceedings of the 2009 Australasian Conference on Robotics and Automation (ACRA), Sydney, Australia, December 2009.

A. Chella, H. Dindo, and I. Infantino. A cognitive framework for imitation learning. Robotics and Autonomous Systems, Special Issue on The Social Mechanisms of Robot Programming by Demonstration, 54(5):403–408, 2006.
Abstract
Programming a robot to act intelligently is a challenging endeavor that is beyond the skill level of most people. Trained roboticists generally program robots for a single purpose. Enabling robots to be programmed by non-experts and to perform multiple tasks are both open challenges in robotics. The contributions of this work include a framework that allows a robot to learn tasks from demonstrations over the course of its functional lifetime, a task representation that uses Bayesian decision networks, and a method to transfer knowledge between similar tasks. The demonstration framework allows non-experts to demonstrate tasks to the robot in an intuitive manner.

In this work, tasks are complex time-extended decision processes that make use of a set of predefined basis behaviors for actuator control. Demonstrations from an instructor provide the necessary information for the robot to learn a control policy. An instructor guides the robot through a demonstration using a graphical interface that displays information from the robot and provides an intuitive action-object pairing mechanism for issuing commands to the robot.

Each task is represented by an influence diagram, a generalization of Bayesian networks. These networks are human-readable, compact, and have a simple refinement process. They are not subject to exponential growth in states or branches, and can be combined hierarchically, allowing for complex task models. Data from task demonstrations are used to learn the structure and utility functions of an influence diagram. A score-based learning algorithm searches through potential networks to find an optimal structure.

Both the means by which demonstrations are provided to the robot and the learned tasks are validated. Different communication modalities and environmental factors are analyzed in a set of user studies. The studies feature both engineer and non-engineer users instructing the Willow Garage PR2 on four tasks: Towers of Hanoi, box sorting, cooking risotto, and table setting. The results validate that the approach enables the robot to learn complex tasks from a variety of teachers, refine those tasks during on-line performance, successfully complete them in different environments, and transfer knowledge from one task to another.