Copyright 2023 Bingling Huang

Reward Shaping and Social Learning in Self-organizing Systems through Multi-agent Reinforcement Learning

by

Bingling Huang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MECHANICAL ENGINEERING)

August 2023

Acknowledgments

I am sincerely grateful for the invaluable opportunity to undertake research at USC and write this thesis. The years spent pursuing my Ph.D. hold a special place in my heart, representing the most challenging period and the most significant accomplishment of my life thus far. These experiences have not only honed my critical thinking skills but also molded me into a person I never thought possible. I have truly benefited from this remarkable journey of academic and personal growth.

First of all, I would like to extend my deepest gratitude to my advisor, Prof. Yan Jin, for his exceptional guidance, support, and boundless patience throughout the entire research process. Prof. Jin's expertise in advanced design and vast knowledge across multiple disciplines have nurtured my growth toward becoming a researcher, allowing me to mature in both skill and mindset. He not only mentored me in technical matters but also led me to think independently and build my own research map. Without his guidance, I would not have been able to take a single step forward in this endeavor.

In addition, I would like to express my heartfelt gratitude to the esteemed members of my proposal and dissertation committee: Prof. Ivan Bermejo-Moreno, Prof. Aiichiro Nakano, Prof. Geoffrey R. Shiflett, and Prof. Mitul Luhar. Their invaluable guidance, constructive feedback, and suggestions have greatly enhanced the quality of my research. I truly appreciate their encouragement and contributions.

Furthermore, I would like to thank my friends in the IMPACT Laboratory at USC, namely Dr. Hao Ji, Dr. Zijian Zhang, Dr. Hristina Milojevic, Edwin Williams, Yunjian Qiu, Chuanhui Hu, and Xinrui Wang. Their selfless contributions in sharing their experiences, knowledge, and suggestions have helped me greatly in advancing my research work.

Lastly, I want to express my thanks to my parents and friends, whose support has been a source of strength for me. Their unconditional love, encouragement, and constant presence have played a pivotal role in my academic pursuits. I am forever grateful for their steadfast companionship and support.

Bingling Huang, June 2023

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1. Introduction
1.1. Background and Motivations
1.2. Mechanical Engineering Design
1.3. Multiagent System and Self-organization
1.4. Deep Reinforcement Learning Approach
1.5. Research Questions
1.6. Overview
Chapter 2. Related Work
2.1. Multiagent Reinforcement Learning
2.2. Heuristic Knowledge in MARL
2.3. Reward Function and Shaping Reward Design
2.4. Complexity Measurement
2.5. Social Learning in MARL
2.6. Knowledge Transfer Learning
Chapter 3. Methodology
3.1. Reinforcement Learning
3.2. Multiagent Reinforcement Learning
3.3. MASo-DQL: A Framework of Social Learning in MARL
3.4. Design of Shaping Reward Fields
3.5. Complexity Measurement
Chapter 4. Reward Shaping and its Impact on Self-organizing Systems
4.1. Introduction
4.2. Task Description
4.3. MARL Modeling and Reward Function Design
4.4. Experiment Design
4.5. Results and Discussion
4.6. Summary and Findings
Chapter 5. Social Learning in MARL
5.1. Introduction
5.2. Social Learning Modeling
5.3. Experiment Design
5.4. Results and Discussion
5.5. Summary and Findings
Chapter 6. Knowledge Transfer in SOSs and Impact of Social Abilities
6.1. Introduction
6.2. Knowledge Transfer Schema
6.3. Experiment Design
6.4. Results and Discussion
6.5. Summary and Findings
Chapter 7. Conclusions
7.1. Summary
7.2. Contributions
7.3. Future Directions
Bibliography

List of Tables

Table 1. Task model summary
Table 2. Pseudo code of Multiagent Deep Q-learning (MA-DQL)
Table 3. Notation and definition of MASo-DQL
Table 4. Pseudo code of Multiagent Social Deep Q-learning (MASo-DQL)
Table 5. CCFs in assembly tasks involving collision avoidance
Table 6. Environment settings in task: "L-shape" assembly
Table 7. Hyperparameter settings
Table 8. Hyperparameter settings
Table 9. Notations in MASo-DQL in NS, ON2, ON4, and ON5
Table 10. Environment setting in "L"-assembly tasks with collision avoidance
Table 11. Task complexity contributory factors (CCFs)
Table 12. Task complexity estimations
Table 13. Task complexity estimations: calculation processes
Table 14. Team energy efficiency evaluation matrix
Table 15. Team energy efficiency evaluation area illustration
Table 16. Teamwork division
Table 17. Team energy efficiency evaluation metric

List of Figures

Figure 1. Structure of cFORE
Figure 2. Chang Chen's approach to designing a CSO system
Figure 3. Three layers of abstraction to the Meta-Interaction Model
Figure 4. MARL structures
Figure 5. Heuristic knowledge in MARL
Figure 6. Task view and social views of a robot
Figure 7. Distributed map of social levels
Figure 8. Social level examples
Figure 9. Simulation system architecture of MASo-DQL
Figure 10. Task of "L-shape" assembly introduction
Figure 11. Box self-angle α and relative angle β
Figure 12. Illustrations of reward shaping fields P1, P4 and P5
Figure 13. Illustrations of reward shaping fields P2 and P3
Figure 14. Experiment design for the shaping reward study
Figure 15. Illustration of final configuration scatter plots in the form of angle α and angle β
Figure 16. Final configurations for 3-agent teams
Figure 17. Final configurations for 5-agent teams
Figure 18. Final configurations for 7-agent teams
Figure 19. Final configurations for 9-agent teams
Figure 20. Convex shaping reward functions with different gradients
Figure 21. Final configurations for 3-agent teams
Figure 22. Final configurations for 5-agent teams
Figure 23. Final configurations for 7-agent teams
Figure 24. Final configurations for 9-agent teams
Figure 25. Euclidean distance boxplot of training results with different reward shaping fields
Figure 26. Mean and standard deviation of Euclidean distance with different reward shaping fields
Figure 27. Task performance comparison with different team sizes
Figure 28. Learning curve comparison of different team sizes with P3-fields
Figure 29. Area division around the dynamic box
Figure 30. Agent with different social abilities in the context of box-push task
Figure 31. Experiment design for the social learning study
Figure 32. New tasks illustrations
Figure 33. Team energy efficiency area division
Figure 34. Team strategy examples in different areas
Figure 35. Conflicted action pairs
Figure 36. Final configurations of NS Teams in Task 1, 2, 3a, and 3b
Figure 37. Final configurations of teams with different social abilities in Task 1
Figure 38. Final configurations of teams with different social abilities in Task 2
Figure 39. Final configurations of teams with different social abilities in Task 3a
Figure 40. Final configurations of teams with different social abilities in Task 3b
Figure 41. Euclidean distance boxplot of training results
Figure 42. Learning curves of NS, ON2, ON4, and ON5 Teams in Task 3b
Figure 43. Learning curves in Task 1, 2, 3a and 3b
Figure 44. Reward difference to baseline (NO SOCIAL)
Figure 45. Team energy efficiency evaluations for Task 1
Figure 46. Team intelligence evaluation fields for Task 3b
Figure 47. Team performance records in Task 3b
Figure 48. Teamwork division of different teams in Task 3b
Figure 49. Experiment design for the knowledge transfer study
Figure 50. Final configurations of 10-agent teams (#Tr = #Ts = 10) in Task 1
Figure 51. Final configurations of teams with size of 2, 4, 6, 8, 12, 14, 16, 18 in Task 1
Figure 52. Boxplot of Euclidean distance of transferring to teams in Task 1
Figure 53. Boxplot of Euclidean distance of teams in Task 3b
Figure 54. Final configurations of 10-agent teams with social abilities in Task 3b
Figure 55. Final configurations of transferring social abilities in Task 3b
Figure 56. Boxplot of Euclidean distance of teams with social abilities in Task 3b
Abstract

Self-organizing systems (SOSs) feature flexibility and robustness for tasks that may endure changes over time. Various methods, e.g., applying task-fields and social-fields, have been proposed to capture the complexity of task environments so that agents can remain simple. To expand to complex task domains, the multiagent reinforcement learning (MARL) approach has been taken to train agent teams to be more capable and intelligent, permitting reduced load and complexity in characterizing tasks and devising agent knowledge. The design of reward functions in MARL is key, and it is heavily dependent on the experience and knowledge of the designers. The challenge is how one can provide better understanding and guidelines for SOS designers, particularly in complex task domains. To deal with this challenge, we have introduced a general form of reward shaping fields and investigated the impact of various types of fields on the performance of team task executions. The experiment results demonstrate the effectiveness of reward shaping and reveal that singularities, the proper form and shape of the fields, and suitable settings for a specific team are essential factors for successful training. In addition, we investigate the potential of social learning as a means to enhance AI team performance through internal interactions and develop a novel framework for integrating social learning into MARL. The experiments exhibit the significant benefits of social learning, including improved team capabilities and better energy efficiency for complex tasks. Furthermore, we examine the process of transferring learned knowledge between teams of different sizes to address the challenges posed by the high training costs in MARL, considering influential factors such as knowledge quality and team social ability in the transfer process. The experiment results indicate the distinct effects of transferring knowledge learned by a smaller (or larger) team to a larger (or smaller) team. The conceptual framework and the experimental insights obtained from this research contribute to intelligent and autonomous system design by deepening our understanding of reinforcement-learning-based self-organizing systems and providing methods and guidelines for designing more capable and knowledgeable agent teams, through effective reinforcement learning, for performing complex tasks and ensuring high-quality system performance.

Chapter 1. Introduction

1.1. Background and Motivations

Complex systems are applied in a wide range of engineering fields, such as production and assembly lines, industrial warehouse robotics, and air force and navy systems. As task requirements get higher and task environments become more unpredictable and complicated, the demand for a high level of automation and intelligence arises, leading to higher requirements on complex system design. Therefore, constructing a complex system with high robustness, adaptability, resilience, flexibility, and stability has drawn great attention from designers, researchers, and engineers.

Self-organization is a concept that originates in the biological sciences and has gradually developed into a promising approach [1]. It refers to the process by which an arising overall order emerges from members' local interactions [2], and to the ability of a class of self-organizing systems (SOS) to change their internal structure and functions in response to changing external circumstances [3].
Due to its advanced features, such as high-level adaptability and decentralized control [4], it has become one of the promising strategies for solving the problem of complex system design. One major advantage of the self-organizing systems approach is that each agent in a self-organizing system can be kept relatively simple, i.e., possessing very limited knowledge, while the emergent behavior of the overall system can be expected to be sophisticated enough to deal with various demanding tasks. These advantages make self-organization attractive in a wide range of engineering applications. For example, swarm robots [5] can replace humans in repetitive or dangerous tasks, such as package delivery and construction. Through the self-organizing method, teams can behave cooperatively and spontaneously without global control or detailed decentralized design. Even though an individual may have limited sensing and movement abilities, teams have the potential to achieve remarkable collective accomplishments.

Self-organizing systems (SOS) design can be achieved through several approaches. Divide and Conquer (D&C) is an intuitive approach [6] that follows human logic to decompose a big design problem into several sub-problems. However, that process requires a relatively deep understanding of the system and the task. As more objects are involved in a system, the amount of information to be processed grows greatly, and the complexity of interactional relationships among elements increases exponentially. Such designer-centered approaches show limitations in pre-defining solutions and practicing a top-down fashion.

When designers are looking for ways out of the system complexity issue, we can borrow inspiration from nature. For example, fish schooling is a well-studied example of self-organization: the fish cluster automatically follows others toward food without any leaders. Bird flocking is another example; flocks demonstrate complex patterns of collective flying relying on local interactions among group members [7]. Inspired by these biological phenomena, Zouein [8] developed a new design framework, the DNA-based Cellular Formation Representation framework (cFORE), as shown in Fig. 1, addressing issues of adaptive systems design. The framework represents an artificial system in the manner of biological systems to achieve high adaptability. In nature, DNA (deoxyribonucleic acid) is "an organic chemical that contains genetic information and instructions for protein synthesis" [9]. It holds "design information" by generating proteins to perform every function. Borrowing from that idea, an "artificial DNA," Design DNA (dDNA), is designed as a "design information" carrier to assist engineering design. A formation task in which a group of cells regulates themselves from a "spider" shape to a "snake" shape has verified that dDNA-guided cellular self-organizing can achieve functional and adaptive multiagent systems and has shown the method's potential. It contributes to deepening the understanding of self-organization design with respect to biological system design and development, system composition and representation, design information representation and retention, system-to-environment interaction, and system-to-system interactions [8]. What's more, the use of biological principles in a design framework pushes forward the development of natural, life-like adaptive systems.

Figure 1. Structure of cFORE.

However, the adaptability of the framework is limited by the pre-defined design information stored in the dDNA.
On the one hand, designers need to code the global goal and interaction principles, which poses challenges of varied representation. On the other hand, the ability of the agent teams is restricted by the quantity and quality of the stored data. Once the data is inserted, the system is hard to evolve or scale up.

To overcome this issue, Chang Chen proposed a novel biology-inspired system representation called Behavior-based design DNA (B-dDNA) [10], based on the previous work on Cellular Self-organizing Systems (CSO), as shown in Fig. 2. In addition, a mechanism called Field-driven Behavior Regulation (FBR) was proposed to implement and synthesize the system Designing, Formation, Operation, and Adaptation processes [10]. The field-based behavior regulation (FBR) approach [11] utilizes artificial fields to guide agents' self-organization. The research contributes a novel idea of regulation based on behaviors. The method excels at regulating group behaviors mathematically by connecting specific behaviors with functional, system-level, operational, and adaptation requirements. However, building the field concept demands a great deal of heuristic knowledge from designers. In addition, interaction among agents is not considered in the model, which is not a common case in practical engineering.

Figure 2. Chang Chen's approach to designing a CSO system.

Concentrating on the interaction problems, Chiang detailed a new CSO approach that considers the interaction among agents, inspired by natural phenomena [12]. A Meta-Interaction Model (MIM) [12], a behavioral model of agent interactions, was developed through a parametric approach centered on interactive behaviors, as shown in Fig. 3. The model focuses on the relationship between local agent interactions and emergent collective system behaviors. By considering the "social" attribute, group behavior regulation in an SOS becomes more reliable. From the design perspective, the Cohesion-Avoidance-Alignment-Random-Momentum (COARM) Behavioral Model was established for the Meta-model implementation, which applies parameterization to manipulate the multi-functionality of cells. However, this research makes a simple assumption and uses an inflexible interaction mechanism in which each agent is only aware of its neighbors' information. Another issue is the difficulty, for designers, of defining hyperparameters for the meta behavior matrix.

Figure 3. Three layers of abstraction to the Meta-Interaction Model.

Khani considered both "task" and "social" aspects in one self-organizing system based on the previous work [13]. She proposed a two-field mechanism to enable CSO systems and to investigate the way social rules among agents influence CSO system performance. By adding a "social" field to the system, the information received by agents is enriched, which benefits task performance in more complex environments. The task field exercises high-level control that ensures each mCell executes a useful movement, while the social field conducts low-level control so that each cell performs correctly at the local level to accomplish its task. However, the work lacks task variety. Describing a task in proper mathematical formats remains challenging as task complexity increases. More intelligent approaches are desired.

Humann [14] continued to work on the complex systems design issue through self-organization.
Based on previous approaches (field-based behavior regulation and behavior parameterization), Humann aimed to solve the issue of a lack of emphasis on system-level understanding [14], in response to the call for a unified CSO framework for modeling, design, analysis, and optimization. To solve the problem, Humann applied a genetic algorithm (GA) for selecting hyperparameters [14]. The ontology for self-organizing systems includes behavior modeling and behavior regulation. Several case studies have shown that GA-based computational optimization is well-grounded and beneficial for adaptable systems design. However, the work is limited by the randomness of the GA. It still requires designers to hold a large amount of knowledge to search for a proper way of representing the task and the corresponding design requirements.

Although the field-based behavior regulation approach is effective, developing definitions of both task and social fields for complex task domains can be challenging. As artificial intelligence techniques bloom rapidly and mature, learning-based approaches are being taken advantage of in SOS design to let agents acquire their behavior knowledge in a trial-and-error process. Considering that the bottom-up design framework needs only a small extent of top-down guidance, researchers are making efforts to release humans from deep participation and make the design processes freer and more intelligent while maintaining high design quality.

Hao Ji utilized deep multiagent reinforcement learning in the design of self-organizing systems and carried out physical structure assembly simulations to explore its practicality and inner mechanism [3]. The multiagent reinforcement learning (MARL) method is employed, where multiple agents learn how to accomplish a shared task collaboratively by maximizing a shared reward function. Each agent is trained with its own independent neural network, as shown in Fig. 4(a). In that approach, all agents can gain their own optimal policies for how to perform cooperatively, stored in neural networks, through training. It can be considered a bottom-up design process with slight requirements on top-down guidance, since designers only need to set state, action, and reward functions before the training.

In multiagent reinforcement learning (MARL), reward functions play a crucial role in encouraging exploration and providing a gradient in the learning process, which greatly influences training efficiency, learning speed, and the task performance of the trained systems. Although Ji identified the importance of rotation rewards in regulating agent behaviors [15], there is still a significant shortage of in-depth knowledge on the effects of reward functions and a lack of a systematic view of them. While system design relies highly on designers' instincts, experiences, observations, and prior knowledge [3], the lack of systematic understanding may lead to risks of system inefficiency and failure. As task complexity increases, several issues arise. First, designers face challenges in how to organize and design effective reward functions to accomplish the task requirements, especially when neural networks are involved in MARL algorithms like deep Q-learning [16]: the abstraction of training processes takes place in a black box, which makes it challenging to predict the outputs. That uncertainty incurs a great deal of cost. A controllable way of efficiently utilizing designers' knowledge is the key to a successful system design.
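To make the role of designer-specified reward functions concrete, the sketch below shows one plausible shape such a function could take for a box-assembly-style task. It is a minimal, hypothetical example: the environment quantities, weights, and shaping terms are assumptions chosen for illustration, not the reward design actually used in this thesis.

import numpy as np

def shaped_reward(agent_pos, box_pos, goal_pos, task_done,
                  w_dist=0.1, w_near=0.01, w_done=10.0):
    """Hypothetical shaped reward for a box-assembly-style task.

    A sparse completion bonus (the global task reward) is combined with
    dense distance-based terms (the shaping signal) that give the learner
    a gradient long before the task is finished.
    """
    box_to_goal = np.linalg.norm(np.asarray(box_pos) - np.asarray(goal_pos))
    agent_to_box = np.linalg.norm(np.asarray(agent_pos) - np.asarray(box_pos))
    reward = -w_dist * box_to_goal - w_near * agent_to_box   # dense shaping terms
    if task_done:
        reward += w_done                                     # sparse global task reward
    return reward

# Example: the box is 3 units from the goal and the task is not yet complete.
print(shaped_reward(agent_pos=(1.0, 0.0), box_pos=(3.0, 0.0), goal_pos=(0.0, 0.0), task_done=False))

Even a simple function like this encodes substantial designer knowledge, namely which quantities matter and how they trade off, which is exactly the kind of knowledge this research aims to organize and apply more systematically.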
To solve that issue, my research will investigate reward shaping (RS) design and its impact on learning and task performance. The objective of this study is to gain valuable insights into the underlying mechanisms of self-organizing teams from the reward function design perspective. By uncovering those mechanisms, the research is expected to expand our knowledge of how to effectively design and optimize self-organizing teams.

Furthermore, prior research has overlooked the crucial social aspect within groups, which limits the potential of a team. Without adequate consideration of the social structure and its influences, each individual agent relies solely on environmental information to learn and iterate its policy. This neglect of a team's social structure makes agents blind to others' status, plans, or influences on the task. What's more, the absence of a social structure in a group narrows agents' views and diminishes the amount of information individuals could have learned. Thus, it is worth investigating whether enhancing agents' abilities to observe others' choices of actions could benefit teamwork, and if so, in what specific ways. Therefore, my research aims to address this gap by emphasizing the importance of social structures and their impact on team behaviors. By incorporating the social ability shown in Fig. 4(b), we seek to enhance the learning teams' social abilities, such as observing the behaviors of one's team members, thereby deepening the understanding of AI team dynamics and effectiveness. As stated by Ashby [17], "To deal with the tasks that are more complex, the system must possess a sufficient level of complexity." We believe that social learning research is a necessary step toward solving more complex engineering tasks and enabling system design.

Figure 4. MARL structures: (a) individual learning; (b) social learning.

Finally, we prioritize addressing the challenge of high training costs within the context of multi-agent reinforcement learning. As tasks become more complex, problems such as long training times and the vast amount of computational resources caused by huge sample complexity become serious. Thus, the question arises: can the previously trained team knowledge embedded in neural networks be reused so that the new training time can be significantly reduced? To overcome this limitation, we will focus on addressing this overall question by investigating the process of transferring knowledge learned by a certain agent team to another in the same task domain but with a different team size. Furthermore, the effects of several influential factors, covering knowledge quality and the team's social ability, are delved into as well.

In total, the aim of our work is to contribute toward the design of robot teams with high-level intelligence and autonomy while minimizing the design cost and mitigating the limitations of the designer's prior knowledge. By bridging the gaps between humans and robots and fostering deeper bidirectional recognition, this research seeks to advance our understanding of how learning-based SOSs operate and our capabilities in designing effective collaborative systems.

1.2. Mechanical Engineering Design

Engineering design differs from other, more artistic forms of design in that it concentrates on engineering fields and is constrained by engineering-scientific factors [18], ranging from mechanical engineering to economic fields.
Design approaches applied in engineering design also differ from those in other design fields, since engineering design emphasizes logical and systematic practice to enhance understanding, good record-keeping, and traceability throughout the design process [18]. Mechanical engineering design has its own principles and uses mechanical design languages. That requires mechanical designers to have knowledge of mechanical design disciplines, such as mechanical parts and assemblies [19]. As automation develops, mechanical systems involve more components, such as electrical components, to fulfill higher functional requirements [19]. At the same time, various design methodologies and advanced tools have been developed and applied to assist mechanical design in creation, model analysis, and optimization [20]. Representative system design methodologies include general design, axiomatic design, and agile design, all of which play important roles in the mechanical design area.

General design theory (GDT) was proposed by Yoshikawa [21] to assist computer-aided design (CAD) in meeting growing innovative design requirements. General design theory operates on and defines design knowledge as mathematical sets [21] to cope with the phenomenon that design knowledge is becoming more intensive. It discusses the design process and solutions under two conditions: design in ideal knowledge and design in real knowledge [21]. Designers can obtain design solutions by updating and mapping among the Function space (in terms of physical phenomena), the Attribute space (measurement by physical rules), and the Metamodel space (which summarizes physical features), according to their own situations [21] [22]. It assists and optimizes the design process greatly, especially in CAD domains. However, designers still face the challenges of translating various design requirements into logical symbols and of expanding its applications to other domains.

Axiomatic design was proposed in [23] as a theoretical basis for system design in order to overcome the difficulties of designing a complex system. By defining a transformation matrix, a flow of mappings can be constructed from customer needs (CNs) to functional requirements (FRs) to design parameters (DPs) to process variables (PVs) [23]. That approach excels at centering on the customers, making the design solutions more practical and lowering failure risks. It guides designers in organizing a knowledge database and in coupling the different terms involved. However, potential interference among different design variables brings challenges in designing a complex product with a large number of factors. Also, this approach still relies highly on the designers' capabilities and knowledge.

Agile design is a friendly approach for implementation and cooperation in a real business environment for designers. It excels at stringing small projects into bigger, ongoing projects that fulfill separate requirements [24], emphasizing rapid, iterative releases of working products at the expense of complicated planning and design [24]. Though it is widely used by software developers, it does little to help an individual designer generate good solutions. Besides, this methodology focuses more on the design process than on looking for a good solution to a design problem.

As technology develops, mechanical engineering design gradually involves more complicated components and factors, no longer limited to mechanical parts. That enlarges the problem domains, and the systems become more complex and need to cooperate with other aspects.
For example, Amazon is working on developing robot teams to deliver and transfer packages in warehouses. The need for autonomous, stable, and intelligent systems calls for advanced methodologies and a deeper understanding of mechanical design. To stabilize systematic outputs, alleviate design difficulties, and stimulate more creativity from designers, artificial intelligence (AI) technologies are applied to assist engineering design. Designers take advantage of powerful computational abilities to build large knowledge representation systems, train virtual design "experts," and conduct innovative designing. Even though designing relies highly on designers' experiences and expertise, researchers are working on exploring design mechanisms to deal with more complex dynamical systems and higher design requirements.

1.3. Multiagent System and Self-organization

As technology develops, more complex applications in engineering fields have progressively emerged. This creates the need for systems composed of multiple agents in one environment [25] to implement complex tasks, thanks to their higher capabilities. For example, robot teams can be deployed to substitute for human workers in transporting packages in a warehouse. The teams are expected to work cooperatively and intelligently to implement multiple functions, including avoiding obstacles. A traffic network is another application example, used to coordinate vehicles at a busy intersection [26].

A system that contains more than one entity with the capability of independent action is called a Multiagent System (MAS). Several major characteristics of MAS have been summarized in [25]: (1) each agent has a local viewpoint rather than a global view; (2) all agents are under decentralized control; (3) the data is decentralized, and the computation is asynchronous [25]. Based on those features, MAS can apply distributed problem solvers (DPS) [26] [27] to construct solutions. To organize complex MAS, there are two big categories of approaches [28]: agent-based approaches, which embed organizational information at the agent level [28], and organization-centric approaches, which impart system information representation to a global controller [28]. They define different logics for multiagent organization, applied in different situations. While agent-based approaches emphasize that the organizational structures exist within the individual agent's state [26], organization-centric approaches address high-level issues related to how the group responds to problems.

However, as a larger number of elements get involved in environments and higher requirements are put forward for tasks, the complexity of the system's organization rises greatly. Therefore, how to organize, regulate, and optimize complex systems, and how to make the systems effective, robust, and highly capable, has attracted great attention from researchers and engineers. In a complex MAS, it becomes very challenging to define global organizational principles. Therefore, our research mainly focuses on applying the agent-based approach to explore the design mechanism in MAS. There are several advantages to agent-based approaches. First, they do not require a fixed organizational structure, which yields higher flexibility not only in designing solutions but also in wider engineering applications. Besides, since the organizational structure is captured in the individual agent's state, the system becomes more robust.
The system organization is released from relying on every member's fulfillment. What's more, the size of the agent team is easier to expand when the scale of the task grows.

The self-organizing approach is applied as an organization approach to tackle complex problems thanks to its own advantages. The concept of "self-organization" is defined as "a mechanism or process which enables a system to change its organization without explicit command during its execution time" [29]. It is widely used in many fields, such as multiagent systems [29], grid computing [30], sensor networks [31], aerospace science for allocating satellites [32], and robotics for organizing robot teams [33]. It has proved to be a useful discipline and practical philosophy for guiding a complex system's organization.

"Self-organization" emphasizes several outstanding characteristics of a self-organizing system (SOS). First, a self-organizing system is fully autonomous since there is no explicit external control [26], leading to high adaptation of the system. Meanwhile, an individual agent only has local interactions with its surroundings and has no need to access global information. That lets each individual keep a relatively simple form while still holding the capability to implement complex tasks. Furthermore, a self-organizing system is easy to scale up and evolve as the environment changes [29], since each agent works in a relatively independent way. That benefits the higher adaptability of the system. Therefore, it has drawn great attention from researchers for solving multiagent system problems, and many methods have followed. The cellular self-organizing approach was designed by Chiang and Jin [34], inspired by natural phenomena, to extend the design envelope in the face of more uncertain environments. Besides, a field-based behavior regulation (FBR) mechanism was proposed in [35] that designs a system by following the guidance of artificial fields. However, those methods still face the challenge of designing a proper field to guide effective learning. As artificial intelligence booms, the reinforcement learning approach has been applied to solve problems in multiagent systems, which makes the learning process more autonomous and intelligent rather than requiring designers to craft the system's rules manually [36]. However, there is still a lack of deep understanding of how artificial intelligence guides agents to learn and perform effectively with predictable outputs. Therefore, our research will focus on exploring the design mechanism in a MAS through a reinforcement learning approach, aiming to solve more complex problems and extend the understanding of learning.

1.4. Deep Reinforcement Learning Approach

Inspired by biological networks and human nervous systems, artificial neural networks (NN) have been brought into the artificial intelligence (AI) field as an important milestone. Artificial neural networks simulate computational units in a learning model treated as analogous to human neurons [37]. The technology has empowered solutions in tasks like speech recognition, visual object recognition, and object detection, and in many other domains, such as drug discovery and genomics [38]. As the century of data arrives, artificial neural networks have become the foundation of deep learning and have achieved a great deal of success due to specialized architectures for various domains, like recurrent neural networks and convolutional neural networks [38].
Deep learning is composed of multiple processing layers of neural networks that learn representations of data [38]. It captures multiple levels of abstraction and enables computers to build complex concepts out of simpler concepts [39]. By setting several visible layers and hidden layers, representation information is extracted and presented in a mathematical format, even though not all of it can be encoded as information understandable to humans. By updating the layers' parameters, the loss function is iteratively reduced, and the patterns hidden in the large dataset are captured and learned gradually. What's more, its abilities can be improved with incremental experience and data. Those features make deep learning a significant approach to solving complicated problems with increasing accuracy over time [39].

According to the different characteristics of input data and learning mechanisms, learning can be classified as supervised learning, unsupervised learning, and reinforcement learning (RL). Unlike supervised and unsupervised learning, which are trained on data provided by an outsider, reinforcement learning is an important technique for learning from experience and trials. RL takes inspiration from research into animal learning [40] combined with theoretical concepts from optimal control theory. RL learns what to do [41] based on the Markov decision process [42]. The learners must discover which actions yield the largest accumulated rewards by trying them [41] and maximize long-term profit in their interactions with the environment [43]. Trial-and-error search and delayed rewards are RL's two most distinctive features [41], which make it a very useful solution for self-organizing systems (SOS). What is needed from designers in a reinforcement learning algorithm is to define the task problem. Agents can master how to behave through continuous trials and improve their behaviors by receiving rewards or punishments toward the goal predefined by designers. By saving and updating information from a number of the latest trials, agents continuously grow to be more intelligent. These properties give RL its own application domains, such as autonomous vehicles, robotics, manufacturing, gaming, recommendation systems, and real-time financial strategies.

Some reinforcement learning algorithms use tabular storage for the buffer data, like Dynamic Programming (DP) [44], Monte Carlo (MC) methods [45], and temporal-difference learning [46]. They have advantages such as good efficiency and ease of model building. However, their capabilities are limited by the storage space of computers. By combining deep learning with reinforcement learning algorithms, as in deep Q-learning [16], and replacing tabular storage with neural networks, the learning abilities and potential have been unlocked greatly, since only the parameters of the NN layers need to be saved and kept updated as training goes on. Deep reinforcement learning (DRL) has expanded applications to high-dimensional problems and more complicated tasks.

Depending on the number of agents participating in one task environment, RL can be divided into two categories: single-agent reinforcement learning, which concerns one agent, and multiagent reinforcement learning (MARL), which involves more than one agent. In single-agent reinforcement learning, the agent searches for an optimal policy by interacting with the environment. The agent's action choice influences the environment, and the states of the agent and the environment are updated accordingly.
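For reference, the standard single-agent formalization can be sketched as follows; this is the textbook form of the interaction loop and of (deep) Q-learning, not the specific MASo-DQL formulation developed later in this thesis. At each time step t the agent observes state s_t, takes action a_t, receives reward r_{t+1}, and the environment moves to s_{t+1}. The agent seeks to maximize the expected discounted return

\[
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma < 1 .
\]

Tabular Q-learning updates its action-value estimate toward a bootstrapped target,

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big],
\]

while deep Q-learning replaces the table with a network \(Q(s, a; \theta)\) and minimizes the squared temporal-difference error against a periodically updated target network \(\theta^{-}\):

\[
L(\theta) = \mathbb{E}_{(s,a,r,s')} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big].
\]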
A policy defines the learning agent's way of behaving at a given time [41], and the way the environment updates is defined by a model. For some simple problems, the model is known to the agents and used for planning. Such methods, called model-based methods, can predict the resultant next state and reward given the current ones. On the other hand, problems for which no prior knowledge of the model exists can be solved by model-free methods, which are more practical and widely used. Among model-free methods, there are two kinds of algorithms: value-based algorithms, which learn the values of actions and select actions based on their estimated action values, and policy-based algorithms, which learn a parameterized policy that can select actions without consulting a value function.

MARL focuses on studying the behavior of a group of agents in a shared environment. Each agent is motivated by its own credits and aims to earn a maximum long-run reward. After adding more agents to one environment, the action and state spaces increase exponentially. Thus, inserting neural networks into MARL algorithms has been developed to break the limitation of storage and computational power, which is called multiagent deep reinforcement learning (MADRL) [47]. It has shown promising progress in terms of its applications and capabilities.

1.5. Research Questions

Facing the challenge of a lack of in-depth understanding of designing self-organizing systems in complex task domains, this research focuses on three critical issues in MARL-based system design: shaping reward design, social learning and its impact, and knowledge transfer among different teams. Therefore, we address three overarching research questions:

• Research question 1: What are effective shaping reward functions, and how do they impact the MARL learning process and the SOS team performance?
• Research question 2: What social learning can be devised into SOSs, and what are the costs and benefits of social learning for the agent team?
• Research question 3: Can the agent knowledge, learned based on reward shaping and social learning, be effectively transferred to a novice agent team of a different team size?

The forthcoming contents will introduce further questions and inquiries to delve deeper into each of them.

1.6. Overview

The subsequent sections of this thesis are organized as follows. Chapter 2 provides a comprehensive review of the related work. In Chapter 3, the methodology applied in this thesis is outlined and explained in detail. Chapter 4 is dedicated to the reward shaping field design and its impact on team behaviors. Afterward, Chapter 5 presents the modeling, conceptualization, problem formulation, and results of social learning in MARL. Chapter 6 focuses on the exploration of knowledge transfer studies among different teams. Finally, the conclusion is drawn in Chapter 7 by summarizing the findings, highlighting the contributions made by this research, and discussing future research directions.

Chapter 2. Related Work

2.1. Multiagent Reinforcement Learning

A multiagent system (MAS) [26] represents a group of interacting entities that act in a common environment [48]. Such systems possess greater capabilities than single-agent systems in implementing different tasks. Therefore, multiagent systems have been rapidly developed and applied in a variety of fields, such as aeronautics, traffic control, robotics, and game theory.
Nevertheless, as more agents participate in one task, the system complexity increases significantly, which leads to a high computational cost. Besides, system designers face challenges when it comes to pre-planning and programming agent behaviors in advance. In order to tackle the above issues, MARL has been proposed and applied in MAS to train agents to obtain optimal policies.

MARL is developed from single-agent RL algorithms to seek solutions in a MAS. In single-agent problem domains, Q-learning [49], as a foundational model-free algorithm, enables an individual agent to learn its own optimal policy from tabular state-action values. However, this method is constrained by storage space limitations. To overcome the challenge, a combination of neural network techniques and Q-learning has been introduced, known as the Deep Q-Network (DQN) [16]. It boosts the capabilities of multiagent algorithms greatly. Moreover, the deep deterministic policy gradient (DDPG) [50] combines neural network techniques with an actor-critic algorithm to approximate the deterministic policy and the action-value function Q, respectively. The policy network is responsible for inferring state-action pairs and providing the corresponding gradient, while the critic network assesses the quality of the action-value functions.

As systems evolve, the complexity of MAS grows considerably compared to that of single-agent scenarios. To address it, MARL extends the principles of the Markov decision process (MDP) and has been developed for applications in various scenarios, ranging from fully cooperative to fully competitive and even combinations of both [51]. MARL's learning process closely resembles human learning, as agents gain experience and seek optimal policies from trial-and-error interactions [52]. For environments that are only partially observable by agents, Lowe et al. developed the Multi-agent Deep Deterministic Policy Gradient (MADDPG) [53] approach, which uses centralized training with decentralized execution. In [54], the Decomposed Multi-agent Deep Deterministic Policy Gradient (DE-MADDPG) was proposed to tackle cooperative situations by maximizing both the global reward of the team and the individual local rewards. It is claimed that the novel method shows better performance in stability and adaptability. Cai et al. [55] target safety problems in the real world, such as collision avoidance. The authors propose combining MARL with decentralized Control Barrier Function (CBF) shields based on available local information, named MADDPG-CBF.

The Team Q-learning algorithm [56] is a method for purely cooperative tasks in dynamic environments. It assumes that all agents are capable of knowing the others' actions before taking their own, and that the optimal joint action is unique [56]. This method avoids the coordination problem by letting all the agents learn the common Q-function in parallel. However, the extensive learning space makes the algorithm burdensome, or even worthless, in computing cost. The Distributed Q-learning algorithm [57] aims at solving cooperative tasks and assumes that each agent has no information about the others. It has been proved that the optimal policies of deterministic environments can be found. However, the algorithm cannot be applied in stochastic environments.

In fully competitive tasks, the minimax-Q algorithm [58] explores the Markov game formalism as a mathematical framework and employs the minimax principle for reasoning about multi-agent environments.
A temporal-difference rule like Q-learning is applied for value calculating across state-action pairs. It has been tested in several competitive games such as checkers, tic-tac- toe, backgammon, and Go. However, it is restricted to two-player Markov games. In order to enrich the useable information during the training process, many studies have explored incorporating communication among agent teams. For instance, Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL) [59] are proposed as learning-based communication. CommNet [60] improves the communication protocol between agents from previously unaltered to adaptive. A simple neural network model is applied to map the inputs of agents to their actions. Each agent shares its own information by occupying a subset of channels. However, the approach has simplified the communication model and may suffer a sparsity problem. While the above approaches make valuable contributions, it is evident that they are primarily effective for solving simpler tasks in MAS, primarily due to a lack of comprehensive understanding of multiagent organizations. Therefore, this thesis will specifically target cooperative tasks with high levels of complexity. By conducting experimental studies, we aim to attain a profound understanding of the underlying mechanism for self-organizing systems. 2.2. Heuristic Knowledge in MARL While the MARL approaches have demonstrated effectiveness in dealing with self- organizing problems, it still suffers a certain level of uncertainty and randomness, leading to challenges in achieving high-quality training outcomes. This is primarily due to the reliance on gradients with the reward functions. However, the lack of prior knowledge about the environment 22 makes reward functions difficult to be designed, especially in sparse reward environments. In order to mitigate the issue, this section will present applicable knowledge that can assist MARL training and aid agent teams. Figure 5. Heuristic knowledge in MARL. As presented in Fig. 5, various sources of knowledge in reinforcement learning can be classified into two main sources, which are heuristic knowledge and online learning [61]. Heuristic knowledge means prior information obtained ahead of training, which can be from designers or offline learning, while online knowledge refers to information obtained during the learning process, such as strategies, policies, or decision-making based on real-time data [61]. When utilizing the knowledge, the challenge lies in effectively organizing and incorporating knowledge into the system. Various techniques aim to address this issue by integrating external knowledge into RL algorithms. For instance, some knowledge can be gained from offline learning and saved in sample batch [62], learned black-box policy [63] or demonstrations [64]. Those can be applied to accelerate the learning speed or mitigate the learning difficulty through transfer learning etc. Policy transfer [65] is an important technique to “borrow” previous experiences to current tasks, which can be implemented as transfer learning [65] and is especially effective 23 applied to similar tasks. However, when tasks suffer big differences, the capabilities of transfer learning are limited. Besides, the application of transfer learning in MARL is difficult since the mechanism of knowledge allocation turns complicated as the team size increases. 
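To make the policy-transfer idea above concrete, the following is a minimal sketch of warm-starting a new agent's Q-network from a previously trained one, assuming a PyTorch setup; the network class, dimensions, file name, and layer-freezing choice are illustrative assumptions rather than the setup used in the cited works.

import torch
import torch.nn as nn

class QNet(nn.Module):
    # A small Q-network; the layer sizes here are illustrative only.
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.layers(x)

# Policy transfer as a warm start: copy weights trained on a source task into
# the learner for the target task, then continue training on the new task.
source_net = QNet(state_dim=8, n_actions=4)
# In practice the source weights would be loaded from disk, e.g.
# source_net.load_state_dict(torch.load("source_task_qnet.pt"))
target_net = QNet(state_dim=8, n_actions=4)
target_net.load_state_dict(source_net.state_dict())

# Optionally freeze the early layers and fine-tune only the output layer,
# one common way to limit negative transfer between dissimilar tasks.
for param in target_net.layers[0].parameters():
    param.requires_grad = False

The warm start is most useful when source and target tasks are similar; when they differ substantially, the copied weights may have to be partially discarded, which reflects the limitation noted above.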
Another way of utilizing external information is to store it in batch samples [66] that provide offline advice. The knowledge or experiences can be saved as tuples of the form ⟨s, a, s′, r⟩ [66], which can be freely stored and queried by the team while it conducts tasks. Moreover, demonstration [67] is also a valid way to save knowledge in the form of policy sequences. By querying the experience tuples, agents can obtain guidance from other experts. However, this requires very high compatibility between the two cases, which limits how the knowledge can be applied, and it also depends strongly on task similarity. Nevertheless, it is effective for initial passive learning [68] [69].

It is reasonable to assume that system designers hold a substantial amount of expert knowledge about their tasks, since they are required to design the state, action, and reward functions of reinforcement learning algorithms [70]. This designer knowledge carries great potential to be exploited, and how to employ it effectively has therefore attracted considerable attention from researchers. The external knowledge from designers can be categorized along several aspects, shown in Fig. 5, which makes it easier to organize: 1) state-based knowledge, 2) action-based knowledge [71], 3) task-based knowledge, and 4) goal-based knowledge. Instances of each category are given below.
• State-based knowledge: what the state space is; what physical meaning each state represents; which states are dominant among all; …
• Action-based knowledge: what the action space is; what the relationship between agents' execution ability and actions is; …
• Task-based knowledge: which signals/variables are detectable and measurable; which signals/variables are crucial to the task or relevant to the task's success; …
• Goal-based knowledge: what the task's goal is; how to describe the goal; which signals/variables can fully define the goal; whether those signals/variables can be detected by the system; how to represent the system's preference using detectable signals/variables; …
If designers can find a proper way to encode such heuristic knowledge into the training process, learning ability and learning performance can be improved greatly. Research has shown that heuristic knowledge can be readily identified by system designers using reasoning in many domains [71]. Therefore, this thesis focuses on taking advantage of heuristic knowledge from designers to facilitate reinforcement learning through reward function design.

2.3. Reward Function and Shaping Reward Design
When applying MARL in self-organizing systems to solve practical engineering tasks, the goal is to find an optimal policy for each agent mapping states to actions. Each individual agent follows its policy to cooperate and behave in a team so as to achieve the team's goal. The set of action policies collectively maximizes the return defined by the reward function, which provides a numerical score to offer a gradient for the algorithm's iteration, guide agents' learning, and encourage exploration.
Reward functions usually include local rewards and global rewards [72]. While local rewards regulate the partial system that an agent can directly observe, global rewards indicate whether the outcome of the agents' performance is consistent with what the task requires overall. One major challenge for both RL and MARL is how to design a proper reward function.
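As a simple illustration of the local/global distinction above, the sketch below blends a locally observed term with a globally broadcast term into the scalar reward an agent receives; the equal default weights and the example terms are assumptions for illustration only.

def combined_reward(local_term: float, global_term: float,
                    w_local: float = 1.0, w_global: float = 1.0) -> float:
    """Blend a locally observable reward with a shared global reward.

    local_term:  computed from what the individual agent can sense
                 (e.g., a penalty on its own rotation behavior).
    global_term: computed from task-level signals broadcast to all agents
                 (e.g., goal reached, collision, distance progress).
    """
    return w_local * local_term + w_global * global_term

# Example: every agent receives the same global signal plus its own local one.
global_term = 0.1 * 5.0  # illustrative progress-toward-goal term
rewards = [combined_reward(lt, global_term) for lt in (-0.2, 0.0, 0.3)]

How the two terms are weighted is itself a design decision, which is one reason reward design remains a major challenge.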
To solve complex tasks, many techniques have been proposed to optimize the reward structure and enrich the information available in the system. Wu et al. attempt to solve the "lazy agent" problem that occurs in cooperative tasks [73] by introducing an individual intrinsic reward algorithm. An intrinsic reward encoder generates a different individual intrinsic reward for each agent, and a decoder built on hyper-networks helps estimate the individual action values of the decomposition method. However, the approach still requires substantial heuristic knowledge from outside to design the Q-values for the task at hand, which may limit its applications. Mao et al. study the reward design problem in cooperative MARL based on packet routing environments [74], focusing on local and global rewards. Both reward signals are shown to produce suboptimal policies. Inspired by this observation, mixed reward signals are designed for better policy learning, and the authors further turn the mixed reward signals into adaptive ones. However, the weighting of the different reward terms, which influences learning performance, is not discussed in the paper.

As task complexity rises, the requirements on information transfer become higher. Local and global rewards may be insufficient to provide enough guidance for agents in some cases, such as high-precision assembly tasks involving complicated geometrical and dimensional relationships [75]. Thus, reward shaping (RS) is deemed an important method for augmenting reward functions by incorporating additional information into the system [76]. Moreover, agent teams benefit from the more detailed guidance, leading to a higher level of intelligence, better performance output, and faster learning.

There are two main methods of RS, i.e., potential-based reward shaping (PBRS) [76] and difference reward shaping (DiRS) [77]. PBRS is an effective way to apply the designer's heuristics to inform agents of the system's preference. It expresses a preference for agents to reach particular states by constructing a potential field Φ. The field function Φ can be defined over different terms, including the state s, the action a, or the state-action pair (s, a), depending on the algorithm. Generally, it can be presented as:

$PBRS = \gamma \Phi(s', a') - \Phi(s, a)$    (1)

where γ is a discount factor and Φ is a field function. Wiewiora et al. propose two forms of PBRS, look-ahead advice and look-back advice [78]; the two methods shape rewards targeting different state-action pairs and are recommended for use under different conditions [78]. Plan-based reward shaping has been proposed and applied in both single-agent [79] and multiagent reinforcement learning [80]. This method applies a reasoning technique to search for a path from the initial state to the goal state, and the resulting trajectory of states is used to define a potential field. Badnava et al. present a potential-based reward shaping method that accelerates learning by letting the reward function change at each step, pushing agents to make frequent progress on the task [81]. Brys et al. propose a method that uses expert demonstrations to speed up learning by biasing exploration, shaping the reward according to the similarity between state-action pairs [82]. Mannion et al. [70] discuss the theoretical implications of applying these shaping approaches to cooperative multi-objective MARL problems and evaluate their efficacy in two benchmark domains.
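The following minimal sketch illustrates the potential-based shaping of equation (1), here in its state-based variant: a designer-defined potential over states is turned into a shaping term that is added to the environment reward. The distance-to-goal potential and the state layout are placeholder assumptions; any state-dependent potential could be substituted.

import math

def potential(state) -> float:
    # Placeholder potential: higher (less negative) when the controlled
    # object is closer to the goal. `state` is assumed to expose a position
    # and a goal position; this layout is illustrative only.
    (x, y), (gx, gy) = state["pos"], state["goal"]
    return -math.hypot(gx - x, gy - y)

def shaped_reward(env_reward: float, state, next_state, gamma: float = 0.99) -> float:
    # Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s).
    # For state-based potentials, adding F changes the learning gradient
    # without changing which policies are optimal.
    shaping = gamma * potential(next_state) - potential(state)
    return env_reward + shaping

s = {"pos": (150, 180), "goal": (950, 500)}
s_next = {"pos": (200, 220), "goal": (950, 500)}
print(shaped_reward(0.0, s, s_next))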
DiRS refers to a shaped reward signal that helps an agent learn the consequences of its own actions on the system after removing the other agents' influences or noise [83], as shown below:

$D_i(z) = G(z) - G(z_{-i})$    (2)

where z is a general term for states or state-action pairs, G(z) is the return of all agents, and G(z_{-i}) is the return without the contribution of agent i. DiRS focuses on quantifying each agent's contribution to the system, but its applications are restricted to certain problem domains, and it has been demonstrated to work only in cooperative tasks [84]. For a team of homogeneous agents, DiRS has limited power to stimulate individual agents' learning and is ineffective at improving learning speed, since each agent's contribution may be minor compared with the team's, making G(z) very close to G(z_{-i}).

The above two approaches are capable of boosting the performance of MARL. However, many challenges remain for designers applying them to practical problems. One issue is that RS often requires domain-specific heuristic knowledge [81] [85], which is hard to obtain directly; these requirements can be particularly demanding when dealing with unknown, dynamic, and complex tasks. Another notable issue is the limited research on determining the most effective forms for embedding the information; previous applications of RS have only been implemented in relatively simple problem domains. These limitations hinder RS's potential and applications. Hence, this thesis focuses on deepening our understanding of how reward shaping signals and different forms of reward shaping fields may impact the training process of agent teams, thereby leading to different task performance. This research aims to facilitate the transformation of the designer's heuristic knowledge into effective formats that agents can comprehend and utilize.

2.4. Complexity Measurement
Systems that consist of interrelated parts are, to some extent, complex. System complexity is "a matter of the quantity and variety of its constituent elements and of the interrelational elaborateness of their organizational and operational make-up" [86], which brings powerful architectures but also difficult challenges. Understanding complexity helps one first see the features of a system [86]. However, the notion varies with the viewpoints of different fields. In physics, for example, physicists emphasize "standards" to define "task complexity" [86]. In computer science, code complexity is especially stressed, such as the length of a "self-delimiting code" (Huffman or Shannon-Fano codes). The work in [87] focuses on the effect of task characteristics on human performance in database query tasks. Most of the time, the definition and the study of its effects need to be specific to the task type. In MARL research and design, as experimental approaches play increasingly essential roles, a quantitative measure of task complexity needs to be defined and understood. This benefits the exploration of complex systems and the steps that follow, such as the design of the solver. Considering different task characteristics, the modes of task complexity and the corresponding salient factors may vary.

In [88], metrics are presented for the complexity measurement of MAS. The complexity is divided into "agent-level complexity" and "system-level complexity." Agent-level complexity includes two orthogonal criteria, the agent's structure complexity and the agent's behavior complexity [88].
Agent’s structure complexity describes how the agent’s internal structure influences the system’s complexity, including “The size of the agent’s structure (SAS),” “The agent’s structure granularity (ASG)” and “The dynamicity of the agent’s structure (DAS)” [88]. Agent’s behavior complexity describes how different agents’ behaviors that are implemented in the environment influence goal achieving, including “The behavioral size of the agent (BAS),” “The average complexity of behaviors,” and “The average number of scheduled behaviors (ANSB).” [88] On the other hand, the complexity of the global structure of MAS at the system level covers “social structure complexity” and “interactional complexity.” “Social structure complexity” considers all elements in the fields like agents and objects, including “The heterogeneity of agent (HA),” “The Heterogeneity of the Environment’s Objects (HEO),” and “The Size of the Agent’s Population (SAP).” [88] The interactional complexity considers the interactions between agents. It can be measured by “The Rate of Interaction’s Code (RIC),” “The Average of Exchanged Messages per Agent (AEMA)” and “The Rate of the Environment’s Accessibility (REA).” [88] The metrics consider factors at both complexity levels of the agent and system. Therefore, it is not applicable for cases that focus only on task complexity. Also, its application is limited in the software-based MAS [88]. In [89], authors construct a metric to measure collision avoidance task complexity at sea by measuring internal complexity and external complexity. Internal complexity means how difficult the own ship can be maneuvered [89], while external complexity represents how difficult the encounter situation can be to avoid collision [89]. Besides, a complexity hierarchy is built to assist the situation analysis and complexity measurement. Although this method is designed for a single- agent system, it differentiates the task and solution complexities. What’s more, obstacles in the 30 fields are considered for collision avoidance requirements; those are valuable as references to develop complexity measurement in MAS. In [90], the authors introduce a conceptual framework for MAS architecture that can be used for the quality evaluation of a MAS at the design level. A MAS conceptual architecture is proposed from agent-level, collaborative-level, and architecture-level perspectives [90]. The three perspectives land roles in assessing the MAS architecture complexity, providing a comprehensive understanding to assist an effective MAS design. Besides the theoretical framework, the architectural complexity of MAS is comprised and discussed. However, the measurement is highly dependent on the designers’ knowledge of the system, like events, which may not be known in some situations, and hence not applicable to RL-based solutions. In [91], the author proposes a reference frame that assesses agent-based simulations targeting characterizing and differentiating various simulation models. It is claimed that building metrics have the advantage of being objective and exact [91], which is crucial to capture properties and comparing results in a flexible design. The metrics take in system-level, agent-level, and interaction-related level perspectives. However, they are designed for software engineering, which puts more focus on software complexity. In [92], a graph-based model named Task Assignment Graph (TAG) is proposed for the complexity analysis of task assignments from a human perspective in a human/multi-robot team. 
It takes advantage of a graph to define task maps and capture spatial relationships between tasks and robots. However, the model analyzes the relationships among different sub-tasks and requires humans to act as operators who assign tasks in advance; it emphasizes task assignment in the human-robot interaction domain.

In [93], task complexity is viewed from two perspectives, objective and subjective. It states that "the objective task complexity considers task complexity to be related directly to task characteristics and independent of task performers," while the subjective perspective considers task complexity as a conjunct property of task and task performer characteristics. To facilitate a more accurate estimation of task complexity, numerous task models have been proposed. These models serve as a valuable framework for quantifying and analyzing the intricate nature of tasks. In engineering fields, much work has been done since the pioneering work of Wood [94], who introduced a foundational task model that served as a basis for later work. Some of these models are summarized in Table 1.

Table 1. Task model summary.
Task Model: Components
Wood (1986) [94]: Products / Acts / Information cues
Farina and Wheaton (1971) [95]: Goal / Input stimulus / Procedures / Response / Stimulus-response relationship
Li and Belkin (2008) [96]: Generic facets / Common attributes
Ham et al. (2012) [97]: Functional aspect / Behavior aspect / Structural aspect
Liu and Li (2012) [98]: Goal/Output / Input / Process / Presentation / Time

Although the above models are constructed for broad application and contain general facets, there is still no task model built for construction tasks solved by reinforcement learning approaches, which are more concerned with the factors that influence effective learning.

2.5. Social Learning in MARL
Humans gain remarkable capabilities through social interactions with other people [99], a phenomenon observed in many intelligent groups such as human societies and animal groups. It is formalized as social learning theory by Albert Bandura [100], which emphasizes the importance of observing, modeling, and imitating the behaviors, attitudes, and emotional reactions of others. This learning ability inspires designers to understand how multiagent systems behave and to explore whether social learning in artificial agent groups can raise their level of intelligence.

In RL-based agent systems, the design of the reward function is the key to pre-defining the group dynamics. Hence, researchers have devoted themselves to developing social reward terms that facilitate team coordination and incentivize cooperative behaviors. In [101], the authors propose a unified mechanism that assumes that actions leading to bigger changes in others' behaviors are preferable for organizing teamwork; based on this idea, higher rewards are given to actions with greater influence. The experiments demonstrate that social rewards can enhance team coordination and lead to more meaningful communication protocols. However, the training risks failure, because the social influences are not evaluated by their effects on task performance; the team may be misled by a few bad influencers, and the resulting deviations are hard for the team to correct by itself. Sequeira et al. introduce an additional reward signal to guide agents to tactically explore the most "enjoyable" states and to participate in exploration [102].
It motivates novice agents to perform as team players rather than “selfish” individuals in new tasks. The results show that the method can lead to improved population fitness in a two-agent environment in food-consuming scenarios. However, it suffers difficulties in designing suitable reward functions and parameterizing the team's social mechanism. Since learning success relies highly on a proper design of the reward functions, this method shows applications' limitations, especially when more agents are engaged. 33 What’s more, how to balance the individual’s and team’s interests is a sharp issue when designing additional reward signals. In [103], the authors emphasize behavior improvement by adding an auxiliary loss. The auxiliary loss is defined as encouraging novice agents to learn from “experts.” Through shrinking the loss, novice agents can optimize their optimal policies to be closer to the experts’ state trajectories. Besides, the team is guided by state demonstrations rather than forcing the members to copy the expert's actions, which is more practical than pure imitation. The architecture has been approved to improve the agents’ behaviors in a new environment and can successfully follow the expert’s state trajectory. Borsa et al. show the potential of pure observations and captures the social behavior pattern in an RL-based agent group [104]. The authors demonstrate that RL agents can leverage the observed information and output better teamwork in a team with a “teacher.” Novice agents are capable of observing the state trajectory of the “teacher” and are encouraged to follow. Without explicit modeling, novice agents can learn from mapping the teacher’s state to their own states. The positive effect of pure observation is proved in a series of simple navigation tasks with two-agent teams. The results have shown that observing “teachers” can improve task performance and decrease the training cost compared with other approaches to learning from an expert. However, the work assumes that the dynamics matrix is able to be factorized, which is hard to achieve in real practices. Besides, in a two-agent team with one teacher, it is actually not involved the teamwork. The case would be different when multiagent team members are engaged in different roles. Either learning from a “teacher” or optimal state trajectories limits the application in real cases. In most engineering tasks, those resources are unavailable. We desire that a novice agent team could master a good teamwork strategy without any external aids. 34 The authors in [105] address individual learning with consideration of others’ rewards. In training, each individual agent is pre-trained with a population of others in the same task, which is seen as prosocial agents. In that way, a team with prosocial agents can output better teamwork. However, pro-sociality carries a cost of equilibria and variance issues. In [106], agents are trained with a given shared objective reward function in a cooperative environment. However, the results are restricted in two-agent settings, which is relatively easier to coordinate the individual and group rewards. Nevertheless, it can be seen that unavoidable drawbacks are introduced when social rewards are applied to assist agent groups in mastering strategies. Firstly, handcrafting a social reward is hard, especially defining social rules ahead of the training. 
But minor approximation error scans can be easily amplified when planning on these faulty signals [107] , which leads to a bigger risk of task failures. Secondly, learning to maximize their own reward may fall into a pitfall that conflicts with group interests [107]. When both individual and social interests are designed in one reward function, how to balance them is challenging, but agents are more inclined to self-interests. Additionally, appending social abilities in a team means higher computational costs, wider communication bandwidth, and more sensory effort. Therefore, investigating the social mechanisms in MAS is demanding. Knowing the agent teams’ social behavior codes could benefit the system design to be free and efficient. Based on the above, this thesis aims to deepen the understanding of social mechanisms and reveal the emergent group behavior codes without pre-designing the social mechanism or introducing external “teachers.” We are trying to interpret the natural characteristics of the RL-based agent teams when they utilize neural networks as function approximators and the effects of their social ability. It is hoped to be fundamental for furthering social learning studies in complicated engineering task domains. 35 2.6. Knowledge Transfer Learning The idea of knowledge transfer has been studied widely. Many researchers have contributed to developing algorithms to assist the agent in learning fast and mastering more complex tasks. In [108], the authors propose REPrepresentation and REPAINT algorithms for knowledge transfer in deep RL to accelerate learning processes. The algorithm not only employs the pre-trained policy but also utilizes an advantage-based experience selection method. The idea of selecting samples based on their relatedness help an agent to choose useful samples and conduct more effective learning. In [109] authors propose Actor-Mimic, which is a transfer RL approach to mimic expert decisions for multi-task learning. In [89], the authors have studied the transfer of knowledge from simple tasks to relatively complex tasks and applied the learned knowledge as a “teacher” to lead a novice agent to become more capable. It is found that ‘copy expert’ can help increase the learning efficiency for high similarity tasks but is not efficient for more complex tasks and with low similarity. However, the works focus on single-agent learning. As one more agent is engaged in the task, the situations of knowledge transfer become more complicated. Some research states that two tasks can be considered to be similar when they share the same state-action space [110] [111]. Glatt et al. found that the similarity between tasks plays an important role in the success of knowledge transfer in applications [112]. It can improve the learning speed for a new task. In MARL, the knowledge is embedded in a cluster of value functions or neural networks. An individual’s policy is not only relevant to the task environment but also depends on other agents’ actions. In [113], the authors focus on MARL knowledge transfer and let novice agents to learning from the environment and teachers parallelly. The method is flexible on the network structure. However, it assumes that well-trained policy has high quality. In complex engineering tasks, it is 36 hard to have a good “teacher” that already masters the task domain knowledge well. In [114], authors adopt game theory-based MARL and conduct value function transferring in the learning process. 
However, the method relies on agents with single-agent knowledge (e.g., local value functions). Moreover, the complexity of the knowledge transfer process in MARL and the latent cooperation mechanisms among teams make a successful transfer challenging. Therefore, an in-depth understanding of knowledge transfer mechanisms among different teams becomes crucial for complex tasks. Hence, this thesis will concentrate on exploring knowledge transfer among teams of varying sizes. Additionally, the effects of knowledge quality and the team's social ability on the transfer process will be examined. By delving into these aspects, we aim to gain deeper insights into the dynamics of knowledge transfer and its implications for different teams. The findings are expected to inform the next steps of knowledge transfer in different task domains.

Chapter 3. Methodology
3.1. Reinforcement Learning
Reinforcement learning (RL) is a paradigm in which an agent learns from trials in the process of maximizing its reward. In a finite MDP, a tuple ⟨S, A, P, R, γ⟩ describes the interaction between an agent and an environment over a sequence of discrete time steps [41]. At each time step t, the agent selects an action $A_t \in A$ based on some representation of the environment's state $S_t \in S$ that it receives [41]. The dynamics of the MDP are defined by a transition matrix P that maps the state and action to the next step's state and reward with a certain probability. R defines a reward function that specifies the numerical reward agents receive based on their states and actions, and γ is a discount factor used for calculating the return, $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$, where t is the current timestep and T is the timestep at which an episode ends. An agent receives an immediate reward $R_t$ at each time step; its goal is to maximize the total reward it receives, i.e., the cumulative reward in the long run [41]. Solving an MDP means finding a policy π that maximizes the accumulated reward [71]. Q-learning [115] is a popular RL algorithm that updates the action values based on the temporal difference of value estimates:

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$    (3)

Deep Q-learning [16] [116] has been developed in recent years to replace the Q-table with a Q-network with weights $\theta_i$, so that complex situations can be processed in an end-to-end way. The Q-network updates its weights $\theta_i$ at each iteration i by minimizing the loss function:

$L_i(\theta_i) = \mathbb{E}\left[ \left( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right)^2 \right]$    (4)

3.2. Multiagent Reinforcement Learning
While single-agent RL is concerned with one agent, multiagent reinforcement learning (MARL) addresses the RL of multiple agents in a shared environment. In MARL, joint action learners and multiple individual learners are two common approaches to solving the problem [117]. Joint action learners apply algorithms specific to multiagent systems (MAS) to learn joint actions. However, they suffer an exponential increase in computational resources, since their value functions consider all possible combinations of actions by all agents [118]. In the case of multiple individual learners, each agent of the system runs its own single-agent reinforcement learning algorithm [115], and the system actions form the joint set $A = A^1 \times \cdots \times A^n$ [119], where n is the number of agents. The joint policy depends on all agents' learned policies, $\Pi_S \sim \Pi_S^1 \times \Pi_S^2 \times \cdots \times \Pi_S^N$. The algorithm chosen for training is deep Q-learning, as introduced in the last section. In cooperative tasks, all agents share a team reward function for a common goal but train separate neural networks to behave and learn their own roles in the teamwork [15].
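As a minimal illustration of the multiple-individual-learners setting described above, the sketch below gives each agent its own Q-network and forms the joint action from per-agent ε-greedy choices; the network sizes, team size, and state dimensions are illustrative assumptions.

import random
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    # One independent Q-network per agent; the layer sizes are illustrative.
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def joint_epsilon_greedy(nets, states, epsilon: float, n_actions: int):
    """Each agent picks its own action; the tuple of choices is the joint action."""
    actions = []
    for net, s in zip(nets, states):
        if random.random() < epsilon:
            actions.append(random.randrange(n_actions))
        else:
            with torch.no_grad():
                q = net(torch.as_tensor(s, dtype=torch.float32))
                actions.append(int(q.argmax()))
    return tuple(actions)

# Four homogeneous agents, each with its own network but trained against
# the shared team reward, as in the cooperative setting described above.
nets = [AgentQNet(state_dim=16, n_actions=6) for _ in range(4)]
states = [[0.0] * 16 for _ in range(4)]
joint_action = joint_epsilon_greedy(nets, states, epsilon=0.1, n_actions=6)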
My work takes a multi-individual learner's approach and considers cooperative task domains. For such MARL tasks, the goal is to maximize the accumulated reward resulting from all agents' actions over time. The agents' separate neural networks capture behavior codes on how to benefit the team and impart the agents with the intelligence to interact with a dynamic environment influenced by others. The trained neural networks can be reused by different teams of the same size, or even of different sizes, in similar tasks if the agents are homogeneous [3]. The pseudo-code for multiagent deep Q-learning is shown in Table 2.

Table 2. Pseudo code of Multiagent Deep Q-learning (MA-DQL).
1. Initialize number of agents N and number of episodes M
2. for episode = 1 to M do
3.   Initialize environment E, replay buffer Z, and state s
4.   Initialize network θ and target network θ_target
5.   Set hyperparameters ε, target network update frequency f, minibatch size J, total timesteps T
6.   for timestep t = 1 to T do
7.     Get joint actions a ← Q(s, a | θ) based on an ε-greedy policy
8.     Execute joint actions a to get new state s′ and reward r
9.     Save ⟨s, a, r, s′⟩ in replay buffer Z
10.    for agent i = 1 to N do
11.      if t mod 5 == 0 then
12.        Sample minibatch ⟨s^j, a^j, r^j, s′^j⟩ from replay buffer Z
13.        if t mod f == 0: replace θ_target^i with θ_t^i
14.        Compute y = r_t^i + γ max_{a′} Q(s′^i, a′ | θ_target^i)
15.        Compute loss function L_t^i(θ_t^i) = E[(y − Q(s_t^i, a | θ_t^i))^2]
16.        Update θ_t^i based on gradient descent of the loss function
17.      s^i ← s′^i
18.    t ← t + 1
19. end

3.3. MASo-DQL: A Framework of Social Learning in MARL
In order to overcome the shortcomings of previous models [120], a novel framework, named Multiagent Social Deep Q-learning (MASo-DQL), is introduced in this section for social learning studies in MARL. It is a new modeling architecture that injects social learning into multiagent deep Q-learning to impart social abilities to agents. The architecture is compatible with RL algorithms other than deep Q-learning, which retains flexibility across different cases. Besides, the model is flexible with respect to various organizations and different levels of social ability, and it can scale up the team size easily without changing the architecture, so the computational complexity in terms of time and space can be maintained.

MASo-DQL involves a novel parameter, named K, to represent social rules. It defines how agents interact with each other. Each agent is allowed to hold its own social rules, denoted $K^i$ for the i-th agent in a team; thus, the overall social relations formed by all agents are represented as $K \sim K^1 \times \cdots \times K^N$. The joint action is set the same as in the previous section, $A \sim A^1 \times \cdots \times A^N$. To store the social information temporarily, a social information batch is defined as O, with each agent's social information denoted $O^i$. The social information is updated at every step of the training.

In individual learning, agents operate in isolation, ignoring the presence of other agents, and rely only on environmental information for learning.
From an individual’s perspective, each agent only 41 receives task-related signals, such as features of environments, task progress, and objects in the field. We term this perspective as “Task Views” to classify the task-related information received by a robot. In contrast to individual learning, social learning encourages team members to learn from internal interactions. It enables each agent to learn not only from task-related information but also from signals emitted by other team members. This aligns well with real-world scenarios, such as the box-push task implemented by a group of robots. In this context, each robot is equipped with sensors capable of transmitting and receiving signals. This facilitates the integration of social information into the training process. Thus, to categorize the sources of information related to team members, we define “Social Views” as shown in Fig. 6. These Social Views encompass various social aspects, including the agents' IDs, names, plans, neighbors, etc. Robots with Social Views can actively engage in social interactions. Besides, robots can possess both task views and social views simultaneously in training, which learn from richer information, such as team members’ positions in the environment. The specific information varies due to different situations and hardware features such as sensor range. 42 Figure 6. Task view and social views of a robot. When agents possess social views, broader information and signals can be captured to feed in MASo-DQL for training. However, it is crucial to consider that the quantity and representation of social information may yield different effects on learning behaviors. To address this issue, the social capabilities are categorized and visualized in a distributed map based on the levels of social abilities among agent teams within the context of assembly tasks. The categorization of the social levels is depicted in Fig. 7. Figure 7. Distributed map of social levels. Social levels assess how much social information obtained from social views is recognized by an agent. It starts from the 0-level referring to “No social,” meaning that an agent has no social view, to the “complete social,” meaning that an individual has abilities to know “all” of social information. In this thesis, the social capabilities are characterized and ranked in seven levels for current and future studies, which are proposed below. • S0: No Social (NS) – No social abilities. • S1: Observe Neighbors (ON) – Agent can observe its neighboring areas within a pre- defined range and receive signals that are transmitted within that range. 43 • S2: Observe Neighbor Mapping (ONM) – Agent can observe its neighboring areas within a pre-defined range, receive the transmitted signals, meanwhile differentiate the senders’ identities. • S3: Observe Neighbor Selected (ONS) – Agent can selectively observe its neighboring areas based on its individual interests. • S4: Observe Neighbor Mapping Selected (ONMS) – Agent can selectively observe its neighboring areas based on its individual interests meanwhile differentiate the senders’ identities. • S5: Know Policies (KP) – Agent can know other agents’ policies. • S6: Know Policy Mapping (KPM) – Agent can know the mapping between agents and their respective policies. • S7: Know Policy Selected (KPS) – Agent can selectively know the mapping between agents and their respective policies based on their individual interests. (a) S0-NS (b) S1-ON (c) S2-ONM (d) S3-ONS (e) S4-ONMS Figure 8. Social level examples. 44 Fig. 
8 presents how the different social levels are engaged in the assembly task, where the black robot icon is the "own" robot. Check and cross marks show its observational ranges, and colored robot icons mark identified team members. Fig. 8-(a) shows the "No Social" scenario, in which agents only have a task view and cannot observe other members; all areas around the box are blind to the "own" robot. Fig. 8-(b) presents a scenario with the social ability to observe neighbors, S1-ON. This allows the agent to acquire interactive information about its neighbors within the designated ranges. In the example shown in the figure, the robot can observe its adjacent left and right areas and know whether those areas are affected by any other actions, but it does not know which specific robots are responsible for the actions; this kind of awareness is indicated by the check marks. Fig. 8-(c) presents an example of a team with S2-ONM, observing neighbor mapping: the agent can gather information about its surroundings and also identify the specific agents involved in the signal transmission. The agent's identity can be represented by an ID or a name (shown as colored robot icons). As the social level increases, Fig. 8-(d) shows that the agent's observational scope is shaped by its specific interests and priorities; for instance, the agent may prioritize the opposite position because of the potential conflict arising from a push at that location. In this context, the higher social level shown in Fig. 8-(e) empowers the agent to identify the specific ID or name of the agent responsible for a push within the areas of interest.

Furthermore, in order to integrate these social abilities into the learning agent team, the new framework MASo-DQL is introduced. MASo-DQL is constructed on top of multiagent deep Q-learning, introduced in the last section. As described above, it involves the social-rule parameter K, which defines how agents interact with each other; each agent holds its own social rules $K^i$, where i denotes the i-th agent in the team, and the overall social norm formed by all agents is $K = K^1 \times \cdots \times K^N$. The joint action is set the same as in Chapter 3.2, $A = A^1 \times \cdots \times A^N$. To store the social information temporarily, a social information batch is defined as O, with each agent's social information denoted $O^i$. For an individual agent, the state is reconstructed as $S_S = S_E \frown S_K$ to contain the full information, where $S_E$ is the environmental state of the task environment and $S_K$ is the social state following the social rule K, obtained from the social information batch O. In this way, agents in a team may receive different social states depending on their own states, interests, and social abilities. The other definitions remain the same as in MARL. The notation and definitions of MASo-DQL are summarized in the table below.

Table 3. Notation and definition of MASo-DQL.
Social rules K. Formula: $K \sim K^1 \times K^2 \times \cdots \times K^N$ (5). Definition: the rules that define how agents interact with each other; $K^i$ denotes the i-th agent's social rules.
Joint action A. Formula: $A \sim A^1 \times A^2 \times \cdots \times A^N$ (6). Definition: $A^i$ is the i-th agent's action space; $a_t^i$ is the i-th agent's action at timestep t.
Social information batch O. Formula: $O \sim K \times A$ (7). Definition: $O^1, O^2, \ldots, O^N$ are the individual agents' social information batches.
State $S_S$. Formula: $S_S = S_E \frown S_K$, with $S_K \leftarrow O$ (8). Definition: the state describes the current situation; $S_S$ is the state at the current step, $S_E$ is the environmental state from the task environment, and $S_K$ is the social state obtained from the social information batch O according to the social rule K.
Reward R. Formula: $R \sim S_S \times A$ (9). Definition: defines the goal and desirability of the task.
Joint policy $\Pi_S$. Formula: $\Pi_S \sim \Pi_S^1 \times \Pi_S^2 \times \cdots \times \Pi_S^N$ (10). Definition: a policy defines the learning agent's way of behaving at a given time; $\Pi_S^i$ is the i-th agent's policy with social learning.
Social Deep Q-learning (So-DQL) loss function for the i-th agent at timestep t. Formula: $L_t^i(\theta_t^i) = \mathbb{E}_{\langle s_S, a, r, s'_S \rangle \sim Z} \big[ \big( r_t^i + \gamma \max_{a'} Q(s_{S,t}^{i\,\prime}, a' \mid \theta_{t-1}^i) - Q(s_{S,t}^i, a \mid \theta_t^i) \big)^2 \big]$ (11). Definition: it measures the error between the output of the algorithm and the given target value.
Optimal policy of the i-th agent. Formula: $\pi_S^{*i} \in \arg\max_{\pi^i} \mathbb{E}\big[ \sum_{t=0}^{T} \gamma^{T-t} r_t^i \mid s_{S,0} \big]$ (12). Definition: an optimal policy attains the highest possible value in every state.

The simulation architecture is shown in Fig. 9. The social buffer queries and stores the information from the environment and the team. By processing this information according to the social rule K, the social information is allocated in batches to the agents. Each agent then acquires its own social information as part of its state and appends it to the environmental state. The state, including both environmental and social facets, is fed into the neural networks for training.

Figure 9. Simulation system architecture of MASo-DQL.

The pseudo-code is presented in Table 4.

Table 4. Pseudo code of Multiagent Social Deep Q-learning (MASo-DQL).
1. Initialize the number of agents N and the number of episodes M
2. Set social norm K
3. for episode = 1 to M do
4.   Initialize environment E, replay buffer Z, social buffer O, and state s_S
5.   Initialize network θ and target network θ_target
6.   Set hyperparameters ε, target network update frequency f, minibatch size J, total timesteps T
7.   for timestep t = 1 to T do
8.     Get joint actions a ← Q(s, a | θ) based on an ε-greedy policy
9.     Get s′_K based on K from social buffer O
10.    Save social information ⟨o_{t+1}⟩ based on K in social buffer O
11.    Execute joint actions a to get environmental state s′_E and reward r
12.    s′_S ← s′_E ⌢ s′_K
13.    Save ⟨s_S, a, r, s′_S⟩ in replay buffer Z
14.    for agent i = 1 to N do
15.      if t mod 5 == 0 then
16.        Sample minibatch ⟨s_S^j, a^j, r^j, s′_S^j⟩ from replay buffer Z
17.        if t mod f == 0: replace θ_target^i with θ_t^i
18.        Compute y = r_t^i + γ max_{a′} Q(s′_S^i, a′ | θ_target^i)
19.        Compute loss function L_t^i(θ_t^i) = E[(y − Q(s_{S,t}^i, a | θ_t^i))^2]
20.        Update θ_t^i based on gradient descent of the loss function
21.      s_S^i ← s′_S^i
22.    t ← t + 1
23. end
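To make the state construction in Table 4 concrete, the sketch below assembles s_S by concatenating an agent's environmental observation with the social features permitted by its social rule, mirroring step 12 of the pseudo-code; the feature layout, the encoding of the rule as a set of observable teammate IDs, and the default value for unobserved teammates are illustrative assumptions.

import numpy as np

def social_state(env_state, social_buffer, rule):
    """Build s_S = s_E concatenated with s_K for one agent.

    env_state:     the agent's task-related observation (array-like).
    social_buffer: dict of team signals recorded at the current step,
                   e.g., {agent_id: last_push_position}.
    rule:          the agent's social rule K_i, here simply the set of
                   teammate ids the agent is allowed to observe.
    """
    s_e = np.asarray(env_state, dtype=np.float32)
    # Keep only the teammates permitted by the social rule; teammates the
    # agent cannot observe contribute a default value of -1.
    s_k = np.array([social_buffer.get(j, -1) for j in sorted(rule)],
                   dtype=np.float32)
    return np.concatenate([s_e, s_k])

# Example: agent 0 may observe the last push positions of agents 1 and 2 only.
buffer = {1: 3, 2: 5, 3: 0}
s_s = social_state(env_state=[0.0, 1.0, 0.5], social_buffer=buffer, rule={1, 2})

Because the social rule filters the buffer before concatenation, agents with different rules receive different social states from the same buffer, which is the behavior described above.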
3.4. Design of Shaping Reward Fields
As introduced in previous sections, reward functions provide a numerical score that offers a gradient for an RL algorithm's iteration, guides agents' learning, and encourages exploration. To solve complex tasks, agents need more information for effective learning, and reward shaping is a technique for providing additional reward signals that can improve agents' learning efficiency and final performance [70] [121]. Some approaches to shaping rewards were introduced in Chapter 2.3. Among them, potential-based reward shaping provides a way to incorporate knowledge from different sources to guide the search in reinforcement learning. However, it is difficult to apply in practice because of its high demands on the designer's knowledge. In the potential-based reward shaping of equation (1), the potential field Φ is built manually by designers based on their understanding and knowledge of the task. It requires designers to possess high-level and complete patterns of the task and to translate that knowledge into mathematical fields in terms of states and actions. These factors make the approach difficult and sometimes impractical, especially when a task has high complexity.

Considering that a field is efficient at incorporating heuristic knowledge and provides a gradient to deliver that information [61], this thesis reformulates potential-based reward shaping and constructs a multivariate field using heuristic knowledge from designers. The field is presented below:

$R_{shaping} = C_{shaping} \cdot f(t) \cdot \Phi(\vec{s})$    (13)

where $\Phi(\vec{s})$ is a multivariate field over selected state signals and f(t) represents the task progress, controlling the iteration steps.

For a specific task, designers are expected to have some heuristic knowledge of the task [61], such as task goals and detectable signals; these are valuable sources for enriching the information in the system. My methodology aims to transfer that knowledge from humans to intelligent agents. The multivariate field $\Phi(\vec{s})$ is built over selected state signals that carry relevant information and are detectable, such as task goals. By translating this information into mathematical forms, the knowledge is embedded in the system, and the approach remains flexible with respect to the amount of knowledge available. The progress controller f(t) adapts the shaping to different phases of the task. The method therefore has a high tolerance for incompletely known situations and remains flexible in complex cases. It is worth mentioning that the proposed reward shaping methodology is applied under several prerequisites:
• The selected signals in the shaping reward are assumed to be known to the agents or robots in some way, either from their own sensors or from global controllers.
• The algorithms allow some values to be input as hyperparameters to help shape the task preferences.
• The preferences over the selected signals must not conflict with one another, but the signals are NOT required to be independent of each other.

Aiming to solve the difficulties of reward function design in assembly tasks, the remodeled shaping reward multivariate fields are designed with several improvements in mind. First, they provide a universal approach for assembly scenarios with geometrical precision requirements. Second, unlike the sparse-reward problems that previous methods may face, they remain effective even when designers hold incomplete task knowledge. Third, they can be scaled up flexibly to handle cases such as higher task complexity or heterogeneous agents. The implementation and validation of the methodology are introduced and discussed in Chapter 4.

3.5. Complexity Measurement
Task complexity has been recognized as an important characteristic [98] that is relevant to solution design and to predicting outputs. Currently, there is no consensus on a general definition [94] [122] that works for all fields, owing to their varied features and viewpoints.
Task complexity is crucial for understanding task characteristics when designing a complex system, since it determines the performance of the task performer in a hierarchical system [123]. Even though there is no universal definition of task complexity [98], we can still build a method that estimates the task complexity of certain engineering tasks for analysis and comparison and captures their features. In this thesis, we focus on solving assembly tasks involving collision avoidance; the task therefore mainly involves the various elements and objects in the environment and their geometrical features. As discussed in Chapter 2.4, task complexity can be divided into objective and subjective complexity. While objective task complexity considers only elements related directly to task characteristics, independent of task performers, subjective task complexity considers both task characteristics and task performers [94] [124]. Since our goal is to assist system design, which requires understanding task features and mapping performers to the task, objective task complexity is applied in this work. This implies that the measurement of complexity considers only the complexity contributors relevant to the specific tasks. We therefore establish a task model specifically for assembly tasks involving collision avoidance, structured to facilitate the design of task performers and thereby support the understanding of the learning mechanisms related to task complexity. Based on the above, the task model is organized as below:
• INPUT: objects in the environment and their corresponding physical properties.
• PROCESS: the requirements pertaining to the task process, such as physical requirements and collision avoidance requirements.
• GOAL: the explicit "state" or "condition" that specifies the objective to be accomplished in the given task.

The task complexity can be characterized by employing Complexity Contributory Factors (CCFs) [125], which are derived from prominent complexity contributors based on the task model. By utilizing this approach, a comprehensive assessment of task complexity can be achieved. Table 5 presents the list of CCFs applied in our studies.

Table 5. CCFs in assembly tasks involving collision avoidance.
Model | CCF | Relation | Definition
INPUT | Dimension of the task (DofT) | Positive | Degrees of freedom
INPUT | Number of static elements (NSE) | Positive | Static objects in the task environment
INPUT | Number of dynamic elements (NDE) | Positive | Dynamic objects in the task environment
INPUT | Number of relevant obstacles (NRO) | Positive | Obstacles in the solution area
PROCESS | Physical requirements (PRs) | Positive | Requirements on displacement, rotation, friction, etc.
PROCESS | Collision avoidance requirements (CARs) | Positive | Requirements on avoiding collisions
GOAL | Target configuration requirements (TRs) | Positive | Degrees of freedom

Following this list of salient factors associated with assembly tasks involving collision avoidance, the task complexity can be estimated by considering their interrelationships. Since the task complexity is estimated as the collective effect of the INPUT, PROCESS, and GOAL complexities, it is defined as a combination of the three, as shown in equation (14):

$\mathit{Task\ Complexity} = C \cdot COMP_{INPUT} \cdot COMP_{PROCESS} \cdot COMP_{GOAL}$    (14)

where C is a constant hyperparameter.
In equation (14), the INPUT, PROCESS, and GOAL complexities are estimated based on their own influencers listed in Table 5, and the calculations are shown below. A larger number of objects in the field means a narrower space for the assembly task. Thus, the INPUT complexity considers the areas of the static elements, dynamic elements, and obstacles, as shown in equation (15); the nearer their positions are to the goal, the more complexity they contribute. The collective influences are normalized by the overall area of the field.

$COMP_{INPUT} = \left[ \frac{1}{c_{area}} \left( \sum_{i}^{NSE} \widetilde{se}_i \cdot area_i + \sum_{j}^{NDE} \widetilde{de}_j \cdot area_j + \sum_{k}^{NRO} \widetilde{ro}_k \cdot area_k \right) \right]^{DofT}$    (15)

where

$c_{area}$ is a constant hyperparameter; here, $c_{area} = 1000$    (16)
$se_i = (\text{the distance from each static element to the goal})^{-1}$    (17)
$de_j = (\text{the distance from each dynamic element to the goal})^{-1}$    (18)
$ro_k = (\text{the distance from the relevant obstacle to the goal})^{-1}$    (19)

$\widetilde{se}_i$, $\widetilde{de}_j$, and $\widetilde{ro}_k$ are the normalized values, multiplied by an importance factor if necessary, and $area_i$, $area_j$, $area_k$ are the areas of the corresponding objects.

The PROCESS and GOAL complexities are estimated as shown in equations (20) and (21), respectively. The PROCESS complexity focuses on the collision avoidance requirements during the implementation process, while the GOAL complexity focuses on the goal state.

$COMP_{PROCESS} = \text{CA requirements} \cdot \text{Physical requirements} = CARs \cdot PRs$    (20)

$COMP_{GOAL} = \text{Target configuration requirements} = TRs$    (21)
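A minimal sketch of the estimate in equations (14)-(21) is given below, assuming the tilde terms are obtained by normalizing the inverse distances so that they sum to one; the element list, distances, and requirement scores are illustrative only, and the normalization is a simplification of the full definition.

def input_complexity(elements, dof, c_area=1000.0):
    """Approximate COMP_INPUT from equation (15).

    `elements` is a list of (distance_to_goal, area) pairs covering static
    elements, dynamic elements, and relevant obstacles; inverse distances
    are normalized to sum to one before weighting the areas.
    """
    inv = [1.0 / d for d, _ in elements]
    total = sum(inv)
    weighted = sum((w / total) * area for w, (_, area) in zip(inv, elements))
    return (weighted / c_area) ** dof

def task_complexity(comp_input, cars, prs, trs, c=1.0):
    # Equation (14): the product of the three components, with
    # COMP_PROCESS = CARs * PRs (eq. 20) and COMP_GOAL = TRs (eq. 21).
    return c * comp_input * (cars * prs) * trs

elements = [(800.0, 180 * 60),   # a box far from the goal
            (100.0, 180 * 60)]   # a box near the goal
print(task_complexity(input_complexity(elements, dof=2), cars=1.0, prs=1.0, trs=2.0))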
Chapter 4. Reward Shaping and its Impact on Self-organizing Systems
4.1. Introduction
In MARL, reward functions play a crucial role in encouraging exploration and providing a gradient for the learning process, which greatly influences training efficiency, learning speed, and task performance. For agents to learn action behaviors for relatively complex tasks from the reward functions, one needs to understand the properties of various types of reward functions for the given task domains. While reward design relies heavily on designers' instincts, experiences, observations, and prior knowledge [126], the lack of a systematic understanding of how reward functions and problem properties interact may lead to system inefficiency and risks of failure. The design situation is usually full of challenges and limited resources. It is therefore meaningful to seek better information carriers and to close the understanding gap between human beings and artificial agents; this is a necessary step toward solving more complex engineering tasks. Therefore, this chapter explores the problem of reward function design in the context of engineering assembly tasks, particularly the reward shaping field construction introduced in Chapter 4.4. It investigates how various reward signals and different forms of reward functions may impact the agent teams' training and task performance. Thus, this chapter addresses the following questions: How do different reward shaping fields impact the task performance of agent teams? How may such an impact interact with agent team sizes?

4.2. Task Description
In order to empirically investigate how reward shaping influences the learning process and the task performance of MARL self-organizing systems, a specific task is introduced in this section, and the shaping reward design issues for the task are explored.
The Task: The target problem is an "L-shape" assembly task, in which an agent team learns to push each part from separation to the goal configuration, i.e., the "L" shape. At the same time, the task requires agents to avoid collisions with the surrounding walls in the process.

(a) task environment (b) task start and goal status
Figure 10. Task of 'L-shape' assembly introduction.

Fig. 10 shows the task environment, a field of 1000×1000 pixels. The upper-left rectangle is a dynamic box that agents can push; its mass equals 1 and its size is 180×60. The middle-right rectangle is a static box that serves as the target for forming the "L" shape. The agent team (the green squares in the figure), with limited sensing capabilities, needs to spontaneously organize itself to push the dynamic box toward the target box, end with an "L" shape, and avoid the box colliding with the surrounding walls. Detailed settings of the task environment are introduced in Table 6.

Table 6. Environment settings in task: "L-shape" assembly.
Field size (pixels): 1000 × 1000
Box size (pixels): 180 × 60
Target box size (pixels): 180 × 60
Box start center coordinates: (150, 180)
Target box center coordinates: (950, 500)
Box mass (kg): 1
Push impulse (N·s): 1

In assembly tasks, the precision of the final configuration is important. To achieve the task's goal, robot teams are supposed to master strategies concerning not only the dynamic box's trajectory but also its rotation and displacement; they need to cooperatively control the box's orientation all the way to the final "L" configuration. Because all agents push the box together at each step, the box's movement depends on the joint impulses. This increases the task's difficulty and puts a higher requirement on the group's intelligence. Considering the above, the two boxes are initialized with different orientations and placed a certain distance apart in the field, which gives agent teams space to try and adjust pushing strategies.

4.3. MARL Modeling and Reward Function Design
In the implementation, the agents are homogeneous, with the same action space and sensing capability. To complete the task, they organize themselves to work cooperatively as a team, each controlled by its own neural network. During training, each agent treats the other agents' influence as part of the environment. The self-organizing system is trained by deep Q-learning, and an ε-greedy strategy is applied to explore the optimal policy [15]. The hyperparameters of the algorithm are shown in Table 7.

Table 7. Hyperparameter settings.
Training episodes: 16,000
Discount factor: 0.99
Memory buffer size: 1,000
Mini-batch size: 32
Target network update frequency: 200
Learning rate: 0.001
Neural network size: (63, 64, 128, 6)
Epsilon: 1 → 0.01
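For reference, the sketch below shows one way the ε-greedy exploration of Table 7 could be scheduled, decaying ε from 1 to 0.01 over the 16,000 training episodes; the linear schedule itself is an assumption, since only the start and end values are specified above.

import random

EPS_START, EPS_END, EPISODES = 1.0, 0.01, 16_000

def epsilon_at(episode: int) -> float:
    # Linear decay from EPS_START to EPS_END across all training episodes.
    frac = min(episode / EPISODES, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def epsilon_greedy(q_values, episode: int) -> int:
    # Explore with probability epsilon, otherwise exploit the greedy action.
    if random.random() < epsilon_at(episode):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: pick one of the six push actions halfway through training.
action = epsilon_greedy([0.1, 0.4, -0.2, 0.0, 0.3, 0.2], episode=8_000)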
∆𝜌 𝑐𝑜𝑛𝑡𝑟𝑜𝑙 is a hyperparameter that is set manually before the task begins. It represents an angle change threshold for receiving punishment when exceeded; it is set as 11° in all cases. 𝐶 𝑟 is a hyperparameter that works to tune the collective effect with other rewards. It is set as 10 in our cases. Although ∆𝜌 𝑖 is globally available information, each agent observes its own values and updates its policy. This allows us to introduce local noises and assess the impact of partially observable states. Global Reward: The global rewards in this study are calculated based on the global parameter values and fed to each agent simultaneously as a reward signal. They include goal rewards 𝑅 𝑔𝑜𝑎𝑙 , distance rewards 𝑅 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 and collision rewards 𝑅 𝑐𝑜𝑙𝑙𝑖𝑠𝑖𝑜𝑛 . The goal reward 𝑅 𝑔𝑜𝑎𝑙 gives a big positive reward when the agents achieve the goal (form an “L” shape with two boxes). It is set as: 𝑅 𝑔𝑜𝑎𝑙 = 𝐶 𝑔𝑜𝑎𝑙 ∙ 𝛪 𝐺𝑂𝐴𝐿 ( 24 ) where Ι 𝐺𝑂𝐴𝐿 is a unit indicator of the event GOAL (the “L”-assembly goal is achieved). 𝐶 𝑔𝑜𝑎𝑙 is a hyperparameter and is set as 1000 in our cases. The collision reward 𝑅 𝑐𝑜𝑙𝑙𝑖𝑠𝑖𝑜𝑛 works to avoid any collisions that happen during the task, and it is set as follows: 𝑅 𝑐𝑜𝑙𝑙𝑖𝑠𝑖𝑜𝑛 = 𝐶 𝑐𝑜𝑙𝑙𝑖𝑑𝑒 −𝑤𝑎𝑙𝑙 ∙ 𝛪 𝐶𝑂𝐿𝐿𝐼𝐷𝐸 −𝑤𝑎𝑙𝑙 + 𝐶 𝑐𝑜𝑙𝑙𝑖𝑑𝑒 −𝑜𝑏𝑠 ∙ 𝛪 𝐶𝑂𝐿𝐿𝐼𝐷𝐸 −𝑜𝑏𝑠 ( 25 ) where 𝛪 𝐶𝑂𝐿𝐿𝐼𝐷𝐸 −𝑤𝑎𝑙𝑙 is a unit indicator of the event COLLISION_WALL (the dynamic box collides onto the walls) and Ι 𝐶𝑂𝐿𝐿𝐼𝐷𝐸 −𝑜𝑏𝑠 an indicator of the event COLLISION_OBS (the 60 dynamic box collides with the obstacles). The 𝐶 𝑐𝑜𝑙𝑙𝑖𝑑𝑒 −𝑤𝑎𝑙𝑙 and 𝐶 𝑐𝑜𝑙𝑙𝑖𝑑𝑒 −𝑜𝑏𝑠 are hyperparameters and are set as -100 and -200, respectively, in all cases. The distance reward 𝑅 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 is designed to encourage agents to push the box toward the target box and punish them if they go in the opposite direction. It is set as: 𝑅 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = 𝐶 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 ∙ ∆𝑑 𝑖 ( 26 ) where ∆𝑑 𝑖 is the distance difference from the current box’s position to the target box at step 𝑖 . It is a positive number when the box moves nearer to the target at step 𝑖 . 𝐶 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 is a hyperparameter and is set as 0.1 in our cases. Shaping Reward: The reward shaping field is designed to allow the exploration of multiple different shaping reward categories. Considering the task’s goal (“L-shape” configuration) and sensing capability, two parameters are selected to form a signal vector, 𝑠⃗ = < 𝛼 ,𝛽 >. The first one, named 𝛼 , is the box’s rotation angle from the vertical line, as shown in Fig. 11-(a). Since, in the beginning, the moving box’s longitudinal axis is aligned with the vertical (or north) line, the α angle changes in the range from 0° to 180°. The second parameter, named 𝛽 , is the angle between the horizontal line and the line connecting the centers of the two boxes, as shown in Fig. 11-(a). (a) angles in a pushing process (b) angles at the goal position Figure 11. Box self-angle α and relative angle β. 61 These two angles are chosen as reward signals because 1) they together fully define both the transient process (i.e., changes of the parameter values) and the final angular relation of the two boxes, while other reward terms are employed to reward either the local (e.g., the rotational angle change) or the final positions of the boxes (e.g., collision or goal states); 2) they both can be obtained directly from the sensors on the box, no need to add new sensors for the shaping purpose. 
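Before formalizing the shaping term, the non-shaping reward terms defined in equations (23)-(26) can be collected into a single per-step function. This is a minimal sketch: the function and argument names are illustrative, the event indicators and the step-wise quantities (the rotation change and the distance change) are assumed to be reported by the pygame/pymunk simulator, and the shaping reward introduced below is added on top of this sum.

```python
import math

# Constants from equations (23)-(26); values as stated in the text.
C_R, DELTA_RHO_CONTROL = 10.0, math.radians(11)   # rotation reward tuning and threshold
C_GOAL = 1000.0                                   # goal reward
C_COLLIDE_WALL, C_COLLIDE_OBS = -100.0, -200.0    # collision penalties
C_DISTANCE = 0.1                                  # distance reward

def step_reward(delta_rho, goal_reached, hit_wall, hit_obstacle, delta_d):
    """Sum of the local and global reward terms for one step.

    delta_rho     -- change of the box's rotation angle at this step (rad)
    goal_reached  -- True if the "L" configuration is formed (event GOAL)
    hit_wall      -- True if the dynamic box collided with a wall
    hit_obstacle  -- True if the dynamic box collided with an obstacle
    delta_d       -- decrease of the box-to-target distance at this step
    """
    # Rotation changes beyond the 11-degree threshold make this term negative.
    r_rotation = C_R * (math.cos(delta_rho) - math.cos(DELTA_RHO_CONTROL))
    r_goal = C_GOAL * float(goal_reached)
    r_collision = C_COLLIDE_WALL * float(hit_wall) + C_COLLIDE_OBS * float(hit_obstacle)
    r_distance = C_DISTANCE * delta_d
    return r_rotation + r_goal + r_collision + r_distance
```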
From the geometrical relationship of the two boxes, we know that when the angle α equals 90° ( 𝜋 2 rad) and angle β equals 135° ( 3𝜋 4 rad), the final configuration is shaped as “L” perfectly, as shown in Fig. 11-(b). Thus, two hyperparameters can be defined in equations (27) and (28). 𝛼 𝑔𝑜𝑎𝑙 = 90° or 𝜋 2 𝑟𝑎𝑑 ( 27 ) 𝛽 𝑔𝑜𝑎𝑙 = 135° or 3 𝜋 4 𝑟𝑎𝑑 ( 28 ) Based on the above, one can devise mathematical functions to describe the task’s preference. And the shaping reward can be written as: 𝑅 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 = 𝐶 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 ∙ 𝑓 ( 𝑡 )∙ 𝛷 ( 𝑠⃗) ( 29 ) 𝛷 ( 𝑠⃗)= ℎ( 𝛼 )∙ 𝑔 ( 𝛽 ) ( 30 ) where 𝑓 ( 𝑡 ) represents the task progress in controlling the iteration steps. ℎ( 𝛼 ) and 𝑔 ( 𝛽 ) are functions that represent the task’s preference for parameters 𝛼 and 𝛽 , respectively. The simplest form that can be used for angle 𝛼 is: ℎ( 𝛼 )= |𝑠𝑖𝑛𝛼 | ( 31 ) 62 That is consistent with the system’s preference, which means it provides the max reward when the angle α equals 90°, and the reward gradually decreases when the angle α gets deviated from 90°. Some other forms that can be used for angle α are: ℎ( 𝛼 )= 𝑒𝑥𝑝 ( −𝐶 𝑎 ∙ |𝛼 − 𝛼 𝑔𝑜𝑎𝑙 |) ( 32 ) where 𝛼 𝑔𝑜𝑎𝑙 is a constant that means our training goal for angle α and 𝛼 𝑔𝑜𝑎𝑙 = 90°. 𝐶 𝑎 is a constant coefficient to tune the gradient. Compared to the reward functions without reward shaping terms, the reward shaping items provide additional information that is coded based on the system designer’s heuristic knowledge of how the system should move through the transient process (e.g., equations (29) and (30)) into the goal state (i.e., equations (27) and (28)). Therefore, reward shaping provides space and opportunities to explore the reward landscape to deepen our understanding of task dynamics from a learning perspective, making it possible for us to develop a reward field in which the agents can learn about the task and social fields [13] [35]. Similar steps can be carried out with the reward shaping with regard to the parameter 𝛽 . Specifically, 𝑔 ( 𝛽 ) can be set as: 𝑔 ( 𝛽 )= 𝑒𝑥𝑝 ( −𝐶 𝛽 ∙ |𝛽 − 𝛽 𝑔𝑜𝑎𝑙 |) ( 33 ) where 𝛽 𝑔𝑜𝑎𝑙 is a constant to indicate our training goal, and its value is shown in equation (27). 𝐶 𝛽 is a constant coefficient to tune the gradient. In this study, the resulting “L” configuration is good or not only can be known at the end of the task progress; thus, we set 𝑓 ( 𝑡 )= 1, meaning to check the configuration when the task ends. To 63 investigate how different reward shaping impacts the learning process and task performance, we examine different reward shaping fields of different forms. Following is a list of the reward shaping fields being studied. 𝑷𝟎 : 𝑁𝑜 𝑅𝑒𝑤𝑎𝑟𝑑 𝑆 ℎ𝑎𝑝𝑖𝑛𝑔 𝐹𝑖𝑒𝑙𝑑 ( 34 ) 𝑷𝟏 : 𝑅 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 = 𝐶 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 ∙ |𝑠𝑖𝑛𝛼 | ( 35 ) 𝑷𝟐 : 𝑅 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 = 𝐶 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 ∙ |𝑠𝑖𝑛𝛼 | ∙ 𝑒𝑥𝑝 ( −𝐶 𝛽 ∙ |𝛽 − 𝛽 𝑔𝑜𝑎𝑙 |) ( 36 ) 𝑷𝟑 : 𝑅 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 = 𝐶 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 ∙ 𝑒𝑥𝑝 ( −𝐶 𝑎 ∙ |𝛼 − 𝛼 𝑔𝑜𝑎𝑙 |)∙ 𝑒𝑥𝑝 ( −𝐶 𝛽 ∙ |𝛽 − 𝛽 𝑔𝑜 𝑎 𝑙 |) ( 37 ) 𝑷𝟒 : 𝑅 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 = 𝐶 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 ∙ 𝑒𝑥𝑝 ( −𝐶 𝑎 ∙ |𝛼 − 𝛼 𝑔𝑜𝑎𝑙 |) ( 38 ) 𝑷𝟓 : 𝑅 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 = 𝐶 𝑠 ℎ𝑎𝑝𝑖𝑛𝑔 ∙ 𝑒𝑥𝑝 ( −𝐶 𝛽 ∙ |𝛽 − 𝛽 𝑔𝑜𝑎𝑙 |) ( 39 ) Fig. 12 and Fig. 13 are visualizations of these fields, in which the x-axes are angles of α and/or β, and the y-axes are field values. (a) P1: α field only (b) P4: α field only (c) P5: β field only Figure 12. Illustrations of reward shaping fields P1, P4 and P5. 64 (a) P2: α & β field (b) P3: α & β field Figure 13. Illustrations of reward shaping fields P2 and P3. The difference between the |sin( 𝑥 − 𝐺 ) | field and exp( −( 𝑥 − 𝐺 ) ) field indicates two different training tendencies of the agents toward the goal. 
The former implies the fast-approaching early on and slow adjustments at the end around the goal. The latter is the opposite, slowly approaching early on and significant adjustments at the end. Furthermore, the cases of P1, P4, and P5 simulate situations where only a partial signal, 𝛼 or 𝛽 , is available for the purpose of reward shaping. Based on the state space, action space, and the reward function introduced previously in this section, the deep Q-learning algorithm (hyperparameters shown in Table 7) with the designed reward shaping fields (in equation (34) - (39)) inserted in the reward function can be applied to train. Agent teams learn to conduct the" L"-assembly task throughout thousands of episodes, and their optimal policies are saved and retrieved for outputting the task-performing procedure. The agent teams' performances are supposed to be different as the shaping rewards vary. In order to find effective ways of transiting the task information to artificial agents, the influences of various forms of reward shaping fields are analyzed from several perspectives. The experiments are designed for that goal and are introduced in the following section. 65 4.4. Experiment Design The physical dynamics of the box movement are simulated in pygame [127] and pymunk [128] models. The early simulation studies have demonstrated that the local rewards and global rewards alone frequently failed to guide agents to form needed configurations with desired precision in assembly tasks. Therefore, the reward shaping potential fields introduced above are investigated to further our understanding of the impact of various shaping potential fields in the context of different team sizes. Specifically, the experiment design intends to address the abovementioned research questions related to the impact of different reward shaping fields and their interaction with the size of agent teams. Figure 14. Experiment design for the shaping reward study. 4.4.1 Independent variables RS fields: The first research question asks how different kinds of reward shaping signals influence the learning process and, consequently, task performance. Although there can be unlimited forms of possible reward shaping functions, our study focuses on differentiating between “convex vs. concave” functions, “continuous vs. with a singularity (cusp)” functions, as well as “partial vs. complete” reward shaping signals. “Concave vs. convex” signifies different ways of approaching 66 task goals: fast-first-then-gradual or gradual-first-then-fast; “continuous vs. w/singularity” categorizes two groups of functions that treat the goal point significantly differently; and “partial vs. complete” checks if limited reward shaping is still meaningful. It is further expected that the impact of these RS fields will interact with the size of the agent teams. The independent variable “RS fields” shown in Fig. 14 has six possible variable formations with different properties (see Fig. 12 and Fig. 13 and equations through (34) to (39)): • P0 (baseline), • P1 (“concave”, “continuous”, “partial”), • P2 (“concave”, “continuous”, “complete”), • P3 (“convex”, “w/singularity”, “complete”), • P4 (“convex”, “w/singularity”, “partial”), • P5 (“convex”, “w/singularity”, “partial”), RS field gradients: Our initial simulation studies by varying RS fields from P0 to P5 revealed significant results with P3. 
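For concreteness, the candidate fields in equations (34)-(39) can be written as one parameterized function, with the gradient coefficients appearing as the exponential rates that the next independent variable sweeps over. This is a sketch rather than the exact experiment code: the default coefficient values and the C_shaping constant are illustrative assumptions.

```python
import math

ALPHA_GOAL, BETA_GOAL = math.pi / 2, 3 * math.pi / 4   # goal angles, equations (27)-(28)

def shaping_reward(alpha, beta, field="P3", c_shaping=1.0, c_alpha=1.0, c_beta=1.0):
    """Shaping fields of equations (34)-(39), evaluated with f(t) = 1 at task end.

    alpha, beta -- box self-angle and relative angle (rad)
    field       -- one of "P0" ... "P5"
    c_shaping, c_alpha, c_beta -- hyperparameters (illustrative defaults)
    """
    h_sin = abs(math.sin(alpha))                             # equation (31), concave/continuous
    h_exp = math.exp(-c_alpha * abs(alpha - ALPHA_GOAL))     # equation (32), convex w/ singularity
    g_exp = math.exp(-c_beta * abs(beta - BETA_GOAL))        # equation (33), convex w/ singularity
    fields = {
        "P0": 0.0,              # no reward shaping field
        "P1": h_sin,            # alpha only
        "P2": h_sin * g_exp,    # concave alpha term times convex beta term
        "P3": h_exp * g_exp,    # convex terms on both angles
        "P4": h_exp,            # alpha only
        "P5": g_exp,            # beta only
    }
    return c_shaping * fields[field]
```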
In order to further explore the details of RS fields’ impact on the system learning and task performance, we introduced another independent variable, “RS field gradient.” This variable is composed of two hyperparameters 𝐶 𝛼 and 𝐶 𝛽 shown in equation (37). The possible values of these two parameters range from 0.1 to 10 to regulate the field gradients. The results are expected to give hints on designing an RS field with appropriate gradients for a certain task and control its output in a predicted way. Again, it is expected that the team size will interact with the gradients. 67 Agent team size: Our research question addresses the interaction with team size in MARL, which is a critical aspect to be considered in multiagent system design. For further understanding of how the impact of RS fields interacts with team size, “agent team size” is introduced as the third independent variable, as shown in Fig. 14. The possible values of “agent team size” are set to be 3, 5, 7, and 9 in the context of various RS fields. Given the task complexity, computational resources, and the box’s dynamic movements, the minimum team size is set as 3 due to the dynamic box shown in Fig. 11-(a) requires both displacement and rotation, and the maximum team size is set as 9 for current studies due to the referential experiences from previous studies [15]. As the task complexity grows in future studies, bigger sizes of agent teams will be considered. 4.4.2 Dependent variables Task performance, shown in Fig. 14, is an important aspect of evaluation. Task performance focuses on how well an assigned task is completed by learning agent teams. The following paragraphs introduce the general concepts of performance measures. Further details are described in the next section, with corresponding results and discussions. Task performance: As shown in Fig. 11-(b), we are especially interested in the Task performance measured as an “L” configuration of the trained agent teams since it indicates the quality of the assembly task. The collision avoidance requirement is fulfilled by the collision punishment in the reward function and the episode termination condition (terminate once collide on walls or obstacles). Therefore, by querying two parameters, 𝛼 and 𝛽 shown in Fig. 11, the Task performance can be presented and evaluated quantitively. As shown in Fig. 15, the task results of learning teams are presented in a coordinate system whose x-axis is angle 𝛼 , the y-axis is angle 𝛽 , and one dot presents one learning team’s task result. A perfect “L-shape” means 𝛼 = 90° and 𝛽 = 68 135° (defined in equations (27) and (28), represented by a red dot shown in Fig. 6. Other dots are the task results of the trained agent teams. The closer to the red dot the results are, the better quality of the final “L-shape” is. Figure 15. Illustration of final configuration scatter plots in the form of angle 𝛼 and angle 𝛽 . Furthermore, the statistical aspects can be assessed by calculating the Euclidean Distance between a learning team’s performance and the goal in the space of 𝑠⃗ = < 𝛼 ,𝛽 >, which can be expressed as: 𝐷 𝑖 = √ ( 𝛼 𝑖 − 𝛼 𝑔𝑜𝑎𝑙 ) 2 + ( 𝛽 𝑖 − 𝛽 𝑔𝑜𝑎𝑙 ) 2 ( 40 ) where 𝛼 𝑖 and 𝛽 𝑖 are obtained from the ith learning team’s final configuration. and 𝛼 𝑔𝑜𝑎𝑙 and 𝛽 𝑔𝑜𝑎𝑙 are known from equations (27) and (28). The Euclidean distance measure assists in understanding the impact of reward shaping quantitatively and further processing the results. 
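The distance measure in equation (40) is straightforward to compute; the sketch below works in degrees, matching the scatter-plot axes, and its helper name and example values are illustrative.

```python
import math

ALPHA_GOAL_DEG, BETA_GOAL_DEG = 90.0, 135.0   # equations (27)-(28), in degrees

def config_distance(alpha_deg, beta_deg):
    """Euclidean distance of a final configuration to the goal, equation (40)."""
    return math.hypot(alpha_deg - ALPHA_GOAL_DEG, beta_deg - BETA_GOAL_DEG)

# Example: a team ending at alpha = 84 deg and beta = 141 deg lies about
# 8.5 degrees away from the goal in the (alpha, beta) space.
```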
Smaller distance values indicate better final configurations, while zero distance means a perfect “L-shape.” 69 4.5. Results and Discussion 4.5.1 Effect of reward shaping functions and signals As illustrated in Chapter 4.4, the experiments are conducted with varying RS fields from P0 to P5 in the context of team sizes of 3, 5, 7, and 9, respectively. For every group of experiments, we run 20 training cases with 20 different random seeds. The results of 3, 5, 7, and 9 team sizes are shown in Fig. 16 to Fig. 19. The team size effect will be discussed in Chapter 4.5.3. Quantitively, the mean of Euclidean distance and the standard deviation introduced in Chapter 4.4.2 are presented in an upper-left box in each plot. Figure 16. Final configurations for 3-agent teams: (a) P0-field, (b) P1-field, (c) P2-field, (d) P3-field, (e) P4-field, (f) P5-field. 70 Figure 17. Final configurations for 5-agent teams: (a) P0-field, (b) P1-field, (c) P2-field, (d) P3-field, (e) P4-field, (f) P5-field. Figure 18. Final configurations for 7-agent teams: (a) P0-field, (b) P1-field, (c) P2-field, (d) P3-field, (e) P4-field, (f) P5-field. 71 Figure 19. Final configurations for 9-agent teams: (a) P0-field, (b) P1-field, (c) P2-field, (d) P3-field, (e) P4-field, (f) P5-field. No Reward Shaping: As shown in equation (34), the P0-field works as a benchmark in which the agent teams are trained without a reward shaping field. The training results with P0-field are shown in Fig. 17-(a), showing cases with a team size of 5. Compared with other results, the final configurations with P0-field suffer a significant variance. All data points distribute sparsely and are far from the target position (red dot), which means no team can produce a good “L-shape.” It indicates that the agents can hardly find a good policy to push the box into the desired shape without additional guidance, even if the goal position has been imparted and rewarded with an enormous positive value through 𝑅𝑔𝑜𝑎𝑙 in the reward function. The results call for applying reward shaping fields to provide needed guidance for seeking the task performance of the trained agent teams. Partial Reward Shaping: As shown in equations (35) (38) (39), P1-field, P4-field, and P5-field provide the reward shaping information on either angle 𝛼 or angle 𝛽 , but not both. These cases simulate situations where system designers may have an incomplete understanding of the problem or only a partial reward signal is observable by agents. By incorporating either 𝛼 or 𝛽 in the reward shaping field, the results show that even partial shaping information can improve the training 72 quality and hence the task performance, but to different degrees. Fig. 17-(b), Fig. 17-(e), and Fig. 17-(f) are the training results with a team size of 5. They show that the data points become more concentrated in the target position (red dot) compared to the benchmark, which means the reward shaping fields helped the agent team learn better. Besides, it shows that different information has specific effects on tuning its own corresponding behavior; the field of 𝛼 helps agents form a better box’s rotational position, and the field of 𝛽 guides agents to explore a good strategy to arrive at a good relevant position between the two boxes. Convex Singular vs. Concave Continuous Reward Shaping: Comparing Fig. 17-(b) and Fig. 17- (e), one can observe that different fields have different impacts on learning quality. 
Reward shaping fields constructed by a convex function with a singularity at the goal perform better on guiding agents than those by a concave and continuous function. The singularities in the reward shaping fields are especially effective in stimulating agents’ learning and resulting in better training quality. The convex function leading to the sharper changing rate around the goal provides better gradient information and clearer signals for agents to capture the required “knowledge” for achieving the “L-shape” configuration. Gradient information is crucial for RL training, especially when reward shaping is involved. The results indicate that using convex functions in reward shaping fields is more effective for training than concave functions. Complete Reward Shaping: The P2-field and P3-field simulate the situations when complete information is available for reward shaping, such as degrees of freedom of a desired shape and signals that are needed to configure that shape. By incorporating more information, different reward shaping fields can be composed. P2-field applies a concave function without singularity on the angle 𝛼 and a convex function with singularity on the angle 𝛽 , while P3-field applies two convex functions with singularities on both angles 𝛼 and 𝛽 . By comparing the performance results 73 of the trained agent teams in Fig. 17-(c) with Fig. 17-(d), one can see that P3-field excels over the P2-field on training quality greatly. In Fig. 17-(d), a majority of training cases gather around the target point (red dot), indicating that P3-field is much more effective in guiding agents to learn how to complete the assembly task. It proves again that the gradient information does impact the shaping significantly, and convex functions have better effects than concave functions. The above results show that the final configuration of the “L-shape” assembly task is highly influenced by different reward shaping fields. A better design of the reward shaping field can improve the precision of the final assembly configuration. 4.5.2 Impact of reward shaping gradients As indicated in Fig. 16 to Fig. 19 and the discussion in Chapter 4.5.1, reward shaping fields constructed by convex functions with singularities can be more effective for guiding agent teams in learning for assembly tasks. To go further, we focus on P3-field and investigate how the reward shaping gradients impact the training quality by measuring the following metrics: • Best performance—i.e., the closest learning team’s result to the goal position. • Average performance—i.e., the mean of all learning teams’ results. • Majority performance—i.e., the mean learning performance of the main results cluster. • Deviation from the mean—i.e., the standard deviation of the learning teams’ results from the average performance. • Outliers—i.e., the learning teams’ results that are significantly deviated from the majority of teams. 74 Fig. 20 shows a set of convex functions applied to reward shaping fields. They have different gradients controlled by the coefficients 𝐶 𝛼 (Fig. 20-(a)) and 𝐶 𝛽 (Fig. 20-(b)) in equations (32) and (33). One can see that the bigger the coefficients are, the steeper the fields; hence, stricter rules are applied in the training process of the agent teams. (a) α-fields (b) β-fields Figure 20. Convex shaping reward functions with different gradients. Fig. 21 to Fig. 24 demonstrate the final configurations of the results trained by the P3 reward shaping field for 3, 5, 7, and 9-agent teams. 
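The metrics listed above can be extracted from the 20 per-seed Euclidean distances of each experimental setting roughly as follows. This is a sketch: the quartile-based "majority" rule and the 1.5 x IQR outlier cutoff are assumptions that mirror a standard boxplot convention, and the crude quartile indexing is only adequate for small samples such as 20 trials.

```python
import statistics

def summarize_trials(distances):
    """Summary metrics over the per-seed Euclidean distances of one setting."""
    d = sorted(distances)
    n = len(d)
    q1, q3 = d[n // 4], d[(3 * n) // 4]          # crude quartile estimates
    iqr = q3 - q1
    majority = [x for x in d if q1 <= x <= q3]   # main results cluster
    return {
        "best": d[0],                                  # closest result to the goal
        "average": statistics.mean(d),                 # mean of all teams
        "deviation": statistics.pstdev(d),             # spread around the mean
        "majority_mean": statistics.mean(majority),    # majority performance
        "outliers": [x for x in d
                     if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr],
    }
```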
The coordinates and the results plotted in the figures follow the same convention as described in Chapter 4.4. Mean (average performance), and standard deviation (deviation) can be found in the upper-left blue box in each plot, the same as in Chapter 4.5.1. 75 Figure 21. Final configurations for 3-agent teams: (a) P0 field, (b) - (k) P3-field, (b) 𝐶 𝛼 = 𝐶 𝛽 = 0.1, (c) 𝐶 𝛼 = 𝐶 𝛽 = 0.3, (d) 𝐶 𝛼 = 𝐶 𝛽 =0.5, (e) 𝐶 𝛼 = 𝐶 𝛽 = 0.7, (f) 𝐶 𝛼 = 𝐶 𝛽 = 0.8, (g) 𝐶 𝛼 = 𝐶 𝛽 = 1, (h) 𝐶 𝛼 = 𝐶 𝛽 = 1.2, (i) 𝐶 𝛼 = 𝐶 𝛽 = 1.4, (j) 𝐶 𝛼 = 𝐶 𝛽 = 3, (k) 𝐶 𝛼 = 𝐶 𝛽 = 10. 76 Figure 22. Final configurations for 5-agent teams: (a) P0 field, (b) - (k) P3-field, (b) 𝐶 𝛼 = 𝐶 𝛽 = 0.1, (c) 𝐶 𝛼 = 𝐶 𝛽 = 0.3, (d) 𝐶 𝛼 = 𝐶 𝛽 =0.5, (e) 𝐶 𝛼 = 𝐶 𝛽 = 0.7, (f) 𝐶 𝛼 = 𝐶 𝛽 = 0.8, (g) 𝐶 𝛼 = 𝐶 𝛽 = 1, (h) 𝐶 𝛼 = 𝐶 𝛽 = 1.2, (i) 𝐶 𝛼 = 𝐶 𝛽 = 1.4, (j) 𝐶 𝛼 = 𝐶 𝛽 = 3, (k) 𝐶 𝛼 = 𝐶 𝛽 = 10. 77 Figure 23. Final configurations for 7-agent teams: (a) P0 field, (b) - (k) P3-field, (b) 𝐶 𝛼 = 𝐶 𝛽 = 0.1, (c) 𝐶 𝛼 = 𝐶 𝛽 = 0.3, (d) 𝐶 𝛼 = 𝐶 𝛽 =0.5, (e) 𝐶 𝛼 = 𝐶 𝛽 = 0.7, (f) 𝐶 𝛼 = 𝐶 𝛽 = 0.8, (g) 𝐶 𝛼 = 𝐶 𝛽 = 1, (h) 𝐶 𝛼 = 𝐶 𝛽 = 1.2, (i) 𝐶 𝛼 = 𝐶 𝛽 = 1.4, (j) 𝐶 𝛼 = 𝐶 𝛽 = 3, (k) 𝐶 𝛼 = 𝐶 𝛽 = 10. 78 Figure 24. Final configurations for 9-agent teams: (a) P0 field, (b) - (k) P3-field, (b) 𝐶 𝛼 = 𝐶 𝛽 = 0.1, (c) 𝐶 𝛼 = 𝐶 𝛽 = 0.3, (d) 𝐶 𝛼 = 𝐶 𝛽 =0.5, (e) 𝐶 𝛼 = 𝐶 𝛽 = 0.7, (f) 𝐶 𝛼 = 𝐶 𝛽 = 0.8, (g) 𝐶 𝛼 = 𝐶 𝛽 = 1, (h) 𝐶 𝛼 = 𝐶 𝛽 = 1.2, (i) 𝐶 𝛼 = 𝐶 𝛽 = 1.4, (j) 𝐶 𝛼 = 𝐶 𝛽 = 3, (k) 𝐶 𝛼 = 𝐶 𝛽 = 10. While Fig. 21– Fig. 24 provide straightforward views of the agent teams’ task performance under different settings, some critical metrics of training results can be quantitively summarized in boxplots, as shown in Fig. 25. It describes the data distributions in Fig. 21-Fig. 24, where x-axes present coefficient values (i.e., 𝐶 𝛼 and 𝐶 𝛽 ) in the reward shaping fields varying from 0.1 to 10, and y-axes are the statistical indicators. In each box plot, the orange line represents the median Euclidean distance of one group of training (20 cases), and the box area (majority performance) 79 shows the data range from the first quartile (Q1) to the third quartile (Q3). The flier points shown in Fig. 25 are those out of the 1.5 * IQR (Q3-Q1) (outliers). (a) 3-agent teams (b) 5-agent teams (c) 7-agent teams (d) 9-agent teams Figure 25. Euclidean distance boxplot of training results with different reward shaping fields. Effective Range: As shown in Fig. 22, the task results of the trained teams follow a pattern that the proportion of cases with the higher best performance and majority performance and lower majority deviation rises as the coefficients increase, but at the cost of more outliers. At the same time, there exist thresholds of the coefficients. The training suffers the risk of having very low best, average, and majority performance and high deviation when coefficients are out of that range and below or over the thresholds. In Fig. 22-(k), the data points are distributed very sparsely, indicating that nearly no teams can learn well when the coefficients are set to 10. From a 80 statistical perspective, Fig. 25-(b) shows the same tendency. When the coefficients are out of the effective range, the data distributions deviate from the goal extremely. Mild Reward Shaping: Fig. 22-(b), Fig. 22-(c) and Fig. 22-(d) show the results trained by a mild or flatter reward shaping field whose coefficients are 0.1, 0.3 and 0.5. All the cases are distributed around the target shape (good average performance), but the high deviation means only a few of them can perform perfectly. 
It indicates that flatter reward shaping fields can benefit learning by lowering the barrier of catching the system preference but limit the teams’ potential to perform perfectly. A mild reward- shaping field leads to somehow ‘okay’ results but far from ‘perfect.’ Sharper Reward Shaping: As reward shaping fields become sharper (i.e., coefficient values are bigger), the distributions concentrate more on the central point, meaning the agent teams can learn better and create close to perfect configurations with higher best performance, better majority performance, and lower deviation of the majority teams. (a) 3-agent teams (b) 5-agent teams 81 (c) 7-agent teams (d) 9-agent teams Figure 26. Mean and standard deviation of Euclidean distance with different reward shaping fields. Fig. 26 focuses on tracking the performance of the majority of teams (majority performance) that are the main data clusters. In Fig. 26, the orange error bars present the mean (average performance) and deviation of the majority of teams among 20 trials, while the blue error bars present all teams as a comparison. One can observe that the majority teams behave much better than the whole teams, with a smaller mean and deviation when the coefficients increase. It indicates that the shaper reward shaping fields encourage pioneering teams to outperform others and become more capable of performing given tasks. However, as the gradient increases, overfitting happens, and more outliers appear, gradually leading to larger deviations, as indicated in Fig. 22-(h), Fig. 22-(i), and Fig. 22-(j). The blue error bars for 1.2, 1.4, and 3 coefficients in Fig. 26 show the same tendency. The results have revealed that sharper reward shaping is desirable but may risk overfitting for training failures. Therefore, engineers can regulate the risk by tuning the gradients of the reward shaping field, pointing to the scale of the rules’ strictness of reward shaping field analysis and design. 82 4.5.3 Interaction with team size An organization’s performance is influenced greatly by its size, which also applies to self- organizing systems. Usually, small teams have higher flexibility and spend lower costs on team coordination and task performance, but their ability and capacity are limited. Contrastively, large teams have a larger potential to implement more complex strategies, but at the cost of higher requirements on the information integration in a complex system for teams’ efficient training. In this subsection, the effect of team size on reward shaping with different gradients is assessed. Based on the simulation results of P3-field with different gradients (0.1, 0.3, 0.5, 0.7, 0.8, 1, 1.2, 1.4, and 3) in the context of team sizes of 3, 5, 7, and 9, the statistics of Euclidean distance to the goal position of the task performance including standard deviations are examined and presented in Fig. 27, which illustrates how different reward shaping fields interact with team size. In Fig. 27, the x-axis represents gradient values, and for each value, there are task results of four trained teams with the size of 3, 5, 7, and 9, respectively. The y-axis indicates the mean of Euclidean distance of 20 training trials and their standard deviations. Consistent with Chapter 4.4, zero Euclidean distance means a perfect “L”. Figure 27. Task performance comparison with different team sizes. 83 In Fig. 
27, it can be seen that larger teams favor the reward shaping fields with bigger gradients, while smaller teams perform better under reward shaping with lower and middle range gradients. In 3-agent groups and 5-agent groups, they exhibit better means and smaller deviations when the coefficients are in the range of 0.3 to 0.5, and their team performance excels the larger teams. When the coefficients become larger (such as 1.2 and 1.4), the sharp fields begin to hurt the teamwork, resulting in larger mean and bigger deviations. On the other hand, for 7-agent teams and 9-agent teams, the performance gradually improves as the field gradients increase. While too mild fields cannot lead to efficient learning, the large teams’ potentials are simulated after the gradients arrive at 0.7. Their performance turns to exceed the small teams with smaller mean and deviation. That implies that more complex organizations demand stricter rules to acquire effective learning. Furthermore, it is intriguing to see that for a team of a given size, there exists a “sweet spot” gradient that leads to the best training outcome, such as better average performance (i.e., closer to the goal configuration) and much smaller deviations of the agent teams in acquiring the assembly knowledge. From Fig. 27, one can observe that the ‘sweep spots’ of 3-agent teams and 5-agent teams appear at 𝐶 𝛼 = 𝐶 𝛽 = 0.5, while 7-agent teams have their best performances around 𝐶 𝛼 = 𝐶 𝛽 = 1, and 9-agent teams reach the best output when the gradients are around 1.4. The trend shows that the ‘sweet spots’ get larger as the team size increases. That provides hints of designing an effective multiagent RL on proper reward shaping field selection corresponding to the size of the agent teams. Furthermore, 3-agent teams’ performance is not as good as other teams in the “L- shape” assembly task, indicating that a certain number of agents is needed for a complex task to release teams’ potential and make an optimal policy practical. 84 Moreover, the effect of team size on training efficiency is examined in Fig. 28. Learning curves of 3-, 5-, 7- and 9–agent teams are respectively presented as green, blue, orange, and purple curves in each plot, which are trained by RS P3-field but with different field gradients. They are the average cumulative rewards of 1-20 random seeds in training. A tendency of how the training efficiency of learning teams get influenced by the RS field gradients can be observed in Fig. 28. As the gradients increase, the gaps between different teams’ performances get larger, meaning that the impact of the RS fields on team size becomes more distinctive. While the smaller teams only show fitness on milder fields, the larger teams reveal higher robustness to various training RS fields with higher max rewards and faster learning speed training results in all RS fields. (a) 𝐶 𝛼 = 𝐶 𝛽 = 0.1 (b) 𝐶 𝛼 = 𝐶 𝛽 = 0.3 (c) 𝐶 𝛼 = 𝐶 𝛽 = 0.5 (d) 𝐶 𝛼 = 𝐶 𝛽 = 0.7 (e) 𝐶 𝛼 = 𝐶 𝛽 = 0.8 (f) 𝐶 𝛼 = 𝐶 𝛽 = 1 (g) 𝐶 𝛼 = 𝐶 𝛽 = 1.2 (h) 𝐶 𝛼 = 𝐶 𝛽 = 1.4 (i) 𝐶 𝛼 = 𝐶 𝛽 = 3 (j) 𝐶 𝛼 = 𝐶 𝛽 = 10 Figure 28. Learning curve comparison of different team sizes with P3-fields. We can see from Fig. 28-(b) to Fig. 28-(e) that the 7-agent teams and 9-agent teams remain the best training efficiency for nearly all RS fields with various gradients. However, 5-agent teams cannot perform as well as 7-agent or 9-agent teams when the gradients are larger than 1. 
Such influence is displayed more obviously on 3-agent teams, showing that those teams have smaller 85 max rewards and slower learning speed when the gradients are more than 0.7. Besides, 3-agent teams suffer from much lower max reward than other groups of learning teams as the fields get sharper, shown in Fig. 28-(g), Fig. 28-(h), and Fig. 28-(i). It indicates that larger teams possess higher capabilities to release the potential learning from a sharper RS field. When the system proposes strict rules, they can encourage the larger teams to perform better but have a slight effect on the smaller size of teamwork. Furthermore, similarly as claimed in Chapter 5.2, an effective range of the RS field gradients exists for training efficiency. When the gradients are set as 10, no teams can learn a valid strategy or gain sufficient rewards, as shown in Fig. 28-(j). 4.6. Summary and Findings Self-organizing systems are desirable for performing complex tasks with changing situations as long as the agents in the system possess sufficient knowledge. The approaches of applying task fields and social fields aim to use simple agents that self-organize in artificial fields with complex mathematical forms. Multiagent reinforcement learning opens ways to attain more sophisticated and knowledgeable agents through training, but its success depends on how the reward functions are designed given the task context. In this chapter, the impact of reward shaping in multiagent reinforcement learning is investigated in the context of “L-shape” assembly tasks. Based on the experiment results and ensuing discussions, the following conclusions can be drawn. • Reward shaping fields can be highly effective in guiding agents’ learning process when the proper reward shaping functions and important reward signals are included in the fields. Even partial reward shaping can improve task performance to a certain extent. 86 • Convex functions with singularity can offer effective guidance for agents to learn how to achieve the best task performance. The task goals should be framed as singular points in reward shaping functions. Convex functions are preferred to provide effective gradients than concave functions for agent teams to learn. • The gradient of reward shaping fields is an essential factor for the successful training of agent teams. Sharper fields benefit teams to reach excellent task performance but at the cost of risking overfitting and bigger deviation, while milder gradients generate overall normal results with fewer outliers. • The effective range of gradient changes depending on the size of agent teams. In general, large teams tend to favor steeper gradients, and smaller teams call for milder gradients. For a given agent team of any size, there is an effective range and even a “sweet spot” of the gradient value that leads to the most effective training and best task performance. 87 Chapter 5. Social Learning in MARL 5.1. Introduction Socialization, the process whereby an individual learns to adjust to a society (group) and behave in a manner [129], is an important phenomenon in nature. Certain norms have been formed in various species groups, which benefits adoption and evolution [130]. In self-organizing systems, the question, “do such social norms exist in artificial robot groups and benefit intelligence growth?” arises naturally. In self-organized systems driven by field-based approaches, social structuring is introduced in the system manually to assist in solving more complex tasks [13]. 
It is represented in forms of social fields that are transferred from pre-defined social rules (sRule) by a field formation operator (FLDs) [13]. It adds the social layers to CSO systems successfully and increases the system complexity to enable cells to tackle more tasks. When a complex system is controlled and driven by learning-based approaches, such as MARL, the ways that robot members organize themselves are more like human teamwork in society. Effective communication and information synchronization is crucial in maintaining high levels of cooperation and productivity within a human team. Meanwhile, behavioral norms are shaped and embedded in human cultures. Similarly, in an AI robot team, robots are free of localized control but decide actions themselves. Thus, it is reasonable to infer that emergent behaviors may exhibit certain patterns in the absence of pre-defined reaction programs. For instance, acceptable behaviors are recognized as maximizing their own interests and achieving a common team goal at the same time. 88 However, the abstraction of the training process makes the team behavior codes hidden behind function approximators such as neural networks, which are hardly understood by a human. That makes AI agents fundamentally different from human workers and the SOS design more challenging. Since designers seek to leverage learning-based approaches to avoid the complexity of converting knowledge into semantic or mathematical formats, a comprehensive comprehension of the mechanisms operating within AI teams is essential. Designing reward functions that provide more informative feedback is a method for augmenting system performance with exterior knowledge. In contrast, social learning fosters individual agent evolution and knowledge acquisition through internal interactions within the group. Promising advances in AI can unleash a team's potential beyond the scope of human designers and minimize human involvement in the design process. While it is desirable to create robust and intelligent systems through good design, our research also aims to minimize the design cost of designing and building AI teams. This chapter is motivated by the need for a deeper understanding of AI teams with social abilities. Thus, it will focus on the social learning studies in AI groups and their effects on group behaviors in the context of assembly tasks. Furthermore, we will also explore the underlying reasons for these effects from the perspective of team energy efficiency and teamwork division. It is interesting to see how social information influences group intelligence, and it is believed that the behavior codes would be revealed. Therefore, three research questions are addressed. 1) what is the effect of social learning on learning efficiency and task performance? 2) how does social learning interact with task complexity? 3) why do the social abilities in teams have such an impact? 89 It is worth mentioning that, in this thesis, social learning is in its limited form, i.e., observing the actions of other agents introduced to the MARL process. In order to achieve it, a novel framework is developed and tested in the context of assembly tasks complicated with multiple static obstacles. The experimental results are demonstrated in Chapter 5.4, and in-depth discussions are followed to reveal the group behavior mechanism. 5.2. 
Social Learning Modeling To experimentally study the social learning in multi-agent reinforcement learning, a Multi- agent Social Deep Q-learning (MASo-DQL) framework is introduced in Chapter 3.3 to inject social learning into the multiagent deep Q-learning. In this section, we describe how the new architecture is used to impart social abilities to agents for completing complex assembly tasks. In the implementation, homogenous agents are trained individually with their own neural networks to maximize shared rewards, as all agents work collaboratively to complete the task. The hyperparameters for each individual’s neural network are listed in the table below. Table 8. Hyperparameter settings. Training episodes 20,000 Discount factor 0.99 Memory buffer size 1000 Mini-batch size 32 Target network update frequency 200 Learning rate 0.001 Neural network size (63, 64, 128, 6) epsilon 1 → 0.01 90 As mentioned previously, robots can utilize sensors to send and receive signals which allows them to hold broad views in real-world scenarios. Thus, by providing both task-related and social-related information to agents during the simulation, we can enable them to exhibit social abilities, which refers to the team's ability to synchronize information among team members at each step. The task- related and social-related information is obtained from a robot’s Task Views and Social Views, respectively. In the MASo-DQL framework, they are held in the environmental state 𝑆 𝐸 and social state 𝑆 𝑆 for MARL training. In that way, agents can learn to react based on both environmental occurrences and their team members’ actions. In this study, the environmental state 𝑆 𝐸 remains the same as the state 𝑆 in the experiment in Chapter 4. It is defined as a 63-digit tuple to represent vicinity situations and the dynamical states of the box, shown in equation (41). 𝑆 𝐸 = < 𝑣𝑖𝑐𝑖𝑛𝑖𝑡𝑦 𝑠𝑖𝑡𝑢𝑎𝑡𝑖𝑜𝑛 ,𝑣 𝑥 ,𝑣 𝑦 , 𝜔 > 63 ( 41 ) The social state 𝑆 𝑆 is defined based on the social rules 𝐾 , a hyperparameter involved in governing how agents interact with each other. As introduced in Chapter 3.3, the social rules 𝐾 allow each agent holds its own principles, noted as 𝐾 𝑖 . Thus, the overall social relations are formed by all agents, presenting as 𝐾 ~ 𝐾 1 × … × 𝐾 𝑁 . The social information will be updated at every step of the training. In this study, we limit social abilities to Social Levels 0 and 1 to address the performance comparison with and without social abilities. Social Level 1 refers to agents being able to observe others' actions, though observation ranges may vary depending on sensor ability. Social Level 0 serves as a benchmark that refers to no social abilities. 91 As known previously, the social abilities in level-1 are limited in observing other members’ action choices without differentiating their identities. Therefore, 𝑆 𝑆 can be represented in a tuple holding the observed signals. The length of 𝑆 𝑆 varies depending on different observational ranges. It simulates the scenarios in which robots have the ability to receive signals from their neighboring robots. Given the potential for the amount and representation of social information to have varying effects, this study defines four types of social abilities in the context of assembly tasks, which are listed below. • No Social (NS) – No social abilities. • Observe Two Neighbors (ON2) – Agents can observe two neighboring areas and know if those areas are pushed by other agents. 
• Observe Four Neighbors (ON4) – Agents can observe four neighboring areas and know if those areas are pushed by other agents. • Observe Five Neighbors (ON5) – Agents can observe five neighboring areas and know if those areas are pushed by other agents. The baseline is set as “NS,” referring to “No Social,” which applies multiagent deep Q-learning introduced in Chapter 3.2 for training. In this case, agents only receive the environmental state. 92 Figure 29. Area division around the dynamic box. Social abilities in this study are distinguished by the view range to investigate the effect of social learning. As shown in Fig. 29, the box is divided into six regions, and at each step, one agent can choose one of them to push. Thus, all the agents in the team are distributed in the six areas around the dynamic box. Agents with different social abilities can observe their neighboring areas with a predefined range and receive the corresponding social state in each step. In the case of “Observe Two Neighbours (ON2)”, we define that one agent can see if other agents appear in its left and right adjacent areas, as shown in Fig. 30-(b). Thus, the social state is defined as: 𝑠 𝑖 𝑆 = < 𝐼 𝑙𝑒𝑓𝑡 , 𝐼 𝑟𝑖𝑔 ℎ𝑡 > 𝑂𝑁 2 ( 42 ) where 𝐼 𝑙𝑒𝑓𝑡 1 and 𝐼 𝑟𝑖𝑔 ℎ𝑡 1 are indicators to show whether the left and right adjacent areas of the i th agent are pushed by someone. Similarly, when the agent has broader views to observe left two and right two areas, the social state in the case of “Observe Four Neighbours (ON4)” is represented in a four-digit tuple as in the following equation, shown in Fig. 30-(c). 93 𝑠 𝑖 𝑆 = < 𝐼 𝑙𝑒𝑓𝑡 2 , 𝐼 𝑙𝑒𝑓𝑡 1 , 𝐼 𝑟𝑖𝑔 ℎ𝑡 1 ,𝐼 𝑟𝑖𝑔 ℎ𝑡 2 > 𝑂𝑁 4 ( 43 ) where 𝐼 𝑙𝑒𝑓𝑡 1 and 𝐼 𝑟𝑖𝑔 ℎ𝑡 1 are indicators to show if the left and right adjacent areas of the i th agent are pushed by someone. 𝐼 𝑙𝑒𝑓𝑡 2 and 𝐼 𝑟𝑖𝑔 ℎ𝑡 2 point to the secondary adjacent areas. In the case of “Observe Five Neighbours (ON5)”, one agent can see all other positions wherever it stands around the box, shown in Fig. 30-(d). The social state can be represented: 𝑠 𝑖 𝑆 = < 𝐼 𝑙𝑒𝑓𝑡 2 , 𝐼 𝑙𝑒𝑓𝑡 1 ,𝐼 𝑜𝑝𝑝𝑜𝑠𝑖𝑡𝑒 , 𝐼 𝑟𝑖𝑔 ℎ𝑡 1 ,𝐼 𝑟𝑖𝑔 ℎ𝑡 2 > 𝑂𝑁 5 ( 44 ) where 𝐼 𝑙𝑒𝑓𝑡 1 and 𝐼 𝑟𝑖𝑔 ℎ𝑡 1 are indicators to show if the left and right adjacent areas of the i th agent are pushed by someone. 𝐼 𝑙𝑒𝑓𝑡 2 and 𝐼 𝑟𝑖𝑔 ℎ𝑡 2 point to the secondary adjacent areas. 𝐼 𝑜𝑝𝑝𝑜𝑠𝑖𝑡𝑒 points to the opposite area of the i th agent. (a) NS (b) ON2 (c) ON4 (d) ON5 Figure 30. Agent with different social abilities in the context of box-push task. Therefore, the state in MASo-DQL can be reconstructed as 𝑆 𝑆 = 𝑆 𝐸 ⌢ 𝑆 𝐾 to contain full information, which leads agents to learn and iterate their policies from both task-related and social- related signals. Among a team of agents, all agents are homogeneous and possess the same social 94 abilities, but their received social information may vary depending on their relevant positions to the dynamic box and field of view. Additionally, the action space and reward schema in this study are set identically to those used in Chapter 4, as shown in equations (23) to (26) and equation (37). The reward functions utilized in all cases remain the same and employ P3 reward shaping fields for angles 𝛼 and 𝛽 , with coefficients set to 1. The notations for the four cases are summarized in Table 9. Table 9. Notations in MASo-DQL in NS, ON2, ON4, and ON5. NS ON2 ON4 ON5 𝐾 N/A Each agent can see if its left and right adjacent two areas are pushed in the current step. Each agent can see if its left two and right two adjacent areas are pushed in the current step. 
Each agent can see if all other areas are pushed in the current step. 𝑨 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 22) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 22) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 22) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 22) 𝑺 𝑬 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 40) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 40) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 40) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 40) 𝑺 𝑺 N/A 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 42) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 43) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 44) 𝑹 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 23) – (26), ( 37) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 23) – (26), ( 37) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 23) – (26), ( 37) 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 ( 23) – (26), ( 37) 5.3. Experiment Design The experiments are simulated in pygame [127] and pymunk [128], the same as in Chapter 4. To study the social learning in MARL, the experiments are designed to address research questions 1) what is the impact of these social learning capabilities on the assembly tasks? 2) how does social learning interact with task complexity? 3) why the social abilities have such an impact, and how do they influence teamwork? Fig. 31 illustrates the experiment design of this study. Two 95 independent variables are set as input, which are social abilities illustrated in Fig. 30 and task complexities. Correspondingly, learning efficiency, training quality and team energy efficiency are applied as metrics to evaluate the learning. Furthermore, the number of steps required to complete the task serves as an indicator of the effectiveness of the learned policy. By examining both the number of steps and the task performance, we can evaluate the energy efficiency of a robot team. The study focuses on its implementation in the “L”-shape assembly tasks involving obstacles. Different numbers and positions of the obstacles in the fields are studied to investigate the interaction in the context of increasing levels of task complexity. The details of the experiment design will be discussed in the following section. Figure 31. Experiment design for the social learning study. 5.3.1 Independent variables Social Ability: To study the effect of social learning, this thesis discusses different social abilities of an agent team, including ON2 (Observe 2 Neighbours), ON4 (Observe 4 Neighbours), and ON5 (Observe 5 Neighbours). Besides, the agent teams without social capability are set as a benchmark, named NS (No Social). It is assumed that broader social views, which feed more information in training, mean stronger social ability. As illustrated in Chapter 5.2, the agent teams with social abilities can receive the information through the social state 𝑆 𝑠 , as shown in equations (42), (43), and (44). 96 Tasks Complexity: Since discovering the latent mapping between task and system design is one of the main goals of this study, we add more obstacles in the field to increase task complexity. As shown in Fig. 32, one to three static circle obstacles (represented as grey circles) are arranged in different positions. They are set to hinder the way to the target box, which increases the trajectory searching difficulty during the task process. The exact settings are described in Table 10, and the tasks are numbered “Task 1”, “Task 2”, “Task 3a” and “Task 3b,” respectively. (a) Task 1 (b) Task 2 (c) Task 3a (d) Task 3b Figure 32. New tasks illustrations. Table 10. Environment setting in “L”-assembly tasks with collision avoidance. 
Name Space Task 1 Task 2 Task 3a Task 3b Field size 1000 * 1000 1000 * 1000 1000 * 1000 1000 * 1000 97 Box size 180 * 60 180 * 60 180 * 60 180 * 60 Target box size 180 * 60 180 * 60 180 * 60 180 * 60 w/ Obstacles Yes Yes Yes Yes Box start center coordination (150, 180) (150, 180) (150, 180) (150, 180) Target box center coordination (950, 500) (950, 500) (950, 500) (950, 500) Obstacle center coordination (350, 170) (350, 170) (350, 570) (350, 170) (350, 370) (500, 700) (350, 170) (350, 370) (350, 270) Box mass (kg) 1 1 1 1 Push impulse (1 N∙s ) 1 1 1 1 Note: The unit for length is the pixel. Besides the number and position of obstacles, the new tasks share the same requirements and logistics in performing the task in Chapter 4. • <Move> <Dynamic Box> to <Target Box> • <Rotate> <Dynamic Box> to <Target Box> • <Dynamic Box> and < Target Box> in <L-Shape> • <Move> <Dynamic Box> from <Walls> • <Move> <Dynamic Box> from <Obstacles> In order to study the interaction of social learning with task complexity, the four tasks’ complexities are quantitively estimated based on the method introduced in Chapter 3.6. Following the task model, the complexity contributory factors (CCFs) for each are listed in Table 11. 98 Table 11. Task complexity contributory factors (CCFs). Based on equations (14) to (21), we estimated the complexities and presented them in Table 12. Since the simulation fields are in the pixel dimension in pygame, we normalized the results into dimensionless values by dividing them by 100. The results are listed in the table below. Table 12. Task complexity estimations. The detailed calculation processes are provided in Table 13. CCFs Correlation Task 1 Task 2 Task 3a Task 3b Dimension of the task (DofT) Positive 2 2 2 2 Number of static elements (NSE) Positive 5 5 5 5 Number of dynamic elements (NDE) Positive 0 0 0 0 Number of relevant obstacles (NRO) Positive 1 2 3 3 Physical requirements (PRs) Positive 2 2 2 2 Collision avoidance requirements (CARs) Positive 3 3 3 3 Target configuration requirements (TRs) Positive 2 2 2 2 Complexity Estimation Task 1 Task 2 Task 3a Task 3b Complexity of INPUT 37 53 66 74 Complexity of PROGRESS 6 6 6 6 Complexity of GOAL 2 2 2 2 Task Complexity 455.35 644.76 787.32 889.56 Normalized Task Complexity 4.6 6.4 7.9 8.9 99 Table 13. Task complexity estimations: calculation processes. Task 1 𝐶𝑂𝑀 𝑃 𝐼𝑁𝑃𝑈𝑇 [( ( 1 950 ∙ 1000 ∙ 1 + 1 410 ∙ 1000 ∙ 1 + 1 590 ∙ 1000 ∙ 1) + 0 + ( 1 732.40 ∙ 𝜋 ∙ 15 2 ) )] 2 = ( 5.19 + 0.97) 2 = 37.95 ( 45 ) 𝐶𝑂𝑀 𝑃 𝑃𝑅𝑂𝐶𝐸𝑆𝑆 3 × 2 = 6 ( 46 ) 𝐶𝑂𝑀 𝑃 𝐺𝑂𝐴𝐿 2 ( 47 ) 𝑇𝑎𝑠𝑘 𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦 455.35 Task 2 𝐶𝑂𝑀 𝑃 𝐼𝑁𝑃𝑈𝑇 [( ( 1 950 ∙ 1000 ∙ 1 + 1 410 ∙ 1000 ∙ 11+ 1 590 ∙ 1000 ∙ 1) + 0+ ( 1 732.40 ∙ 𝜋 ∙ 15 2 + 1 600.33 ∙ 𝜋 ∙ 15 2 ) )] 2 = ( 5.19 + 2.14) 2 = 53.73 ( 48 ) 𝐶𝑂𝑀 𝑃 𝑃𝑅𝑂𝐶𝐸𝑆𝑆 3 × 2 = 6 ( 49 ) 𝐶𝑂𝑀 𝑃 𝐺𝑂𝐴𝐿 2 ( 50 ) 𝑇𝑎𝑠𝑘 𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦 644.76 Task 3a 𝐶𝑂𝑀 𝑃 𝐼𝑁𝑃𝑈𝑇 [ ( ( 1 950 ∙ 1000 ∙ 1 + 1 410 ∙ 1000 ∙ 1 + 1 590 ∙ 1000 ∙ 1) + 0+ ( 1 732.40 ∙ 𝜋 ∙ 15 2 + 1 600.33 ∙ 𝜋 ∙ 15 2 + 1 463.25 ∙ 1 2 ∙ 𝜋 ∙ 15 2 ) ) ] 2 = ( 5.19 + 2.91 ) 2 = 65.61 ( 51 ) 𝐶𝑂𝑀 𝑃 𝑃𝑅𝑂𝐶𝐸𝑆𝑆 3 × 2 = 6 ( 52 ) 𝐶𝑂𝑀 𝑃 𝐺𝑂𝐴𝐿 2 ( 53 ) 𝑇𝑎𝑠𝑘 𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦 787.32 Task 3b 100 𝐶𝑂𝑀 𝑃 𝐼𝑁𝑃𝑈𝑇 [ ( ( 1 950 ∙ 1000 ∙ 1 + 1 410 ∙ 1000 ∙ 1 + 1 590 ∙ 1000 ∙ 1) + 0 + ( 1 732.40 ∙ 𝜋 ∙ 15 2 + 1 600.33 ∙ 𝜋 ∙ 15 2 + 1 552.18 ∙ 𝜋 ∙ 15 2 ) ) ] 2 = ( 5.19 + 3.42 ) 2 = 74.13 ( 54 ) 𝐶𝑂𝑀 𝑃 𝑃𝑅𝑂𝐶𝐸𝑆𝑆 3 × 2 = 6 ( 55 ) 𝐶𝑂𝑀 𝑃 𝐺𝑂𝐴𝐿 2 ( 56 ) 𝑇𝑎𝑠𝑘 𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦 889.56 5.3.2 Dependent variables Task Performance: Task performance is applied to evaluate the group's intelligence from the perspective of task quality and completeness. 
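For reference, the complexity figures in Tables 12 and 13 above can be reproduced with a short script. This is a sketch under two assumptions: the three complexity terms are combined by multiplication (inferred from the worked totals, e.g. 37.95 x 6 x 2 is approximately 455 for Task 1), and the 1/c_area normalization of equation (15) is treated as already folded into the listed element terms, as the worked example appears to do.

```python
import math

def comp_input(static_terms, obstacle_terms, dof=2):
    """INPUT complexity, equation (15): inverse-distance-weighted areas,
    summed and raised to the task dimension DofT. Each term is a
    (distance_to_goal, area, importance) triple, following Table 13."""
    total = sum(area * imp / dist for dist, area, imp in static_terms + obstacle_terms)
    return total ** dof

def task_complexity(comp_in, cars=3, prs=2, trs=2):
    """Combine the three terms: COMP_PROCESS = CARs * PRs (eq. 20),
    COMP_GOAL = TRs (eq. 21); the overall product form is inferred
    from the totals reported in Table 12."""
    return comp_in * (cars * prs) * trs

# Task 1: three boundary elements weighted by area 1000 and one circular
# obstacle of radius 15 at distance 732.40 from the goal (values from Table 13).
statics = [(950, 1000, 1), (410, 1000, 1), (590, 1000, 1)]
obstacles = [(732.40, math.pi * 15 ** 2, 1)]
c_in = comp_input(statics, obstacles)      # about 37.8
print(round(task_complexity(c_in), 1))     # about 454, matching Table 12's 455.35 up to rounding
```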
In this chapter, we will look into two metrics to evaluate task performance, which are final configurations and performance scores. Final configurations emphasize the completeness of the “L”-shape in our tasks. Thus, we measure it the same as in Chapter 4.4.2 by querying the signals 𝛼 𝑇 and 𝛽 𝑇 at the termination episode, shown in Fig. 11 and presented in equation (57). The reasons for selecting those two angles are 1) the two angles can fully define the relative positions of the two boxes, and 2) they are sensible signals. Similarly, the final configurations can be quantitively presented as a scatter plot similar to that in Fig. 15. Final configurations= < 𝛼 𝑇 , 𝛽 𝑇 > 𝑒𝑝𝑖𝑠𝑜𝑑𝑒 𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑡𝑖𝑜𝑛 ( 57 ) where 𝑇 is the termination episode. Furthermore, we investigate the extent to which the agent teams’ behaviors align with the reward functions. In order to assess whether the team’s optimal policy has effectively captured human’s 101 desired preferences, we define performance scores shown in equation (58) in the same formation as the reward shaping fields as in equation (37). The coefficient is set as 100 to ensure that the performance score fall within the range of 0 to 100. While a team is graded as 100, it signifies that the team has carried out a perfect outcome. Performance Score= 100 ∙ 𝑒𝑥𝑝 ( −|𝛼 𝑇 − 𝛼 𝑔𝑜𝑎𝑙 |)∙ 𝑒𝑥𝑝 ( −|𝛽 𝑇 − 𝛽 𝑔𝑜𝑎𝑙 |) ( 58 ) This metric provides a clear and direct assessment of a team’s learning ability and performance, as well as allows categorizing the training results by setting a threshold for further analysis. Training results surpassing this threshold can be deemed as adequate performance, indicating successful training. Conversely, results falling below the threshold will be considered underperforming cases, meaning unsuccessful training. In the subsequent sections, we determine a score threshold of 55 based on a comprehensive test and evaluation. Score Threshold= 55 ( 59 ) Team Energy Efficiency: In the context of robotics and autonomous system, energy efficiency refers to the ability to optimize and minimize energy consumption while achieving desired objectives [131]. For a robot team, it involves maximizing the utilization of available energy resources, minimizing energy waste, and employing efficient strategies for performing tasks [132]. In practice, an energy-efficient robot team is able to reduce power consumption, prolong battery life, and maintain sustainable operations [132]. Thus, evaluating the energy efficiency of a team becomes essential, especially in a resource- constrained environment. That drives us to focus on the team energy efficiency of a well-trained agent team in performing complex assembly tasks, which aims at assessing whether the learned team policies can optimize both the team performance and energy consumption. 102 In a multi-robot team, each robot consumes a certain amount of energy to support its policy, such as optimal trajectory, stop times, motion strategies, and communication costs [133]. In our study, our focus is primarily on the cost associated with motion strategies, as the majority of energy consumption is attributed to the "push" motions executed by each robot. Thus, each robot’s energy consumption can be estimated roughly proportional to the number of steps or actions taken by itself. And the energy consumption of a multi-robot team can be estimated by measuring the work performed by all robots. 
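As a concrete reference, the scoring rule in equations (58)-(59) can be sketched as below. The radian unit of the angles is an assumption here; with the threshold of 55, a team may then deviate by roughly 0.6 rad of combined angular error and still count as adequate.

```python
import math

ALPHA_GOAL, BETA_GOAL = math.pi / 2, 3 * math.pi / 4   # goal angles in radians
SCORE_THRESHOLD = 55.0                                  # equation (59)

def performance_score(alpha_T, beta_T):
    """Performance score of equation (58), graded from 0 to 100."""
    return 100.0 * math.exp(-abs(alpha_T - ALPHA_GOAL)) * math.exp(-abs(beta_T - BETA_GOAL))

def is_adequate(alpha_T, beta_T):
    """Classify a trained team as adequate (successful) vs. underperforming."""
    return performance_score(alpha_T, beta_T) >= SCORE_THRESHOLD
```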
As the team size remains constant throughout our subsequent studies, we employ the number of steps as a metric to evaluate energy consumption. Hence, we utilize the performance score shown in equation (58) alongside the number of steps to evaluate team energy efficiency, and such a metric can be built into an evaluation matrix, as shown in Table 14. An ideal team is expected to complete a task in fewer steps while achieving good performance, meaning attaining optimal performance and economic efficiency simultaneously. Table 14. Team energy efficiency evaluation matrix. Performance Scores Low High Number of Steps More Poor Mediocre Fewer Mediocre High To distinguish and classify the training results into different conditions, we employ score and step thresholds as criteria. This enables the categorization of the training outcomes into four distinct conditions in Table 14. The best case is when a team can achieve high performance scores within fewer number of steps. That condition implied that the team employed an efficiency strategy and minimized energy consumption. Thus, it is considered “High” team energy efficiency. When a team obtains low performance scores with fewer steps, it means that the team has not wasted much 103 energy on the task when dealing with a challenging task. In this case, we consider the team's energy efficiency as “Mediocre” even though the team is underperformed. One situation that can arise is when a team completes a task with a moderate performance score but requires a higher number of steps. This indicates that the team has to expend more energy to achieve the desired task output. Such team strategies can be considered as having "Mediocre" team energy efficiency, which is acceptable but not the most optimal. The worst cases are when teams fail to achieve both high task performance and reasonable energy utilization. That is classified as “Poor” team energy efficiency. Furthermore, in order to visualize a group of training results of team energy efficiency, we can plot them in terms of performance scores and the number of steps, as shown in Fig. 33. Based on Table 14, the fields can be divided into four quadrants, which are Area U-eC (Underperforming & Energy-Consuming), Area U-eP (Underperforming & Energy-Preserving), Area A-eC (Adequate & Energy-Consuming) and Area A-eP (Adequate & Energy-Preserving) to represent corresponding conditions of team energy efficiency. The illustrations are in Table 15. The desired outcome of the training results is indicated by the red arrow in the field depicted in Figure 33. 104 Figure 33. Team energy efficiency area division. Table 15. Team energy efficiency evaluation area illustration. Area Performance Scores Number of Steps Explanation Team Energy Efficiency A-eP High Few Adequate team performance and Energy-preserving. High A-eC High More Adequate team performance but Energy-consuming. Mediocre U-eP Low Few Energy-preserving but underperforming team behaviors. Mediocre U-eC Low More Underperforming team behaviors and Energy-consuming. Poor For better understanding, the strategy examples in the context of Task 3b are provided in Fig. 34. In each sub-plot of Fig. 34, the optimal team strategies after training are shown as pink box trajectories, and the pink circles represent obstacles. It can be observed that the team outputs the most satisfactory team policy shown in Fig. 34-(d), which represents one result in Area A-eP. The 105 dynamic box has a successful trajectory characterized by appropriate displacement and rotations. 
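The quadrant assignment of Fig. 33 and Table 15 can be expressed compactly. In the sketch below, the score threshold follows equation (59), while the numeric step threshold is an illustrative assumption, since the text fixes it only qualitatively.

```python
def energy_efficiency_area(score, steps, score_threshold=55.0, step_threshold=500):
    """Map one trained team onto the quadrants of Fig. 33 / Table 15.
    The step_threshold value is a placeholder assumption."""
    adequate = score >= score_threshold
    preserving = steps <= step_threshold
    if adequate and preserving:
        return "A-eP"   # adequate and energy-preserving  -> high efficiency
    if adequate:
        return "A-eC"   # adequate but energy-consuming   -> mediocre
    if preserving:
        return "U-eP"   # underperforming but energy-preserving -> mediocre
    return "U-eC"       # underperforming and energy-consuming  -> poor
```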
In Fig. 34-(b), the trajectory, an example in Area A-eC, successfully completes the task but requires a greater number of steps to reach the goal. It is acceptable but not optimal. Additionally, the trajectories in Area U-eC and Area U-eP do not produce successful task results, but the difference lies in the cost incurred for a failed attempt. It is preferable for a team to consume less energy in case of failure, as shown in Fig. 34-(c). The worst cases are observed in Fig. 34-(a), the example in Area U-eC, where teams fail to complete the task while consuming large amounts of energy.

(a) U-eC Strategy (b) A-eC Strategy (c) U-eP Strategy (d) A-eP Strategy
Figure 34. Team strategy examples in different areas.

Teamwork division: In addition to examining the energy efficiency of a team, we also want to analyze the workload division within the team. In a collaborative task, each robot incurs a certain cost to perform its own role. It is crucial that each unit of cost results in a positive outcome to ensure efficient teamwork. As described previously, each robot can select one of six actions to push one position around the dynamic box, as shown in equation (22) and Fig. 29, and their joint actions result in the dynamic box’s movement, including displacement and rotations. Physically, there exist three pairs of actions that counteract each other, resulting in zero net force or moment when we take the center of the dynamic box as the center of mass of the rigid body, as shown in Fig. 35.

(a) Action 1-3 (b) Action 4-6 (c) Action 2-5
Figure 35. Conflicting action pairs.

Fig. 35-(a) and (b) show that the action pairs Action-1/Action-3 and Action-4/Action-6 result in zero net external moment, while the pair Action-2/Action-5 leads to zero net external force in Fig. 35-(c). Thus, if any two agents select a conflicting action pair, it is reasonable to say that they make no positive contribution to pushing the dynamic box at the current step. This situation should be avoided. For instance, if all agents engage in conflicting action pairs, the result can be significant energy consumption without any meaningful output. Hence, we consider any agents involved in the conflicting action pairs shown in Fig. 35 as idle agents, i.e., robots that are not currently contributing to the task. Such idle robots can be viewed as ineffective workers and underutilized resources. In our studies, we assume that the team size is set appropriately, so all agents are expected to actively participate and make positive contributions to the task. In accordance with the above, the agents at the current step can be categorized as Main Contributors, who contribute to the object’s movement, and Idle Workers, who are involved in any of the conflicting action pairs. The teamwork division is assessed by examining the agents’ action choices at every step, allowing varying work allocations across different timesteps.

Table 16. Teamwork division.
Agents               Explanation                                                                        System Preference
Main Contributors    Those that affect the dynamic box, resulting in either displacement or rotation.   Preferable
Idle Workers         Those involved in conflicting action pairs.                                        Not preferable

Learning Efficiency: The learning curves are important metrics for monitoring and evaluating training performance in RL. In this study, we will look into two indicators in the learning curves: the maximum convergence rewards and the learning speed.
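Before turning to the learning-curve metrics, the teamwork-division rule above can be made concrete: at every step, an agent is flagged as idle when its push is cancelled by a teammate's. A minimal sketch follows, assuming actions are indexed 1-6 as in Fig. 29 and that the cancelling pairs are (1, 3), (4, 6), and (2, 5) as in Fig. 35; the one-to-one matching of cancelled agents is our reading of the rule, and the function names are illustrative.

```python
from collections import Counter

# Cancelling pairs from Fig. 35: each pair yields zero net force or moment on the box.
CONFLICT_PAIRS = ((1, 3), (4, 6), (2, 5))


def count_idle_workers(joint_action):
    """Count agents whose pushes are cancelled out at the current step.

    joint_action: one action index (1-6) per agent, e.g. [1, 3, 2, 2, 5].
    Cancelled agents are matched one-to-one within each conflicting pair;
    every matched pair contributes two idle workers.
    """
    counts = Counter(joint_action)
    return sum(2 * min(counts[a], counts[b]) for a, b in CONFLICT_PAIRS)


def count_main_contributors(joint_action):
    """Agents not flagged as idle are counted as main contributors."""
    return len(joint_action) - count_idle_workers(joint_action)
```

Summing the per-step idle counts over a completed episode gives totals of the kind reported later in Table 17.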
Since the target of reinforcement learning is to maximize the accumulated rewards through continuously optimizing policies, the maximum convergence reward plays a crucial role in showing how well a trained policy performs the task. As another crucial metric, the learning speed tells how fast agents can master an optimal policy and can be estimated by equation (60). Speeding up learning is an explicit concern in some research; otherwise, deep reinforcement learning may be limited by expensive computational resources.

Learning speed = (Max convergence rewards) / (Number of episodes to reach the max rewards)   (60)

5.4. Results and Discussion

Four studies are conducted to address the research questions introduced in Section 5.1, which are 1) what is the effect of social learning on learning efficiency and task performance? 2) how does social learning interact with task complexity? 3) why do the social abilities in teams have such an impact? The agent teams with different social abilities, covering NS (No Social), ON2 (Observe Two Neighbours), ON4 (Observe Four Neighbours), and ON5 (Observe Five Neighbours), are trained to perform different tasks. Similar to Chapter 4, each group of experiments contains 20 training cases with random seeds 1-20. The team size is set as 10, and the shaping reward fields apply P3-fields with coefficients of 1 for all cases.

5.4.1 Influence of social ability on the task performance

Limitation of Individual Learning. Individual learning can be broadly understood as learning in which the agent directly interacts with its environment without the presence of others or their information [134]. Ignoring the existence of others keeps the system structure relatively simple and highly flexible. However, the guidance provided by the environmental rewards is insufficient for quality outputs when the task becomes more complex. The results in Fig. 36 demonstrate such a limitation.

Fig. 36 presents the results of teams without social abilities for Task 1, Task 2, Task 3a, and Task 3b. It can be observed in Fig. 36-(a) that NS Teams only perform well in the simplest task, Task 1. After adding more static obstacles to the fields, as in Task 2, 3a, and 3b, the dots distribute sparsely, meaning that few teams learn good policies to tackle complex tasks when they conduct individual learning. The results strongly indicate the demand for social information to be involved in training in order to seek better cooperative strategies for the agent teams.

(a) NS Teams in Task 1 (b) NS Teams in Task 2 (c) NS Teams in Task 3a (d) NS Teams in Task 3b
Figure 36. Final configurations of NS Teams in Task 1, 2, 3a, and 3b.

Social Learning Influences. Fig. 37 – Fig. 40 present the task performance of the agent teams with NS, ON2, ON4, and ON5 in all four tasks. Taking the team performance in Task 3b as an example, shown in Fig. 40, the NS team performance is presented in Fig. 40-(a), while Fig. 40-(b), (c), and (d) demonstrate the team performance with social abilities. It can be seen that the teams’ capabilities increase as their social abilities grow. This contrasts with the NS teams, which cannot produce good performance and show high deviation and many outliers. The dots in Fig. 40 (b) - (d) concentrate around the goal (the red dot), and few of them deviate largely. It means that adding social abilities, even pure observation, can help teams to master better strategies and have better cooperation in complex task domains.
The teams become more intelligent to avoid obstacles and assembly parts by observing others’ positions. (a) NS Teams in Task 1 (b) ON2 Teams in Task 1 (c) ON4 Teams in Task 1 (d) ON5 Teams in Task 1 Figure 37. Final configurations of teams with different social abilities in Task 1. 111 (a) NS Teams in Task 2 (b) ON2 Teams in Task 2 (c) ON4 Teams in Task 2 (d) ON5 Teams in Task 2 Figure 38. Final configurations of teams with different social abilities in Task 2. (a) NS Teams in Task 3a (b) ON2 Teams in Task 3a 112 (c) ON4 Teams in Task 3a (d) ON5 Teams in Task 3a Figure 39. Final configurations of teams with different social abilities in Task 3a. (a) NS Teams in Task 3b (b) ON2 Teams in Task 3b (c) ON4 Teams in Task 3b (d) ON5 Teams in Task 3b Figure 40. Final configurations of teams with different social abilities in Task 3b 113 The tendency can be observed clearer in Fig. 41, which shows the statistics of the results of different teams in four tasks. Same as in Chapter 4.5, the box areas show the data points range from Q1 to Q3, and the flier points are those out of the 1.5 IQR. (a) Task 1 (b) Task 2 (c) Task 3a (d) Task 3b Figure 41. Euclidean distance boxplot of training results. Taking Task 3b as an example shown in Fig. 41-(d), it can be seen that the majority performance is greatly improved after adding social learning compared with the baseline (NS Teams). The ON5 Teams give the best output with very good mean, smaller deviations, and fewer outliers in the most complex task. It indicates that a certain level of social capability is helpful, and a proper amount of social information has a big potential to boost team intelligence and improve group behaviors. 114 Furthermore, it can be seen in Fig. 41 that social abilities have a varying impact on team performance in Task 1, Task 2, Task 3a, and Task 3b. The following sections will discuss the implications and the relationship between this impact and task complexity. 5.4.2 Effect of social ability on learning efficiency Fig. 42 presents the learning curves to show the learning process of 20,000 episodes in Task 3b. The blue, red, purple, and green curves represent the cumulative rewards for teams of NS, ON2, ON4, and ON5, respectively. In Fig. 42, the x-axis means the episode number, and the y-axis records the accumulated rewards that the team gains in one episode, following the reward functions. The reward value is the average number of 20 running with 1-20 random seeds. The learning curves monitor the training performance and evaluate the learning efficiency from maximum rewards (convergence rewards) and learning speed. Besides, the shadow areas around each curve indicate the deviation of the 20 cases. Figure 42. Learning curves of NS, ON2, ON4, and ON5 Teams in Task 3b. 115 Impact on the Learning Speed: The inclinations of the learning curves before convergence indicate the speed of learning. It can be seen that the red, purple, and green curves have a flatter gradient than the blue ones, which means the teams with social abilities took a longer time to learn a strategy. That shows the trend that adding more information takes the risk of slowing learning due to more information for processing for each agent. As for a multiagent team like the 10-agent teams in our case, the processing cost increases exponentially [118]. The results reveal the existing tradeoffs between agents’ social abilities and their learning performance. Impact on the Maximum Rewards: In Fig. 
42, the four different teams (NS, ON2, ON4, and ON5) achieve different maximum rewards after convergence. Social ability influences the achieved maximum rewards greatly. It can be seen that the teams with social ability excel the baseline greatly. It indicates that social learning can assist the teams in getting higher rewards and better team performances under the same conditions. Besides, the trends can be seen that the more social information that the learning team hold, the bigger the maximum reward they can achieve. That proves again that team members can learn from each other and have better cooperation by knowing social information via social views. 5.4.3 Interaction with task complexity Based on the above study, it becomes apparent that social learning has varying degrees of influence on teamwork across different tasks. In this study, we will look into the interaction of social learning with task complexity. Thus, the learning curves for all tasks are shown in Fig. 43. 116 (a) Task 1 (b) Task 2 (c) Task 3a (d) Task 3b Figure 43. Learning curves in Task 1, 2, 3a and 3b. It can be observed that the influences of social learning vary with the different task complexity. In Fig. 43-(a), which is the simplest task in our study, the learning curves of all teams are close to each other, sharing a similar learning speed and maximum rewards. As task complexity arises, the influences of social learning become more obvious, presenting larger differences in the maximum rewards. In order to better visualize the social learning effect, Fig. 44 shows the gained reward differences between the cases with social abilities (ON2, ON4, and ON5 shown as red, purple, and green curves in Fig. 43) and their corresponding baseline cases (NS as blue curves in Fig. 43) of all four tasks. The performance differences are clustered in blue, red, pink, and green lines in Fig. 44 to 117 represent Task 1, Task 2, Task 3a, and Task 3b, respectively. The x-axis in Fig. 44 means the training episodes, while the y-axis shows the reward differences between cases with social abilities and without social abilities. When the difference is positive, it means that social ability benefits learning. The bigger the difference, the more advantages a team gets from socialization. And the negative differences mean the opposite. (a) less complex tasks: Tasks 1 & 2 (b) more complex tasks: Tasks 3a & 3b Figure 44. Reward difference to baseline (NO SOCIAL). Social Impact on Less Complex Tasks: In Fig. 44-(a), the blue curve cluster shows the reward differences of Task 1, whose task complexity is estimated at 4.6. The blue group fluctuates around the zero line, meaning that social learning has not provided many benefits in achieving more rewards than NS cases. The red curve cluster presents the results of Task 2, which has two obstacles in the fields; thus, its task complexity is estimated as 6.4. Comparing the blue curve cluster, the red ones have slightly higher positive reward differences but in a minor content. That indicates that social learning has not shown adequate benefits on learning behaviors under the condition that the tasks are relatively simple. Social Impact on More Complex Tasks: The reward differences in Task 3a and Task 3b is shown as the pink group, whose task complexity is 7.9, and the green curve cluster, whose task complexity 118 is estimated as 8.9. It can be seen that the green cluster has bigger differences and is much more positively heavier than the pink group after the initiating learning phase. 
Besides, they both have far larger positive reward differences compared with the blue and red clusters in Fig. 44-(a). That demonstrates a trend that tasks of higher complexity require social learning more urgently and can gain more benefits from socialization. The proper social levels and capacities are relevant to the task’s complexity. For tasks with less complexity, adding too much socialization may hurt learning. It indicates that social ability has a cost on the learning process, and task complexity is one important factor influencing its effects. Furthermore, a proper system design is required for a good output.

5.4.4 Energy Efficiency Analysis

In previous sections, the influence of social ability on task performance and learning efficiency has been demonstrated. However, the underlying reasons and mechanisms for this impact are not yet fully understood. Thus, this sub-section aims to investigate the variations from the perspective of energy utilization and, furthermore, to answer the question of why the social abilities in teams have such an impact.

Team Energy Efficiency. As discussed in Chapter 5.3.2, the efficiency of energy utilization may vary among teams, and it is important for teams to achieve good performance with reasonable resources. Thus, we plot the training results generated by teams with or without social abilities to delve into the energy utilization mechanism. The team energy efficiency matrix shown in Fig. 33 demonstrates how efficiently resources are used to ensure sustainable operations and meet designers’ expectations of task performance. Thus, we plot the team energy efficiency results of NS, ON2, ON4, and ON5 teams in Task 1 in Fig. 45.

Considering that the task complexity of Task 1 is estimated as 4.6, we set the score threshold as 55 and the threshold of required steps as 25 based on tests. The performance scores and the corresponding number of steps of each team's training results are plotted on the x-axis and y-axis, respectively, as individual data points. The blue data points in Fig. 45 represent team scores above 55, indicating qualified results, while the red data points mean team scores below 55, indicating unqualified or unsuccessful training. The field in each plot in Fig. 45 is divided into four areas by the dashed gray lines based on the score and step thresholds. The upper left, upper right, lower left, and lower right areas represent Area U-eC, Area A-eC, Area U-eP, and Area A-eP, respectively. The training’s goal is to increase the number of data points in Area A-eP, which represents teams that exhibit low energy consumption and high performance. Therefore, we have plotted the regression line of the data points in this area to highlight this trend, represented as blue lines in the plots.

(a) NS Teams in Task 1 (b) ON2 Teams in Task 1 (c) ON4 Teams in Task 1 (d) ON5 Teams in Task 1
Figure 45. Team energy efficiency evaluations for Task 1.

It can be observed in Fig. 45-(a) that most data points are located in Area A-eP, and only a few teams fail to learn the task strategy. Besides, most NS Teams obtain very high scores ranging from 80 to 100. The result suggests that NS Teams can develop highly efficient strategies that utilize energy optimally. Fig. 45-(b), (c), and (d) demonstrate the team energy efficiency results of ON2, ON4, and ON5 teams. It can be seen that energy efficiency decreases after adding social ability to teams. Compared with the data distribution in Fig. 45-(a), more data points appear in Area A-eC.
It suggests that the teams with social abilities tend to require more steps and consume more energy to achieve comparable task performance as the NS teams. This implies that adding social abilities to teams for relatively simple tasks risks lower energy efficiency. Designers should be conservative in doing so, considering that the cost may not be economical for simple tasks. Fig. 46 presents the team energy efficiency results for the hardest task, Task 3b, whose task complexity is 8.9. To maintain the task criteria, the performance scores threshold remains at 55, while the threshold of required steps is set as 35, considering the higher task complexity of Task 3b. 121 It can be observed that less than half of NS teams are able to complete the task with acceptable quality in Fig. 46-(a). The majority of data points are clustered in Area U-eP, indicating that those teams failed to figure out an effective team strategy. Few data points are in Area U-eC, representing that the teams are unable to find a good solution even though it consumes a large amount of energy. The training results of teams with social abilities are presented in Fig. 46-(b), (c), and (d). In Fig. 46-(b), the majority of teams are observed to have higher numbers of steps as their optimal solutions. While the number of steps increases, the training process does not converge to the optimal solution, showing Mediocre team energy efficiency for ON2 teams. After adding stronger social abilities, the performances improved a lot, and the majority of ON4 and ON5 teams are distributed in Area A-eP, as shown in Fig. 46-(c) and (d). (a) NS Teams in Task 3b (b) ON2 Teams in Task 3b (c) ON4 Teams in Task 3b (d) ON5 Teams in Task 3b Figure 46. Team intelligence evaluation fields for Task 3b. 122 The results suggest that social abilities can encourage teams to explore more strategies but at the cost of consuming more energy. What’s more, it raises the risk of low energy efficiency. Finding a balance between exploration and energy utilization efficiency in the training requires teams to possess the appropriate level of social ability for a certain task. In less complex tasks, excessive exploration may lead to worse performance and lower team energy efficiency. In contrast, in complex task domains, teams need to engage in exploration during training to find optimal solutions. Thus, teams require a certain level of social abilities to support them in searching in the big state-action space. Fig. 47 provides clearer hints on the effect of social abilities on task performance. Plots in Fig. 47 show the number of steps and their corresponding performance scores in Task 3b, whose x-axis is the team number (controlled by 1-20 random seeds). We use red bars to represent the failure cases (team scores are less than 55) and green bars for the successful cases. Besides the blue dots are the team scores for reference. (a) NS Teams in Task 3b (b) ON2 Teams in Task 3b 123 (c) ON4 Teams in Task 3b (d) ON5 Teams in Task 3b Figure 47. Team performance records in Task 3b. Fig. 47 demonstrates the diverse effects of social abilities on team performance. A significant increase in the success rates of ON4 and ON5 teams compared to NS teams can be observed in Fig. 47-(c) and (d). In contrast, ON2 teams suffer higher rates of failure and consume a larger amount of energy. This suggests that having adequate social abilities can assist a team in handling complex tasks effectively. 
Conversely, having insufficient social abilities may lead to poor team energy efficiency and even worse task performance.

Teamwork Division Analysis. In this section, we investigate the impact of social ability on workload distribution in a team while implementing a policy. As previously mentioned in Chapter 5.3.2, it is assumed that the team size is appropriate, and every agent involved in a task is expected to make a positive contribution. Hence, it is desirable for teams to learn an effective method of managing and allocating tasks to each team member. As previously defined, idle workers are robots involved in conflicting actions, as shown in Fig. 35. We aim to develop joint policies that minimize the number of idle workers. Fig. 48 tracks the number of idle workers in the best NS, ON2, ON4, and ON5 teams among the 20 training runs completing Task 3b, shown by blue, red, purple, and green dashed lines, respectively. The x-axis indicates the step number, while the y-axis represents the number of idle workers out of the 10-agent team at the current step.

Figure 48. Teamwork division of different teams in Task 3b.

It can be observed that the best NS team requires 16 steps to complete the task, but most of the time it has more than 4 idle agents among the 10 workers. The best ON2 team requires more steps and has more idle workers, often exceeding 6 per step. In contrast, when the social abilities are set to be more suitable for this task, the teamwork division becomes more reasonable, resulting in fewer required steps and more contributors in the task implementation process. This shows the positive impact of social abilities on the efficient allocation of work in the team.

Table 17 lists the total number of idle workers observed in the best completed episode. It is worth noting that the best NS team shows 62 instances of idle actions. When social abilities were first introduced, the best ON2 team exhibited a substantial increase in idle actions, with 234 occurrences. On the other hand, the best ON4 team shows a significant reduction in idle actions, with only 74 instances, while the best ON5 team performed the best, with just 40 cases of idle actions. Furthermore, the energy utilization rate can be calculated as shown in equation (61), and the results are presented in the second row of Table 17.

Energy Utilization Rate = 1 − (Consumed Energy by Idle Workers) / (Total Consumed Energy)   (61)

The results indicate that teams with deficient social abilities are less efficient in work division and have lower energy utilization rates. Teams with sufficient social abilities are better able to allocate work reasonably and have fewer idle agents involved in each step. This allows them to utilize resources more efficiently and find more suitable cooperation strategies, leading to higher task performance and energy efficiency.

Table 17. Team energy efficiency evaluation metrics.
The Best Team                                        NS Team    ON2 Team    ON4 Team    ON5 Team
Total number of idle workers in a completed task     62         234         74          40
Energy Utilization Rate                              0.6125     0.4150      0.5647      0.7333

5.5. Summary and Findings

In multiagent deep reinforcement learning, the abstraction of the learning process hides the team behavior codes behind neural networks. That leads to great difficulties in designing explicit systemic and organizational rules or informative reward functions to solve high-complexity tasks.
Therefore, we seek ways of utilizing an SOS’s learning potential for complex tasks. Social learning, as a common phenomenon in nature, is borrowed and modeled in MARL through a novel framework MASo-DQL in our studies. We have opened agents’ social views to encourage them to learn from each other. Based on the experimental results, it is exciting to see that social learning is promising to release a team’s potential beyond the designers’ scope, benefiting an SOS to be more robust and intelligent. Furthermore, the impact of adding social abilities in robot teams and the underlying reasons for such effects are explored and explained from the perspective of energy utilization. The findings can benefit the design of more efficient and capable AI robot systems. Several conclusions can be drawn based on the experimental results. • Social abilities, even pure observation of others’ actions, can potentially assist teams in solving more complex tasks that are hardly solved by individual learning. For a certain task, the effect of social learning is strongly relevant to the task complexity, and appropriate social ability assigned to a team is the key to successful training. • Social learning has a cost on learning performances. Social ability is more urgently desired for tasks of high complexity, benefiting teams with higher capacities in exploration. However, adding a too-high level of social abilities in tasks of low complexity may hurt learning and risk lower energy efficiency. • Social learning influences teamwork allocation. Teams with sufficient social abilities are able to allocate work more reasonably and have fewer idle agents in the performing process, while ones with deficient social abilities may suffer inefficiency in work division and lower energy utilization rate. 127 This chapter has delved into the effectiveness of social learning and its mechanism in social teams. The results not only present the social learning’s potential for implementing complex tasks but also demonstrate its effects and tendency in the context of assembly tasks with different complexity. It is reasonable to believe that there exists a mapping between desired social information with a determined task. Designers should be conservative in adding social abilities to a team, considering that the cost may not be economical. It is worth mentioning that the results are drawn from the simulations in this study. 128 Chapter 6. Knowledge Transfer in SOSs and Impact of Social Abilities 6.1. Introduction Multiagent systems have evolved in both research and practice domains over the past decades. Application domains have also broadened into robotic and vehicular systems, among others. As previously discussed, an SOS consists of multiple agents and can dynamically adapt to changing operational environments. This property makes SOSs particularly effective in handling complex tasks that require coordination among multiple agents. In order to overcome the design difficulties of the rule-based and field-based methods, we have adopted the MARL approach to let agents acquire the task knowledge by themselves through a trial-and-error process. Multiple agents learn how to accomplish a task collaboratively by maximizing a shared reward function. The research results thus far have verified the effectiveness of the approach. However, training agents from novices has a high cost. Learning from thousands of trials makes this method consume a substantial amount of time and computational sources. 
This is especially true when the environment is complex and designers hold limited prior task knowledge, so agent teams must explore possible strategies under sparse reward fields [3]. Therefore, it is desirable that well-trained knowledge can be accumulated, re-utilized, and even inherited among different teams to avoid repetitive training. Transferring knowledge between tasks can not only reduce computational costs but also enhance the capability of agents to handle complex tasks. Building on that point, this chapter conducts knowledge transfer studies to explore the transfer mechanism between different teams. It is hoped that well-trained knowledge can be re-deployed by different teams without further training. Furthermore, the effect of the trained teams’ social ability on knowledge transfer will be explored as well. Since social abilities may bring individuals closer to each other, varying impacts are expected because closer links may lead to a loss of flexibility. In this chapter, the social ability remains consistent with Chapter 5.2, which is limited to observing the behavior of other agents rather than active information exchange. In the deep multiagent reinforcement learning approach that we adopted, the team knowledge is represented as a cluster of neural networks. The following three research questions will be addressed. 1) What are the effects of knowledge transfer between teams of different sizes? 2) What is the impact of knowledge quality on the transfer? 3) Will the effect change when the trained teams have social ability? To answer these questions, we conduct simulations of ‘L’-assembly tasks involving collision avoidance.

6.2. Knowledge Transfer Schema

RL can be applied to capture domain knowledge for various tasks [112]. However, the cost of training repetitively under similar circumstances is high. It is desirable that acquired knowledge can be reused and transferred among different teams with high knowledge quality. In that way, the knowledge can be applied more efficiently and cost-effectively. In MADRL, the task knowledge is embedded in well-trained neural networks (NNs). The parameters and structure of the NNs store information on how to behave in a certain task. Therefore, we explore the mechanism and performance of transferring well-trained NNs to other teams of different team sizes performing the same task. The results are expected to provide hints on the mechanism of knowledge transfer in the same task domain.

In this chapter, the teams that are trained from novice are named Trained Teams (Tr), while the teams that are deployed with the Trained Teams’ knowledge are named Test Teams (Ts). Correspondingly, the size of a Trained Team is written as #Tr, while the size of a Test Team is represented as #Ts. For convenience, the knowledge transferred from Tr to Ts in a task is represented as NNs^{task}_{#Tr→#Ts}. Based on the cases, there are three possible circumstances that may occur.
• Condition-1: #Tr == #Ts
• Condition-2: #Tr > #Ts
• Condition-3: #Tr < #Ts
In Condition-1, the two team sizes are identical. Thus, the knowledge is transferred from each agent in Tr to its counterpart in Ts. In Condition-2 and Condition-3, the sizes of the two teams are different. When #Tr > #Ts, the Test Teams randomly select a subset of NNs^{task}_{#Tr→#Ts} to apply [117]. When #Tr < #Ts, the Test Teams repeatedly deploy NNs^{task}_{#Tr→#Ts} [117].
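The three conditions above reduce to a simple assignment of the trained per-agent networks onto the test team. The sketch below is one possible realization in Python, assuming each agent's policy network is stored as a separate object in a list; the random subsetting for smaller test teams and the repeated (cyclic) deployment for larger ones follow the description above, and the function name is illustrative.

```python
import random


def assign_transferred_policies(trained_nets, test_team_size, seed=0):
    """Deploy a trained team's per-agent networks onto a test team of a given size.

    Condition-1 (#Tr == #Ts): one-to-one transfer.
    Condition-2 (#Tr >  #Ts): the test team randomly selects a subset of the networks.
    Condition-3 (#Tr <  #Ts): the trained networks are deployed repeatedly (cyclically).
    """
    rng = random.Random(seed)
    n_tr = len(trained_nets)
    if test_team_size == n_tr:                      # Condition-1
        return list(trained_nets)
    if test_team_size < n_tr:                       # Condition-2
        return rng.sample(trained_nets, test_team_size)
    return [trained_nets[i % n_tr]                  # Condition-3
            for i in range(test_team_size)]
```

For example, transferring a 10-agent team's networks to an 18-agent test team deploys eight of the networks twice, while a 4-agent test team receives a random subset of four.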
6.3. Experiment Design

In this chapter, we focus on solving Task 1 and Task 3b, as introduced in Chapter 5.3 and shown in Fig. 32. Task 1 involves one obstacle in the field, and its task complexity is estimated as 4.6. Previous experiments have demonstrated that Task 1 can be accomplished well by individual learning teams. Task 3b involves three obstacles in the field, which leaves little room for the dynamic box to pass through; thus, its complexity is estimated as 8.9. Therefore, Task 1 is devised to address the first research question about the impact of team size on the knowledge transfer process, while Task 3b is tested to investigate the other two research questions about the effects of knowledge quality and the team’s social ability. It should be noted that this chapter aims to investigate the knowledge transfer mechanism within the same task domain and to enhance group intelligence in order to successfully accomplish increasingly complex tasks.

Figure 49. Experiment design for the knowledge transfer study.

In the experiments, our focus is on the knowledge transfer mechanism between teams of varying sizes; thereby, we fix #Tr as a control variable. Furthermore, to investigate the influence of social ability on the transfer process, we utilize 10-agent teams with social abilities (ON5) to conduct Task 3b based on the previous tests. This experimental setting ensures that the exploration focuses on meaningful insights derived from successful training outcomes. In each group of studies, the test team size (#Ts) is varied to encompass a broader range of scenarios. Specifically, team sizes of 2, 4, 6, 8, 12, 14, 16, and 18 are selected, allowing for the examination of transfers to both smaller and larger teams. Similar to previous sections, the task performance will be evaluated by querying the final configurations of the “L”-shape, which are represented in a scatter plot of self-angle α and relative angle β, as shown in Fig. 15. Besides, the quality of the transfer can be assessed by the difference between the task performances of the Trained Team and its corresponding Test Team.

6.4. Results and Discussion

6.4.1 Effect of knowledge transfer among teams with different sizes

As introduced, well-trained knowledge is embedded in neural networks and can be transferred among different teams. In this study, a set of 10-agent teams (#Tr = 10) is trained from novice to expert in executing Task 1. The knowledge acquired by these teams has previously been shown to be of good quality. After obtaining the domain knowledge, the well-trained NNs are transferred to new teams of varying sizes. The objective of this investigation is to address the research question: What are the effects of knowledge transfer between teams of different sizes? The training results are shown in Fig. 50 and Fig. 51 below. Fig. 50 presents the 10-agent teams’ training results in Task 1 as a baseline, while Fig. 51 shows the results of transferring NNs^{Task 1}_{#Tr=10} (the knowledge of 10-agent teams in Task 1) into Test Teams with sizes of 2, 4, 6, 8, 12, 14, 16, and 18, respectively. In Fig. 50, it can be observed that the green dots are distributed densely around the red dot (goal), meaning that the quality of NNs^{Task 1}_{#Tr=10} is high and the Trained Teams can implement a very good “L”-assembly.

Figure 50. Final configurations of 10-agent teams (#Tr = #Ts = 10) in Task 1.
Figure 51. Final configurations of transferring NNs^{Task 1}_{#Tr=10} to teams with sizes of 2, 4, 6, 8, 12, 14, 16, and 18 in Task 1.

The sub-figures in Fig. 51 show the cases utilizing NNs^{Task 1}_{#Tr=10} in teams from #Ts = 2 to #Ts = 18. Although the teams are assigned the same task, the outcomes vary greatly. The first four sub-figures show the Condition-2 situation (#Tr > #Ts). It can be seen that the distribution of experimental dots becomes sparser as #Ts decreases. Especially in the case of #Ts = 2, nearly no Test Team can produce a good performance. On the other hand, the last four sub-figures present the Condition-3 situation (#Tr < #Ts). In contrast to the outcomes observed in Condition-2, the distributions in Condition-3 remain as dense as the Trained Teams’ performance. Notably, even in the case where #Ts is set as 18, the knowledge demonstrated by the teams still maintains high quality. The results indicate a trend that the knowledge quality declines significantly when transferred to smaller teams. However, when transferred to larger teams, the quality of knowledge tends to remain, or only slightly diminish. Moreover, as shown in Fig. 52, the trends are more evident when the results are quantitatively analyzed using the Euclidean distance, facilitating a clear demonstration of the observed patterns. In Fig. 52, the x-axis represents #Ts ranging from 2 to 18, while the y-axis shows the Euclidean distance values calculated for the 20 cases in one group of learning teams using equation (40). The results are presented using boxplots, with the same meanings as in previous studies, where the box area shows the data range from the first quartile (Q1) to the third quartile (Q3). The flier points shown in Fig. 52 are those outside 1.5 * IQR (Q3 − Q1). It can be observed that the cases in Condition-3 (#Ts = 12, 14, 16, and 18) keep relatively similar results compared with the Trained Team (#Tr = #Ts = 10), demonstrating similar medians and deviations. However, the training results become worse in Condition-2 (#Ts = 2, 4, 6, and 8), suffering larger medians and deviations.

Figure 52. Boxplot of Euclidean distance of transferring NNs^{Task 1}_{#Tr=10} to teams with sizes of 2, 4, 6, 8, 10, 12, 14, 16, and 18 in Task 1.

The observed tendency provides valuable insights into the mechanism of knowledge transfer between different teams. Members of Trained Teams acquire expertise with respect to their own specialization, enabling them to accomplish tasks cooperatively. When the knowledge is transferred to larger teams, the labor division can be relatively flexible, which increases the possibility of success. Even if some individual agents within the Test Teams do not perform optimally, the overall team can still function effectively. However, transferring knowledge becomes more challenging when Test Teams have fewer team members. Since NNs^{Task 1}_{#Tr=10} hold a certain amount of specialized knowledge for specific roles within a team, a smaller Test Team size implies that certain knowledge may not be fully transferred. That requires members within smaller Test Teams to possess higher competence in the assigned task, which increases difficulty and potentially hinders the quality of team performance.

6.4.2 Impact of knowledge quality on knowledge transfer

In this study, we aim to examine the influence of knowledge quality on the transfer process in order to address the research question: what is the impact of knowledge quality on the transfer?
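Throughout this chapter, transfer quality is summarized by how far each run's final configuration lands from the goal, aggregated over the 20 random seeds into the boxplots of Fig. 52 and the later figures. Below is a minimal sketch of that scoring, assuming equation (40) is the Euclidean distance between the terminal angle pair ⟨α_T, β_T⟩ and the goal pair; the aggregation helper and its names are illustrative.

```python
import math
import statistics


def config_distance(final_config, goal_config):
    """Euclidean distance between the terminal angles <alpha_T, beta_T> and the
    goal configuration (assumed form of equation (40)); smaller is better."""
    (a_t, b_t), (a_g, b_g) = final_config, goal_config
    return math.hypot(a_t - a_g, b_t - b_g)


def boxplot_stats(final_configs, goal_config):
    """Summarize one test-team size: distances over the 20 seeded runs."""
    d = [config_distance(c, goal_config) for c in final_configs]
    q1, median, q3 = statistics.quantiles(d, n=4)
    return {"median": median, "q1": q1, "q3": q3, "iqr": q3 - q1}
```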
In Chapter 6.4.1, the 10-agent Trained Teams mastered how to perform Task 1 well and hold high knowledge quality. Thus, the impact of high-quality knowledge has already been established in the last sub-section. To investigate the effects of low-quality knowledge and delve into the difference, we employ the same teams to tackle Task 3b, considering the complexity of this task. Previous experiments have shown that a learning team without social abilities can hardly complete Task 3b with high quality. The task requires teams to exhibit high-quality cooperation in order to avoid collisions and assemble an “L”-shape at the same time. Thus, a group of Trained Teams lacking social abilities is trained specifically for Task 3b, serving as the knowledge source to be transferred. It is anticipated that the knowledge quality in these teams is relatively low.

Fig. 53 demonstrates the boxplots of the training and transferred results in Task 3b. The fifth bar presents the performance of the Trained Teams, while the others display the transferred cases of #Ts = 2, 4, 6, 8, 12, 14, 16, and 18. The results of the Trained Teams in Task 3b have a higher median, larger deviation, and more outliers compared with the same teams’ performance in Task 1 in Fig. 52. This suggests that the trained knowledge quality to be transferred in this study is relatively low.

Figure 53. Boxplot of Euclidean distance of transferring NNs^{Task 3b}_{#Tr=10} to teams with sizes of 2, 4, 6, 8, 10, 12, 14, 16, and 18 in Task 3b.

Under this circumstance of lower-quality trained knowledge, it can be seen in Fig. 53 that the transferred knowledge quality is unsatisfactory, or even worse, in both Condition-2 and Condition-3. The dots of all groups are very sparse and have large deviations. Moreover, as the team size diverges further from that of the Trained Team, the transferred knowledge quality suffers more significantly. This observation highlights the increased challenge in multi-agent systems when team sizes are significantly different. The results indicate that maintaining a good quality of trained knowledge for transfer is crucial and necessary. When the trained knowledge quality is low, the Test Teams’ performances become worse regardless of the size of the Test Team, and the deviations caused by different team sizes can be amplified by the low-quality sources.

6.4.3 Impact of social ability on knowledge transfer

This sub-section focuses on the impact of social abilities on knowledge transfer in order to explore the research question: will the effect change when the trained teams have social ability? Thus, we add social ability to the agent teams implementing Task 3b via the method outlined in Chapter 5.2. Previous studies in Chapter 5 have shown that incorporating social ability enhances team cooperation to tackle the challenge of multiple obstacles; thereby, this experimental setting ensures that the exploration is based on successful training outcomes. The team sizes of the Test Teams are set to 2, 4, 6, 8, 12, 14, 16, and 18. Fig. 54 shows the performance of the 10-agent Trained Teams with social ability in Task 3b. The data dots distribute densely around the goal, and hardly any outliers are observed. Compared with the Trained Teams’ performance presented as the fifth bar in Fig. 53, it can be seen that the social teams’ performance has improved greatly.
The dot distribution has a smaller mean, smaller deviation, and fewer outliers, which indicates that the trained knowledge quality for Task 3b improves.

Figure 54. Final configurations of 10-agent teams with social abilities (#Tr = #Ts = 10) in Task 3b.

Figure 55. Final configurations of transferring NNs^{Task 3b}_{#Tr=10} to teams with social abilities with sizes of 2, 4, 6, 8, 10, 12, 14, 16, and 18 in Task 3b.

The sub-figures in Fig. 55 show the output of transferring NNs^{Task 3b}_{#Tr=10} to Test Teams with sizes of 2, 4, 6, 8, 12, 14, 16, and 18, respectively. The results show that the majority of Test Teams, including those with sizes of 2, 4, 6, 8, 14, 16, and 18, fail to effectively utilize NNs^{Task 3b}_{#Tr=10}, while only the Test Teams of size 12 demonstrate good outputs with dense distributions. The trend is clearly illustrated in Fig. 56 as well, distinguishing it from the tendency of the previous studies. In this case, even with high knowledge quality, transferring the knowledge to other teams proves to be considerably challenging. Only teams of similar sizes are able to achieve favorable results.

Figure 56. Boxplot of Euclidean distance of transferring NNs^{Task 3b}_{#Tr=10} to teams with social abilities with sizes of 2, 4, 6, 8, 10, 12, 14, 16, and 18 in Task 3b.

The results provide valuable insights into the significant impact of social ability on the knowledge transfer process. It is worth noting that a social team’s policies are learned from both environmental and social information. Thus, the acquired policies rely heavily on teamwork and cooperation, and the social team strategies exhibit a level of complication that can hardly be mimicked and executed by a new team of a different size. Consequently, it can be inferred that a team’s social ability makes the transfer more sensitive to the social characteristics of a system, of which team size is just one feature. This observation explains the mechanism difference between a social team and an individual learning team, and why the knowledge trained by a social team is only easily transferred to teams of similar size. In summary, while the social abilities possessed by a team enhance its capability to implement complex teamwork and strategies, they introduce another trade-off in the form of difficulties in knowledge transfer and re-utilization.

6.5. Summary and Findings

This chapter has discussed knowledge transfer among teams of different sizes in the context of “L”-shape assembly tasks involving collision avoidance in self-organizing systems. Guided by three research questions, this chapter conducts three comprehensive investigations to examine the mechanism of knowledge transfer and its influencing factors. Building upon the findings presented by the current experiments, several conclusions can be drawn:
• In the same task domain, the transfer of acquired knowledge is more successful when it is transferred from trained teams to larger test teams compared to smaller teams. The quality of transferred knowledge is significantly compromised when transferred to smaller teams, resulting in a decline in performance.
• The quality of the trained knowledge plays a crucial role in the process of knowledge transfer. When the quality of the trained knowledge is low, it becomes challenging to transfer it with satisfactory quality.
• The presence of social ability significantly influences the knowledge transfer mechanism.
The transfer process is highly sensitive to the team size, highlighting the need for careful consideration of these factors in the design and implementation of self-organizing systems. Transfer between social teams of similar sizes is able to achieve favorable results.

The results of this study have demonstrated the potential benefits as well as the limitations of knowledge transfer between teams of different team sizes in the same task domain. Valuable insights are offered for further exploration.

Chapter 7. Conclusions

7.1. Summary

This thesis aims at enhancing the effectiveness and intelligence of self-organizing systems to solve complicated engineering tasks. To achieve this, we have focused on exploring three key aspects of RL-based self-organizing systems: reward function design, social learning, and knowledge transfer. These research efforts collectively assist in designing a more autonomous and intelligent system. While reward function design, especially shaping rewards, works as an effective way to develop a more informative system, social learning can foster a more intelligent team by encouraging team members to learn from interaction. Furthermore, given the high cost and resource-intensive nature of RL training, transferring the acquired knowledge among different teams deserves attention.

First of all, we focus on the design of reward shaping fields to incorporate additional information into the systems. Designers strive to provide as much helpful information as possible to the learning agents, even while holding very limited prior task knowledge. While the experimental results demonstrate the necessity of these additional aids for solving “L”-shape assembly tasks, the question arises as to what the appropriate formats for these additional signals are. In deep learning algorithms, learning relies on the gradient descent method to optimize a loss function. This process is analogous to maximizing the reward function in RL. Excitingly, we find that an appropriate representation of the shaped signals is crucial for successful training, and the proposed forms, convex functions with singularities, greatly outperform others.

Identifying a suitable form amounts to manually designing the gradient distributions of the RS fields, and the impact of the gradients has been further investigated. It is observed that these carefully designed gradient distributions can augment the effect of limited information, and a pattern emerges of how their inclinations interact with task performance. The gradients of RS fields act as important influencers in assisting agents in understanding the system's preferences, ultimately leading toward the reward summit. Besides, the inter-relationship between gradients and team size, an important feature of the system, has been investigated as well. Results demonstrate a trend that bigger teams favor sharper fields while small teams prefer mild fields. These findings have provided valuable insights for controlling a system and achieving desired outputs.

Secondly, the lack of awareness of other team members in previous studies and the limitations of individual learning drew our attention. Excessive reliance on external information poses a significant burden on the system design. In response to these challenges, we focus on bringing a fundamental capability of natural groups into AI agent teams, which is social learning.
To enable social learning in a multi-agent system, we propose a novel framework called MASo-DQL, which combines social structure with multi-agent deep Q-learning. In the context of "L"-assembly tasks with collision avoidance, social learning demonstrates remarkable potential in tackling tasks that are difficult for individual learning teams. Furthermore, teams possessing social abilities show enhanced exploration capabilities to execute more complex strategies. This benefits teams in terms of more reasonable work allocation and reduced idle agents during task execution. However, social learning is found to bring trade-offs and costs. Improper task assignment to a team may hurt learning and result in low energy utilization efficiency. 144 Furthermore, our research investigates the knowledge transfer between different teams in the same task domain and discusses its key influencers. One critical issue in RL methods is the required long training time and computational resources. That challenge becomes particularly sound in MARL since the costs increase exponentially as more learning agents get involved. What’s more, tasks in a MAS are accomplished through teamwork and cooperation. It means that successful training requires not only the individual agent's understanding of the action choices based on its state but also the collective dynamics of the team. Thus, the inherent complexity of the transfer process in a multi-agent system makes the studies more challenging. Those motivate us to study and comprehend the knowledge transfer mechanism. The results in Chapter 6 have shed light on the team knowledge transfer properties. It is found that knowledge transfer is more successful when transferred from trained teams to larger teams compared to smaller teams. Additionally, both the quality of the acquired knowledge and the social abilities of the training team significantly influence the transfer process. On the one hand, high- quality knowledge serves as a critical prerequisite for successful transfer. On the other hand, when the training teams possess social abilities, the transfer becomes more difficult, particularly when transferring among teams of very different sizes. To summarize, the research efforts are expected to reveal a pattern in the behavior codes of AI teams and bridge the mutual understanding between humans and agents. We are striving to facilitate system design to achieve an intelligent and autonomous system. It is hoped that this work holds significant value and warrants further exploration in the realm of mechanical engineering. 145 7.2. Contributions The research conducted in this thesis has significantly contributed to the system design area and learning techniques for self-organization. The work elevates the effectiveness and intelligence of self-organized systems for more complicated engineering tasks. More specifically, the contributions span the areas of system design, social learning in multiagent systems, and knowledge reuse among agent teams of different sizes, as highlighted below. • Move from low-level task description (e.g., task fields and social fields) to high-level reward fields representation for system design. o Developed a general mathematical representation of shaping rewards, which can utilize human knowledge efficiently and enhance team performance greatly. o Eliminated the need for a description of task fields by introducing shaping reward fields for self-organized agents to learn their own task knowledge. 
• Foster more capable agent teams through social learning. o Filled the gap of lack of awareness of other team members in the reinforcement learning process. o Revealed the social effects and their deep reasons in multiagent teams working on tasks of varying complexity levels. • Shed light on the reuse of acquired knowledge. 146 o Showed the significant potential of knowledge reuse for decreasing the huge cost of training agent teams. o Deepened the understanding of knowledge transfer mechanisms among teams of different sizes. 7.3. Future Directions In the future, the world is envisioned that robot teams are paired with human workers to combine the strength and speed of robots with the creativity and ingenuity of humans. In order to achieve that, an in-depth understanding of artificial group behaviors and the mechanism of human- robot interaction is still in big demand. The present work has demonstrated the potential of learning-based approaches to control and organize multiagent systems. Therefore, future work should aim at designing multiagent teams with high-level intelligence and autonomy with the lowest cost of design, even being free of the designer’s knowledge limitation, to solve complex engineering tasks in harsh environments. The future directions include: • Development of the potential-based reward shaping fields. The RS fields have been proven to possess great advantages of both field-based and learning-based approaches, which let robots learn from an informative system where the knowledge can be represented in a very simple format. Therefore, it is important to continue to develop and complete the field building for more scenarios. Besides, the dimensions of simulations will expand from 2D to 3D, aligning with real practices. • Data-driven System Design. In order to further generalize the captured design knowledge, it is necessary to collect big data for large-scale training and in-depth 147 behavior mechanism analysis. The latent pattern of intelligent group behaviors is believed to be embedded in the data and expected to be revealed, pointing to AI system design solutions with a specific task. Considering the randomness in the learning method, it is meaningful to explore if there exist generalized distributions of group behaviors for prediction. That benefits not only the design but saving of large computational sources. Moreover, the relationship and interactions among salient system feature such as reward function properties and team size and their impacts on learning efficiency and training quality will continue to be learned in my future plan. • Knowledge Transfer and Inheritance in Different Intelligent Groups. In real practice, it is desirable to maximize the use of knowledgeable brains (e.g., neural networks) rather than train from novice every time. Thus, the study of the knowledge transfer mechanism based on current findings will be needed. Given that team knowledge is encoded within neural networks, it is feasible to explore the intricacies of how team knowledge is possessed and distributed within a team by examining and comparing the trained parameters of the different neural networks. Furthermore, it would be intriguing to see the specific mechanisms underlying the automatic allocation of teamwork within a team, as this can facilitate more effective knowledge transfer. It is possible that certain "powerful" agents, exhibiting dominant roles in teamwork, hold higher transfer values. 
That work is believed to deepen the understanding of team intelligence and elevate its capabilities. • Human-robot Interaction and Collaboration. The co-existence and cooperation of robots and human workers will become an important work mode in engineering. The cooperation forms among teams need to be expanded to speech and image recognition 148 in addition to regular sensors. Importantly, how to represent information in an intelligent system needs to be addressed and expanded to representation learning. The research in this direction will help human-in-the-loop operations to be more intelligent and boost feasibility on multi-objectives missions. 149 Bibliography [1] Crommelinck, M., Feltz, B., & Goujon, P. (Eds.). (2006). Self-organization and emergence in life sciences (p. 360). Dordrecht, The Netherlands:: Springer. [2] Klir, G. J., & Ashby, W. R. (1991). Requisite variety and its implications for the control of complex systems. Facets of systems science, 405-417. [3] Ji, H., & Jin, Y. (2019, August). Designing Self-Organizing Systems With Deep Multi- Agent Reinforcement Learning. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (Vol. 59278, p. V007T06A019). American Society of Mechanical Engineers. [4] Jin, Y., & Chen, C. (2014). Cellular self-organizing systems: A field-based behavior regulation approach. AI EDAM, 28(2), 115-128. [5] Brambilla, M., Ferrante, E., Birattari, M., & Dorigo, M. (2013). Swarm robotics: a review from the swarm engineering perspective. Swarm Intelligence, 7, 1-41. [6] Smith, D. R. (1985). The design of divide and conquer algorithms. Science of Computer Programming, 5, 37-58. [7] Papadopoulou, M., Hildenbrandt, H., Sankey, D. W., Portugal, S. J., & Hemelrijk, C. K. (2022). Self-organization of collective escape in pigeon flocks. PLoS Computational Biology, 18(1), e1009772. [8] Zouein, G., Chen, C., & Jin, Y. (2011). Create adaptive systems through “DNA” guided cellular formation. In Design Creativity 2010 (pp. 149-156). Springer London. [9] The Editors of Encyclopaedia Britannica, "DNA", Retrieved from https://www.britannica.com/science/DNA [10] Chen, C., & Jin, Y. (2011, January). A behavior based approach to cellular self-organizing systems design. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (Vol. 54860, pp. 95-107). [11] Jin, Y., & Chen, C. (2014). Field based behavior regulation for self-organization in cellular systems. In Design Computing and Cognition'12 (pp. 605-623). Springer Netherlands. [12] Chiang, W., & Jin, Y. (2011, January). Toward a meta-model of behavioral interaction for designing complex adaptive systems. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference (Vol. 54792, pp. 1077-1088). 150 [13] Khani, N., Humann, J., & Jin, Y. (2016). Effect of social structuring in self-organizing systems. Journal of Mechanical Design, 138(4), 041101. [14] Humann, J., Khani, N., & Jin, Y. (2014). Evolutionary computational synthesis of self- organizing systems. AI EDAM, 28(3), 259-275. [15] Ji, H., & Jin, Y. (2022). Knowledge Acquisition of Self-Organizing Systems With Deep Multiagent Reinforcement Learning. Journal of Computing and Information Science in Engineering, 22(2). [16] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. 
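As a complement to the knowledge-transfer direction, below is a minimal sketch, assuming a PyTorch-style setup, of how trained policy parameters might be reused as a warm start for a new team and compared layer by layer. The two-layer policy network, its sizes, and the L2 comparison are hypothetical illustrations, not the architecture or analysis used in this dissertation.

```python
# Minimal sketch of parameter reuse ("warm start") and layer-wise comparison
# between agent policy networks.  Assumes PyTorch; the architecture below is
# a hypothetical placeholder, not the networks trained in this work.
import torch
import torch.nn as nn


def make_policy(obs_dim: int, n_actions: int, hidden: int = 64) -> nn.Module:
    """A small feed-forward policy network used purely for illustration."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )


# "Source" agent from a previously trained team (weights here are random
# placeholders standing in for trained parameters).
source_policy = make_policy(obs_dim=8, n_actions=4)

# New agent in the target team: initialize from the source agent's
# parameters instead of training from scratch.
target_policy = make_policy(obs_dim=8, n_actions=4)
target_policy.load_state_dict(source_policy.state_dict())

# Layer-wise L2 distance is one simple probe of how knowledge is held and
# how far a transferred network drifts during subsequent training.
for (name, p_src), (_, p_tgt) in zip(source_policy.named_parameters(),
                                     target_policy.named_parameters()):
    print(name, torch.norm(p_src - p_tgt).item())  # 0.0 right after the copy
```

When the source and target teams differ in size or observation dimension, only layers with matching shapes can be copied directly, which is itself one of the open questions raised above.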