Adaptive agents on evolving network: An evolutionary game theory approach

by Ardeshir Kianercy

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Mechanical Engineering)

May 2013

Copyright 2013 Ardeshir Kianercy

Acknowledgements

I was lucky to encounter many supportive individuals at different times in my life. They all helped me progress in my academic life and toward obtaining my PhD. The figure below shows my academic chronology and the impact of these influential individuals:

- Working together on a fruitful project in Barcelona: Dr. S. Pigolotti, J. Juul, Dr. S. Bernhardsson.
- Unconditional help, great discussions and friendship: Dr. Greg Ver Steeg.
- Members of my PhD committee: Prof. Paul K. Newton, Prof. Roger G. Ghanem, Prof. Juan D. Carrillo (Prof. J. D. Carrillo was also my first mentor in Game Theory, 2007).
- Supervisor, sincere mentor, and devoted supporter during my PhD thesis: Professor Aram Galstyan.
- First mentor in dynamical systems: Professor Firdaus Udwadia.
- Supported my admission and scholarship at USC: Professor Satwindar S. Sadhal.
- First mentor in critical thinking, my uncle Mohammad Reza Shahshahani: deceased at age 23, tortured and executed by officers of the Islamic Republic government in Tehran, Iran, 1981.
- My parents.
(Timeline: 1980, 2004, 2005, 2009, 2010, 2012.)

Figure 1: Support of individuals: this study would have been impossible without them.

Abstract

We consider the dynamics of Q-learning in two-player two-action games with a Boltzmann exploration mechanism. For any non-zero exploration rate the dynamics is dissipative, which guarantees that agent strategies converge to rest points that are generally different from the game's Nash Equilibria (NE). We provide a comprehensive characterization of the rest-point structure for different games, and examine the sensitivity of this structure with respect to the noise due to exploration. Our results indicate that for a class of games with multiple NE the asymptotic behavior of the learning dynamics can undergo drastic changes at critical exploration rates. Furthermore, we demonstrate that for certain games with a single NE, it is possible to have additional rest points (not corresponding to any NE) that persist for a finite range of the exploration rates and disappear when the exploration rates of both players tend to zero.

In addition to Boltzmann Q-learning, we study adaptive dynamics in games where players abandon the population at a given rate and are replaced by naive players characterized by a prior distribution over the admitted strategies. We show how the Nash equilibria are modified by the turnover, and study the changes in the dynamical features of the system for prototypical examples such as different classes of two-action games played between two distinct populations.

Finally, we present a model of network formation in repeated games where the players adapt their strategies and network ties simultaneously using a simple reinforcement learning scheme. It is demonstrated that the co-evolutionary dynamics of such systems can be described via coupled replicator equations. We provide a comprehensive analysis for three-player two-action games, which is the minimum system size with non-trivial structural dynamics. In particular, we characterize the Nash equilibria in such games, and examine the local stability of the rest points corresponding to those equilibria.
We also study general N-player networks via both simulations and analytical methods, and find that in the absence of exploration the stable equilibria consist of star motifs as the main building blocks of the network. Furthermore, in all stable equilibria the agents play pure strategies, even when the game allows mixed NE. Finally, we study the impact of exploration on learning outcomes, and observe that there is a critical exploration rate above which the symmetric and uniformly connected network topology becomes stable.

Table of Contents

Acknowledgements
Abstract
List of Figures
I Introduction
  Chapter 1 OVERVIEW
    1.1 Motivation
    1.2 Reinforcement learning to reach beyond Nash Equilibrium
    1.3 Naive players in the population during learning
    1.4 Co-evolving network topology and player strategies
    1.5 Related Works
      1.5.1 Boltzmann Q-learning
      1.5.2 Dynamics with naive players
      1.5.3 Co-evolving dynamics
  Chapter 2 RESEARCH PROBLEMS DEFINITION
  Chapter 3 CONTRIBUTIONS AND OUTLINE
II Background
  Chapter 4 GAME DEFINITION AND LEARNING IN GAMES
    4.1 Normal game definition
      4.1.1 Nash Equilibrium of a normal game
    4.2 Replicator Dynamics
    4.3 Quantal Response Equilibria (QRE) and logit equilibrium
    4.4 Logit equilibrium as a fixed point of learning dynamics
      4.4.1 Perturbed (smoothed) best response dynamics
      4.4.2 Stochastic evolutionary game theory
III Bifurcation in population by learning in games
  Chapter 5 BOLTZMANN Q-LEARNING DYNAMICS FOR TWO-ACTION TWO-PLAYER GAMES
    5.1 Dynamics of Q-Learning
      5.1.1 Single Agent Learning
      5.1.2 Two-agent learning
      5.1.3 Exploration causes dissipation
      5.1.4 Two-action games
    5.2 Analysis of Interior Rest Points
      5.2.1 Symmetric Equilibria
      5.2.2 General Case
    5.3 Examples
  Chapter 6 LEARNING WITH NAIVE PLAYERS FOR TWO-ACTION TWO-PLAYER GAMES
    6.1 Model
    6.2 Two-agent two-action games
      6.2.1 Matching pennies
      6.2.2 Coordination game
IV Co-evolution of networks and player strategies
  Chapter 7 ADAPTING AGENTS ON EVOLVING NETWORKS
    7.1 Co-Evolving Networks via Reinforcement Learning
      7.1.1 Two-action games
    7.2 Learning without exploration
      7.2.1 3-player games
        7.2.1.1 Nash Equilibria
        7.2.1.2 Stable rest points of learning dynamics
      7.2.2 N-player games
    7.3 Learning with Exploration
V Conclusion
VI Appendices
  Chapter 8 BOLTZMANN Q-LEARNING APPENDICES
    8.1 Classification of games according to the number of allowable rest points
    8.2 Appearance of multiple rest points in games with a single NE
  Chapter 9 LEARNING WITH NAIVE PLAYERS APPENDICES
    9.1 Number of turnover equilibria in two-agent two-action games
  Chapter 10 CO-EVOLUTIONARY LEARNING APPENDICES
    10.1 Isolation cost (C_p) effect on learning dynamics
    10.2 Rest points and Local Stability
    10.3 Block matrix identity
    10.4 Jacobian eigenvalues in two-action three-player dynamics for the homogeneous solution
    10.5 Homogeneous network implies symmetric strategy
    10.6 N-player homogeneous network solutions
References

List of Figures

Figure 1: Support of individuals: this study would have been impossible without them.
Figure 5.1: Graphical illustration of the rest-point equation for the symmetric case, Eq. (5.18). The solid curve corresponds to the RHS, and the three lines correspond to the LHS for subcritical, critical and supercritical temperature values, respectively.
Figure 5.2: Demonstration of the cusp bifurcation in the space of parameters $a$ and $b$ for symmetric equilibria.
Figure 5.3: (Color online) Graphical representation of the general rest-point equation for two different values of $c$; intersections represent rest points.
Figure 5.4: (Color online) Characterization of different games in the parameter space with $a, c > 0$. The dark blue region corresponds to games that can have only a single rest point, whereas games in the light grey regions can have three rest points. The shaded grey square corresponds to games that have three Nash equilibria.
Figure 5.5: (Color online) Illustration of the dynamical flow for a system with three (a) and a single (b) rest point. Note that the middle rest point in (a) is unstable.
Figure 5.6: Examples of reward matrices for typical two-action games.
Figure 5.7: (Color online) Bifurcation diagram of the rest points for $T_X = T_Y = T$: (a) disconnected pitchfork, with mixed NE $(x^*, y^*) = (2/5, 2/5)$; (b) continuous pitchfork, with mixed NE $(x^*, y^*) = (1/3, 2/3)$.
Figure 5.8: (Color online) Bifurcation in the domain of games with $a, c > 0$, $b/a > 0$, $-1/2 > d/c > -1$. In this example $d/c = -0.8$ and $b/a = 0.1$. (a) Rest-point structure plotted against $T_X$ for $T_Y < T_Y^c$ and $T_Y > T_Y^c$. (b) Rest-point structure plotted against $T_Y$ for $T_X^{c-} < T_X < T_X^{c+}(T_Y)$. In both graphs, the red dot-dashed lines correspond to the unstable rest points.
Figure 6.1: (Color online) Dynamics of the matching pennies game with $r = 2$ and initial strategy $x_0 = y_0 = 0.3$. (A) For $c = 0$, the average strategy oscillates at a fixed distance from the Nash equilibrium. (B) When a turnover is introduced, the average strategy converges to the turnover equilibrium; here $c_x = c_y = 1$. (C) Expected payoff for population one in the turnover equilibrium for varying turnovers. Above a critical value of $c_y = 2$, the first population wins most games.
Figure 6.2: (Color online) Dynamics of a coordination game with initial strategy $(x_0, y_0) = (0.9, 0.1)$. (A) Without turnover, the game has one saddle-point Nash equilibrium with stable manifolds that separate the basins of attraction of two pure Nash equilibria. (B-C) When the initial strategy goes from one basin of attraction to another, the resulting equilibrium state changes discontinuously; here $c_y = 2$ and $c_x = 0.7$ and $0.72$, respectively. (D-E) At a critical set of turnover rates, two turnover equilibria annihilate in a saddle-node bifurcation; here $c_y = 2$ and $c_x = 0.91$ and $0.92$, respectively. (F) Bifurcation diagram for turnover equilibria. The point $(x, y) = (0.4, 0.9)$ is always an equilibrium. We observe a transcritical bifurcation at $c_x \approx 0.4$ and the saddle-node bifurcation from panels D-E at $c_x \approx 0.9$. (G) Expected payoff of population one in the turnover equilibrium for varying turnovers. The dramatic change in payoffs between panels B and C can clearly be seen.
Figure 7.1: Categorization of two-action games based on the reward matrix structure, i.e., the $(a, b)$ space.
Figure 7.2: Examples of reward matrices for typical two-action games.
Figure 7.3: 3-player Nash equilibrium for the Prisoner's Dilemma and the Coordination game; see the text for more details.
Figure 7.4: 3-player stable learning rest points: (a) Prisoner's Dilemma, (b) Coordination game.
Figure 7.5: Observed stable configurations of co-evolutionary dynamics for $T = 0$.
Figure 7.6: (a) Possible network configurations for the three-player PD (Fig. 7.2); (b) bifurcation diagram for varying temperature. The two blue solid lines correspond to the configurations with one isolated agent and one central agent. The symmetric network configuration is unstable at low temperature (red line) and becomes globally stable above a critical temperature.
Figure 7.7: Domain of stable homogeneous equilibria (dark grey) for the Coordination game example of Fig. 7.2. The top panel shows the bifurcation of strategy $p$ versus $T$ for the three-player game. The bottom panel shows the stable homogeneous domain for players choosing the first action (smaller grey area) and the second action (larger grey area). Here the critical temperature is $T_c = 0.36$.
Figure 8.1: Graphical illustration of the multi-rest-point equation for a game with a single NE. Here $a, c > 0$, $b/a = 1/2$, $d/c = -3/4$.
Figure 9.1: The number of turnover equilibria in a two-action two-agent game can be determined by the number of intersections of the left- and right-hand sides of Eq. (9.2). (A) When $ab < 0$ the two functions have opposite slopes and hence intersect exactly once in the relevant interval $0 < y < 1$; here the matching pennies game is illustrated with the parameters of Fig. 6.1B. (B-C) When $ab > 0$ the slopes of the functions have the same sign, and new turnover equilibria can appear through bifurcations. Here the bifurcation in the coordination game of Fig. 6.2 is shown, except that $c_x$ goes from 0.8 to 1 between panels B and C.
Figure 10.1: Demonstration of strategy bifurcation in the space of parameters $a$, $b$, $T$ and $n$ for homogeneous equilibria.

Part I: Introduction

Chapter 1 OVERVIEW

1.1 Motivation

In this study we consider two types of learning in normal-form games that produce outcomes deviating from the Nash equilibrium of the corresponding game. We first consider Boltzmann reinforcement learning. Next, we study learning in which inexperienced players shift the learning outcome away from the Nash equilibrium. Finally, we define a game in which players choose their strategy and their playmates simultaneously, so that the network and the strategies of the players evolve together.

1.2 Reinforcement learning to reach beyond Nash Equilibrium

Reinforcement Learning (RL) [52] is a powerful framework that allows an agent to behave near-optimally through a trial-and-error exploration of the environment. Although originally developed for single-agent settings, RL approaches have been extended to scenarios where multiple agents learn concurrently by interacting with each other. The main difficulty in multi-agent learning is that, due to the mutual adaptation of the agents, the stationarity condition of the single-agent learning environment is violated. Instead, each agent learns in a time-varying environment induced by the learning dynamics of the other agents. Although in general multi-agent RL does not have any formal convergence guarantees (except in certain settings), it is known to often work well in practice.

1.3 Naive players in the population during learning

Perhaps the most important skill for competing agents is the ability to adapt to a changing environment by constantly assessing and modifying their behavior. Consequently, while game theory has initially focused mostly on the study of equilibria [38, 57], the study of adaptive dynamics has acquired more and more relevance in recent years. Various models have been put forward to capture the learning processes of competing individuals and the resulting evolution of the population [4, 29, 58]. A key result is that many agent-based algorithms of adaptive dynamics allow for a simple macroscopic description in terms of replicator equations [11, 22, 40, 54]. In the replicator dynamics, a large population of individuals participates in a game. Let us call $x_i$ the fraction of the population playing a given strategy, where the admissible strategies are labeled $i = 1, 2, 3, \dots$. The time evolution of these fractions is given by

$$\frac{d}{dt} x_i = x_i \left( \pi_i(x) - \bar{\pi}(x) \right), \qquad (1.1)$$

where $\pi_i(x)$ is the frequency-dependent payoff of strategy $i$, and $\bar{\pi} = \sum_i x_i \pi_i$ is the average payoff.
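As a concrete illustration of Eq. (1.1), the following minimal Python sketch integrates the replicator dynamics with a simple forward-Euler scheme. The rock-paper-scissors payoff matrix, the initial condition and the step size are illustrative assumptions made for the example; they are not taken from the thesis.

import numpy as np

def replicator_step(x, payoff, dt=0.001):
    # One forward-Euler step of dx_i/dt = x_i * (pi_i(x) - pi_bar(x)), Eq. (1.1)
    pi = payoff @ x           # pi_i(x): expected payoff of strategy i
    pi_bar = x @ pi           # average payoff in the population
    x = x + dt * x * (pi - pi_bar)
    return x / x.sum()        # re-normalize to stay on the simplex

# Illustrative payoff matrix: rock-paper-scissors
A = np.array([[0.0, -1.0,  1.0],
              [1.0,  0.0, -1.0],
              [-1.0, 1.0,  0.0]])

x = np.array([0.5, 0.3, 0.2])   # prior distribution x_i^0 over strategies
for _ in range(10_000):
    x = replicator_step(x, A)
# For this zero-sum example the exact dynamics cycles around (1/3, 1/3, 1/3);
# the crude Euler scheme slowly drifts off these closed orbits.
print("strategy fractions:", x)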
In this setting, the initial condition $x_i^0$ represents a prior distribution of strategy preferences, before the adaptation process takes place. Approaches based on Eq. (1.1) have proven successful in describing systems within biology as well as economics [35, 40, 48]. It can easily be shown that stable equilibria of the replicator dynamics correspond to Nash equilibria, where no individual can benefit from changing strategy unilaterally [22, 39].

Usually, in adaptive dynamics, one has in mind a fixed population of players that acquire experience over time, or biological populations, where offspring inherit strategies from their parents. However, one can think of a number of concrete examples where games are played in a more open setting, with the possibility for players to leave the game and be replaced by less experienced ones. In the market, new companies are founded while old companies collapse. Within companies, experienced employees retire so that young graduates may start their careers. On general grounds, one should expect turnover to have a profound impact on the dynamics of the game and to lead to a rich phenomenology.

In standard adaptive dynamics all individuals within the population have the same degree of experience. Conversely, here each agent sees a non-trivial mixture of players with different experience levels. While the Nash equilibrium strategy will be optimal against very experienced players, it would not necessarily be the most effective way to exploit naive newcomers. Therefore, one can expect adaptive dynamics in this case to converge to equilibria that are different from Nash equilibria and that are crucially affected by the rate of turnover, which in turn determines the steady-state structure of the population both in terms of experience and strategies. In such a framework, interesting information can be obtained by dissecting the experience composition of each strategy, for example to assess which experience classes are performing better in the game in terms of payoff.

We demonstrate how taking into account the turnover of players leads macroscopically to a novel variant of the replicator dynamics. We present a derivation of the replicator equation with turnover (Eq. 6.5) in Section 6.1. In the remainder of this chapter, we apply this equation to analyze the effect of turnover in simple evolutionary games. We begin with the simple paradigmatic cases of the rock-paper-scissors game and the set of two-action games played between two different populations. In the latter case, we show how increasing the turnover rate can lead to abrupt changes in the equilibrium state caused by bifurcations in the corresponding dynamical system. We conclude by showing how this approach can provide an interpretation for the observed bid distribution in online lowest unique bid auctions.

1.4 Co-evolving network topology and player strategies

Networks depict complex systems where nodes correspond to entities and links encode interdependencies between them. Generally, dynamics in networks is introduced via two different approaches. In the first approach, the links are assumed to be static, while the nodes are endowed with internal dynamics (epidemic spreading, opinion formation, signaling, synchronization, and so on). In the second approach, nodes are treated as passive elements, and the main focus is on the evolution of the network topology.
To describe the coupled dynamics of individual attributes and network topology, here we suggest a simple model of a co-evolving network that is based on the notion of interacting adaptive agents. Specifically, we propose network-augmented multi-agent systems where the agents play repeated games with their neighbors, and adapt both their behaviors and the network ties depending on the outcome of their interactions. To adapt, the agents use a simple learning mechanism to reinforce (penalize) behaviors and network links that produce favorable (unfavorable) outcomes. Furthermore, the agents use an action selection mechanism that allows them to control the exploration/exploitation tradeoff via a temperature-like parameter.

1.5 Related Works

1.5.1 Boltzmann Q-learning

Recently, a number of authors have addressed the issue of multi-agent learning from the perspective of dynamical systems [2, 7, 49]. For instance, it has been noted that for stateless Q-learning with Boltzmann action selection, the dynamics of agent strategies can be described by (bi-matrix) replicator equations from population biology [25], with an additional term that accounts for the exploration [44, 46, 56]. A similar approach for analyzing learning dynamics with an $\epsilon$-greedy exploration mechanism¹ was developed in [18, 62].

Most existing approaches so far have focused on numerical integration or simulation methods for understanding the dynamical behavior of learning systems. Recently, [62] provided a full categorization of $\epsilon$-greedy Q-learning dynamics in two-player two-action games using analytical insights from hybrid dynamical systems. A similar classification for Boltzmann Q-learning, however, is lacking. On the other hand, a growing body of recent neurophysiological studies indicates that Boltzmann-type softmax action selection might be a plausible mechanism for understanding decision making in primates. For instance, experiments with monkeys playing a competitive game indicate that their decision making is consistent with softmax value-based reinforcement learning [31, 32]. It has also been observed that in certain observational learning tasks humans seem to follow a softmax reinforcement learning scheme [3]. Thus, understanding softmax learning dynamics and its possible spectrum of behaviors is important both conceptually and for making concrete predictions about different learning outcomes.

¹The $\epsilon$-greedy Q-learning scheme selects the action with the highest Q-value with probability $(1-\epsilon) + \epsilon/n$ and each other action with probability $\epsilon/n$, where $n$ is the number of actions.

1.5.2 Dynamics with naive players

It is known that a fraction of non-rational, influenceable players may change the game dynamics considerably [21, 34, 41]. In a recent study [9], a population of fish was made up of three subgroups: a large group of fish with a small preference for going to one place in the aquarium, a smaller group with a strong preference for another place, and a group of untrained fish with no prior preference. The study showed, both through simulations and experiments, that there exists a critical size of the untrained group above which the entire population goes to the place preferred by group one, and below which the minority of group two dictates where the population goes. This resembles our results for two-agent two-action coordination games, where a critical turnover of agents changes the turnover equilibrium from being dominated by one strategy to the other.
Another example comes from the idea of "cognitive hierarchy" [5, 8, 61] in human learning, where one analyzes games in which players have different "depths" of adaptation. In our case, a similar scenario emerges naturally from the dynamical equilibrium between adaptation and turnover.

1.5.3 Co-evolving dynamics

More recently, it has been suggested that separating individual and network dynamics fails to capture the realistic behavior of networks. Indeed, in most real-world networks both the attributes of individuals (nodes) and the topology of the network (links) evolve in tandem. Models of such adaptive co-evolving networks have attracted significant interest in recent years [6, 10, 19, 20, 42, 43, 64].

Chapter 2 RESEARCH PROBLEMS DEFINITION

This research aims to answer two kinds of problems. In the first part of this study, we investigate learning in two-action two-player games. We focus on two extensions of the replicator dynamics. First, we consider replicator dynamics with an entropy term. Second, we model the effect of naive players in a population and the stability of the fixed points of the dynamics. The questions to be answered are:

- What is the effect of noise on the learning dynamics (Boltzmann Q-learning) in two-action two-player games?
- In a population with naive (inexperienced) players, what is the effect of the naive players on the outcome of learning in normal-form two-action two-player games?

In the second part of this work, we adopt the first type of learning, Boltzmann Q-learning, which describes bounded-rational players, and study the co-evolution of bounded-rational players on a network. The challenges to be answered are:

- How can the mutually evolving behavior of bounded-rational players on a network be modeled within the framework of strategic interaction?
- Considering the simplest network, with three players, what is the effect of noise on the outcome of the strategies and the network topology?
- For an N-player game, what is the outcome of the strategies and the network topology under the replicator dynamics?

Chapter 3 CONTRIBUTIONS AND OUTLINE

Here we use analytical techniques to provide a complete characterization of Boltzmann Q-learning in two-player two-action games, in terms of their convergence properties and rest-point structure. In particular, it is shown that for any finite (non-zero) exploration rate, the learning dynamics necessarily converges to an interior rest point. This seems to be in contrast with a previous observation [55], where we believe the authors have confused slow convergence with limit cycles. Furthermore, none of the studies so far have systematically examined the impact of exploration, i.e., noise, on the learning dynamics and its asymptotic behavior. On the other hand, noise is believed to be an inherent aspect of learning in humans and animals, either due to softmax selection mechanisms [27] or random perturbations in agent utilities [24]. Here we provide such an analysis, and show that depending on the game, there can be one, two, or three rest points, with a bifurcation between different rest-point structures as one varies the exploration rate. In particular, there is a critical exploration rate above which there remains only one rest point, which is globally stable.

Next, we propose an approach to learning dynamics, based on a novel variant of the replicator equations, that includes turnover of individuals and the resulting differences in experience within a population.
After introducing the model, we analyze the effect of turnover in simple evolutionary games: the rock-paper-scissors game and the set of two-action games played between two different populations. We conclude by showing how this approach can provide an interpretation for the observed bid distribution in online lowest unique bid auctions.

To describe the coupled dynamics of individual attributes and network topology, we suggest a simple model of a co-evolving network that is based on the notion of interacting adaptive agents. Specifically, we propose network-augmented multi-agent systems where the agents play repeated games with their neighbors, and adapt both their behaviors and the network ties depending on the outcome of their interactions. To adapt, the agents use a simple learning mechanism to reinforce (penalize) behaviors and network links that produce favorable (unfavorable) outcomes. Furthermore, the agents use an action selection mechanism that allows them to control the exploration/exploitation tradeoff via a temperature-like parameter.

We demonstrate that the collective evolution of such a system can be described by appropriately defined replicator dynamics equations. Originally suggested in the context of evolutionary game theory (e.g., see [25, 26]), replicator equations have been used to model collective learning in systems of interacting self-interested agents [46]. Here we provide a generalization where the agents adapt not only their strategies (the probability of selecting a certain action) but also their network structure (the set of other agents they play against). Specifically, we derive a system of coupled non-linear equations that describe the simultaneous evolution of agent strategies and network topology.

We provide a comprehensive analysis of three-player two-action games, which are the simplest systems that exhibit non-trivial structural dynamics. In particular, we characterize the rest points and their stability properties in these games, and study the effect of exploration on the learning outcomes. Our results indicate that there is a critical exploration rate above which the uniformly connected network is a globally stable outcome of the learning dynamics. A similar transition is shown for general N-player two-action games. We also establish that for sub-critical exploration rates, the resulting networks are composed of star motifs.

The rest of this PhD dissertation is organized as follows. We next describe the connection between Boltzmann Q-learning and replicator dynamics, and elaborate on the non-conservative nature of the dynamics for any finite exploration rate. In Section 5.2 we analyze the asymptotic behavior of the learning dynamics as a function of the exploration rates for different game types. In Section 5.3 we illustrate our findings on several examples. Then we introduce the learning dynamics, an extension of the replicator dynamics, in the presence of naive players in the population, and study the dynamics for simple two-action games. We next derive the replicator equations characterizing the co-evolution of network structure and agent strategies. In Section 7.2 we focus on learning without exploration, describe the Nash equilibria of the game, and characterize the rest points of the learning dynamics according to their stability properties. We consider the impact of exploration on learning in Section 7.3.
Part II: Background

Chapter 4 GAME DEFINITION AND LEARNING IN GAMES

Game theory studies the interaction between self-interested players. In a game, players can either take actions sequentially, which is called an extensive game, or take actions simultaneously, which is called a normal game. In this study we focus on normal games. Evolutionary game theory describes game models in which players' strategies evolve by a trial-and-error process, where players learn to adopt strategies with better payoffs. In the following, we review the concepts of games and evolutionary dynamics that are used in this work.

4.1 Normal game definition

In this work we focus on games where the players choose actions simultaneously and at each turn of the game they receive their payoffs. A normal game is a tuple $\Gamma = (N, A, R)$ where

- $N = \{1, \dots, n\}$ is the set of agents in the game,
- $A = A_1 \times A_2 \times \dots \times A_n$ is the joint set of actions,
- $R = (R_1, R_2, \dots, R_n)$ is the payoff function, with $R_i : A \to \mathbb{R}$.

Every such game has at least one Nash equilibrium, and possibly more than one. There are two types of Nash Equilibrium (NE), pure and mixed, which we define in the following section.

4.1.1 Nash Equilibrium of a normal game

Let $a_i \in A_i$ denote the strategy of agent $i$. A strategy profile $(a_i^*, a_{-i}^*)$ is a (pure) Nash equilibrium if

$$u_i(a_i^*, a_{-i}^*) \ge u_i(a_i, a_{-i}^*) \qquad (4.1)$$

for all $i \in N$ and all $a_i \in A_i$, where $-i$ denotes the complementary set of players (all players other than $i$). We denote the set of pure Nash equilibria by $A^*$. If the inequality in (4.1) is strict, the strategy profile is called a strict Nash equilibrium.

So far we have talked about actions (pure strategies). Players can also adopt mixed strategies. A mixed strategy in a normal game is a probability distribution over actions. Let $\Delta(A_i)$ denote the set of probability distributions over $A_i$; we identify this set with the simplex $\{\alpha_i : \sum_k \alpha_i^k = 1\}$. We refer to $\alpha_i \in \Delta(A_i)$ as a mixed strategy, while the members of $A_i$ are pure strategies (or actions). A mixed Nash equilibrium is then defined by

$$u_i(\alpha_i^*, \alpha_{-i}^*) \ge u_i(\alpha_i, \alpha_{-i}^*) \qquad (4.2)$$

for all $i \in N$ and all $\alpha_i \in \Delta(A_i)$.

4.2 Replicator Dynamics

The field of evolutionary game theory (EGT) was introduced in 1973 by Maynard Smith [36]. EGT has since found wide applicability in evolutionary science, ecological population dynamics and molecular kinetics, as well as in theoretical biology and economic modeling. The replicator dynamics is one of the main learning mechanisms in EGT. In general it has the form

$$\frac{\dot{x}_i}{x_i} = \pi_i - \bar{\pi}, \qquad (4.3)$$

where $\pi_i$ is the expected reward of choosing action $i$ (species $i$ in ecology) and $\bar{\pi}$ is the expected average reward. In a normal game where the players' reward matrix is $A = (a_{ij})$, $(Ax)_i$ is the expected payoff of type $i$ and the average expected payoff is $x^T A x$, so Eq. (4.3) becomes

$$\frac{\dot{x}_i}{x_i} = (Ax)_i - x \cdot Ax. \qquad (4.4)$$

This is a cubic (order-3 polynomial) dynamical system. For a game with $n+1$ actions the solutions stay on an $n$-dimensional simplex. The main idea behind the dynamics is that the rate of choosing a strategy is proportional to its expected benefit. The properties of this dynamics are studied in [25, 47, 63]. The stable solutions of (4.4) are strict Nash equilibria of the game [26].

When there are two types of populations (with no self-interactions) whose interaction is symmetric, the replicator dynamics of the game takes the form

$$\frac{\dot{x}_i}{x_i} = (Ay)_i - x \cdot Ay, \qquad i = 1, \dots, n, \qquad (4.5)$$
$$\frac{\dot{y}_j}{y_j} = (Bx)_j - y \cdot Bx, \qquad j = 1, \dots, m. \qquad (4.6)$$

This dynamics can exhibit very rich behavior even in games with a small number of actions.
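A minimal numerical sketch of the two-population dynamics (4.5)-(4.6) is given below, assuming an illustrative matching-pennies-like pair of payoff matrices with $A = -B^T$; the matrices, initial conditions and step size are assumptions made for the example.

import numpy as np

def two_population_step(x, y, A, B, dt=0.01):
    # Forward-Euler step of the bimatrix replicator equations (4.5)-(4.6)
    fx = A @ y                     # (Ay)_i : payoffs to population-one types
    fy = B @ x                     # (Bx)_j : payoffs to population-two types
    x = x + dt * x * (fx - x @ fx)
    y = y + dt * y * (fy - y @ fy)
    return x / x.sum(), y / y.sum()

# Illustrative zero-sum (matching-pennies-like) payoffs
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A.T

x = np.array([0.8, 0.2])
y = np.array([0.4, 0.6])
for _ in range(20_000):
    x, y = two_population_step(x, y, A, B)
# The exact dynamics cycles around the mixed equilibrium (1/2, 1/2);
# the crude Euler scheme slowly drifts off these closed orbits.
print(x, y)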
The dynamics (4.5)-(4.6) preserves phase-space volume [23], and if $A = -cB^T$ for some $c > 0$ it can be regarded as a Hamiltonian system. It was shown in [45] that for a Rock-Paper-Scissors game this dynamics can exhibit chaotic behavior.

4.3 Quantal Response Equilibria (QRE) and logit equilibrium

Recent theoretical advances in game theory try to relax the classical assumption of full rationality of the players. One of the most well-known extensions of the Nash equilibrium is the Quantal Response Equilibrium [37].

Definition. In a game $\Gamma = (N, A, R)$, a Quantal Response Equilibrium is any strategy profile such that for every player $a$ the probability of pure strategy $i$ is $x_i^a = f^a(r_i)$, where $r_i$ is the expected payoff of strategy $i$.

Here $f^a$ is called the quantal response function (or reaction function); it is continuous and monotonically increasing in $r_i$. One particular class of quantal response functions is the logistic quantal response, of the form

$$x_i = \frac{e^{r_i/T}}{\sum_j e^{r_j/T}}. \qquad (4.7)$$

It has been proved that as $T \to 0$ the logit equilibrium approaches a Nash equilibrium of the game, whereas as $T \to \infty$ the players mix evenly between the different pure actions.

4.4 Logit equilibrium as a fixed point of learning dynamics

A pivotal research agenda is the interpretation of experimental results that deviate from the Nash equilibrium predictions by viewing equilibrium as the outcome of a learning process. The learning model closest to the evolutionary approach is reinforcement learning (RL), which is based on the psychological insight that successful strategies are reinforced and used more frequently. We focus on this type of learning model with a Boltzmann update rule.

4.4.1 Perturbed (smoothed) best response dynamics

Consider a population of players who update their strategy by choosing the best reply to the mean population strategy. This yields the best response (BR) dynamics [16]:

$$\dot{X} \in BR(X) - X. \qquad (4.8)$$

The Boltzmann function can be used to approximate the best-response correspondence, giving a learning dynamics known as the logit dynamics [13, 26]:

$$\dot{x}_i = \frac{e^{u_i/T}}{\sum_j e^{u_j/T}} - x_i, \qquad (4.9)$$

where the temperature $T$ is a positive number. As $T \to 0$ the solutions of the logit dynamics approach Nash equilibria.

4.4.2 Stochastic evolutionary game theory

This approach was introduced by [17]. Consider a population in which an individual playing strategy $x(t)$ at time $t$ earns an expected payoff $\pi(x(t), t)$, and let $F(x, t)$ denote the population distribution of strategies. We adopt a directional learning model in which strategies drift in the direction of the expected payoff gradient $\pi'(x, t)$ and are perturbed by noise whose intensity is set by $\sigma$. It can be shown that this process leads to a differential equation for the population strategy distribution,

$$\frac{\partial F(x,t)}{\partial t} = -\pi'(x,t)\, f(x,t) + \mu\, f'(x,t), \qquad (4.10)$$

where $\mu = \sigma^2/2$ represents the noise during learning, and $f$ and $f'$ are the population strategy density and its slope. This is a Fokker-Planck equation. In a steady state we have $dF/dt = 0$, which yields the differential equation $f'(x)/f(x) = \pi'(x)/\mu$. After integration, $f(x) = k \exp(\pi(x)/\mu)$, which in the continuous limit is exactly the logit equilibrium (LQRE).
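As a small worked illustration of the logit equilibrium of Eq. (4.7), the sketch below computes a QRE of a two-player two-action game by damped fixed-point iteration of the logistic response map. The payoff matrices, damping factor and temperatures are illustrative assumptions, not values used in the thesis.

import numpy as np

def logit_response(r, T):
    # Logistic quantal response x_i proportional to exp(r_i / T), Eq. (4.7)
    z = np.exp((r - r.max()) / T)     # subtract max for numerical stability
    return z / z.sum()

def logit_equilibrium(A, B, T, steps=5000, damping=0.1):
    # Iterate both players' logit responses until the profile stops moving
    x = np.full(A.shape[0], 1.0 / A.shape[0])
    y = np.full(B.shape[0], 1.0 / B.shape[0])
    for _ in range(steps):
        x_new = logit_response(A @ y, T)
        y_new = logit_response(B @ x, T)
        x = (1 - damping) * x + damping * x_new
        y = (1 - damping) * y + damping * y_new
    return x, y

# Illustrative symmetric coordination-game payoffs
A = np.array([[5.0, 0.0], [3.0, 1.0]])
B = A.copy()                  # symmetric game: player two's payoffs mirror A
for T in (5.0, 1.0, 0.1):     # as T -> 0 the profile approaches a Nash equilibrium
    print(T, logit_equilibrium(A, B, T))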
Part III: Bifurcation in population by learning in games

Chapter 5 BOLTZMANN Q-LEARNING DYNAMICS FOR TWO-ACTION TWO-PLAYER GAMES

5.1 Dynamics of Q-Learning

Here we provide a brief review of the Q-learning algorithm and its connection with the replicator dynamics.

5.1.1 Single Agent Learning

In Reinforcement Learning (RL) [52], agents learn to behave near-optimally through repeated interactions with the environment. At each step of interaction with the environment, the agent chooses an action based on the current state of the environment, and receives a scalar reinforcement signal, or reward, for that action. The agent's overall goal is to learn to act in a way that will increase the long-term cumulative reward.

Among many different implementations of the above adaptation mechanism, here we consider so-called Q-learning [59], where the agent's strategies are parameterized through Q-functions that characterize the relative utility of a particular action. Those Q-functions are updated during the course of the agent's interaction with the environment, so that actions that yield high rewards are reinforced. To be more specific, assume that the agent has a finite number of available actions, $i = 1, 2, \dots, n$, and let $Q_i(t)$ denote the Q-value of the corresponding action at time $t$. Then, after selecting action $i$ at time $t$, the corresponding Q-value is updated according to

$$Q_i(t+1) = Q_i(t) + \alpha\,[\,r_i(t) - Q_i(t)\,], \qquad (5.1)$$

where $r_i(t)$ is the observed reward for action $i$ at time $t$, and $\alpha$ is the learning rate.

Next, we need to specify how the agent selects actions. Greedy selection, where the action with the highest Q-value is always selected, might generally lead to a globally suboptimal solution. Thus, one needs to incorporate some way of exploring less optimal strategies. Here we focus on the Boltzmann action selection mechanism, where the probability $x_i$ of selecting action $i$ is given by

$$x_i(t) = \frac{e^{Q_i(t)/T}}{\sum_k e^{Q_k(t)/T}}, \qquad i = 1, 2, \dots, n, \qquad (5.2)$$

where the temperature $T > 0$ controls the exploration/exploitation tradeoff: for $T \to 0$ the agent always acts greedily and chooses the strategy corresponding to the maximum Q-value (pure exploitation), whereas for $T \to \infty$ the agent's strategy is completely random (pure exploration).

We are interested in the continuous-time limit of the above learning scheme. Toward this end, we divide time into intervals $\delta t$, replace $t+1$ with $t + \delta t$ and $\alpha$ with $\alpha\,\delta t$. Next, we assume that within each interval $\delta t$ the agent samples his actions, calculates the average reward $r_i$ for action $i$, and applies Eq. (5.1) at the end of each interval to update the Q-values.¹

¹In the terminology of reinforcement learning, this corresponds to off-policy learning, as opposed to on-policy learning, where one uses Eq. (5.2) and Eq. (5.1) concurrently to sample actions and update the Q-values of those actions, respectively (e.g., see [52]). A potential issue with the latter scheme is that actions that are played rarely will be updated rarely, which might be problematic for the convergence of the algorithm. A possible remedy is to normalize each update by the frequency of the corresponding action [33, 52], which can be shown to lead to the same dynamics, Eq. (5.3), in the continuous-time limit.
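For concreteness, here is a minimal single-agent sketch of the update (5.1) combined with Boltzmann action selection (5.2) on a two-armed bandit. The bandit's mean rewards, the noise level, the learning rate and the temperature are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def boltzmann(Q, T):
    # Action probabilities x_i proportional to exp(Q_i / T), Eq. (5.2)
    z = np.exp((Q - Q.max()) / T)
    return z / z.sum()

# Illustrative two-armed bandit: mean rewards of the two actions
mean_reward = np.array([1.0, 0.5])
Q = np.zeros(2)             # Q-values of the two actions
alpha, T = 0.1, 0.2         # learning rate and exploration temperature

for t in range(5000):
    x = boltzmann(Q, T)                       # Eq. (5.2)
    i = rng.choice(2, p=x)                    # sample an action
    r = mean_reward[i] + 0.1 * rng.normal()   # noisy reward r_i(t)
    Q[i] += alpha * (r - Q[i])                # Eq. (5.1)

# The resulting strategy is close to the Gibbs distribution over the mean rewards.
print(boltzmann(Q, T))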
The steady state strategy profile, x s i , if it exists, can be found from equating the right hand side to zero, which can be shown to yield x s i = e r i =T å k e r k =T : (5.5) We would like to emphasize that x s i corresponds to the so called Gibbs distribution for a statistical– mechanical system with energyr i at temperature T . Indeed, it can be shown that the above replicator dynamics minimizes the following function resembling free energy: F[x]= å k r k x k + T å k x k lnx k (5.6) each update amount by the frequency of corresponding action [33, 52], which can be shown to lead to the same dynamics Eq. 5.3 in the continuous time limit. 23 where we have denoted x=(x 1 ;;x n ), å n i=1 x i = 1. Note that the minimizing the first term is equivalent to maximizing the expected reward, whereas minimizing the second term means maximizing the entropy of the agent strategy. The relative importance of those terms is regulated by the choice of the temperature T . We note that recently a free energy minimization principle has been suggested as a framework for modeling perception and learning (see [12] for a review of the approach and its relation to several other neurobiological theories). 5.1.2 Two-agent learning Let us now assume there are two agents that are learning concurrently, so that the rewards received by the agents depend on their joint action. The generalization to this case is introduced via game- theoretical ideas [26]. More specifically, let A and B be the two payoff matrices: a i j (b i j ) is the reward of the first (second) agent when he selects i and the second (first) agent selects j. Furthermore, let y=(y 1 ;;y n ),å n i=1 y i = 1, be the strategy of the second agent. The expected rewards of the agents for selecting action i are as follows: r x i = å n j=1 a i j y j ; r y i = å n j=1 b i j x j (5.7) The learning dynamics in two-agent scenario case is obtained from Eq. 5.4 by replacing r i with r x i and r y i for the first and second agents, respectively, which yields ˙ x i = x i [(Ay) i x Ay+ T Xå j x j ln(x j =x i )] (5.8) ˙ y i = y i [(Bx) i y Bx+ T Yå j y j ln(y j =y i )] (5.9) 24 where (Ay) i is the i element of the vector Ay, and we assume that the exploration rates T X and T Y of the agents can generally be different. This system (without the exploration term) is known as bi–matrix replicator equation [23, 26]. Its relation to multi–agent learning has been examined in [1, 14, 45, 46, 56]. Before proceeding further, we elaborate on the connection between the rest-points of the replicator system Eqs. 5.8, 5.9, and the game-theoretic notion of Nash Equilibrium (NE), which is a central concept in game theory. Recall that a joint strategy profile(x ;y ) is called NE if no agent can increase his expected reward by unilaterally deviating from the equilibrium. It is known that for T X = T Y = 0, all the NE of a game are also rest-points of the dynamics [26]. The opposite is not true – not all the rest points correspond to NE. Furthermore, some NE might correspond to unstable rest points of the dynamics, which means that they cannot be achieved by the learning process. For any finite T X ;T Y > 0, the rest points will be generally different from the NE of the game. In the limit T X ;T Y !¥, agents are insensitive to the rewards and mix uniformly over the actions. In this work we study the behavior of the learning dynamics in the intermediate range of exploration rates. 5.1.3 Exploration causes dissipation It is known that for T X = T Y = 0 the system of Eqs. 
5.1.3 Exploration causes dissipation

It is known that for $T_X = T_Y = 0$ the system of Eqs. (5.8)-(5.9) is conservative [23, 26], so that the total phase-space volume is preserved. It can be shown, however, that any finite exploration rate $T_X, T_Y > 0$ makes the system dissipative, or volume contracting [46]. While this fact might not be crucial in high-dimensional dynamical systems, its implications for low-dimensional systems, and specifically for the two-dimensional dynamical system considered here, are crucial. Namely, the finite dissipation rate means that the system cannot have any limit cycles, and the only possible asymptotic behavior is convergence to a rest point. Furthermore, in the situation where there is only one interior rest point, it is guaranteed to be globally stable.

To demonstrate the dissipative nature of the system for $T_X, T_Y > 0$, it is useful to make the following transformation of variables:

$$u_k = \ln\frac{x_{k+1}}{x_1}, \qquad v_k = \ln\frac{y_{k+1}}{y_1}, \qquad k = 1, 2, \dots, n-1. \qquad (5.10)$$

The replicator system in the modified variables reads [23, 46]

$$\dot{u}_k = \frac{\sum_j \tilde{a}_{kj}\, e^{v_j}}{1 + \sum_j e^{v_j}} - T_X u_k, \qquad \dot{v}_k = \frac{\sum_j \tilde{b}_{kj}\, e^{u_j}}{1 + \sum_j e^{u_j}} - T_Y v_k, \qquad (5.11)$$

where

$$\tilde{a}_{kj} = a_{k+1,\,j+1} - a_{1,\,j+1}, \qquad \tilde{b}_{kj} = b_{k+1,\,j+1} - b_{1,\,j+1}. \qquad (5.12)$$

Let us recall the Liouville formula: if $\dot{z} = F(z)$ is defined on an open set $U$ in $\mathbb{R}^n$ and $G \subset U$ evolves under the flow as $G(t) = \{z(t) : z \in G\}$ with volume $V(t)$, then the rate of change of the volume $V$ containing the set of points $G$ in phase space is proportional to the divergence of $F$ [25]. Consulting Eqs. (5.11), we observe that the dissipation rate is given by [46]

$$\sum_k \left( \frac{\partial \dot{u}_k}{\partial u_k} + \frac{\partial \dot{v}_k}{\partial v_k} \right) = -(T_X + T_Y)(n-1) < 0. \qquad (5.13)$$

As mentioned above, the dissipative nature of the dynamics has important implications for the two-action games that we consider next.

5.1.4 Two-action games

Let us consider two-action games, and let $x$ and $y$ denote the probabilities of selecting the first action by the first and second agent, respectively. Then the learning dynamics Eqs. (5.8)-(5.9) attains the form

$$\frac{\dot{x}}{x(1-x)} = (a y + b) - \ln\frac{x}{1-x}, \qquad (5.14)$$
$$\frac{\dot{y}}{y(1-y)} = (c x + d) - \ln\frac{y}{1-y}, \qquad (5.15)$$

where we have introduced

$$a = \frac{a_{11} + a_{22} - a_{12} - a_{21}}{T_X}, \qquad b = \frac{a_{12} - a_{22}}{T_X}, \qquad (5.16)$$
$$c = \frac{b_{11} + b_{22} - b_{12} - b_{21}}{T_Y}, \qquad d = \frac{b_{12} - b_{22}}{T_Y}. \qquad (5.17)$$

The vertices of the simplex, $\{x, y\} = \{0, 1\}$, are rest points of the dynamics. For any $T_X, T_Y > 0$, those rest points can be shown to be unstable. This means that any trajectory that starts in the interior of the simplex, $0 < x, y < 1$, will asymptotically converge to an interior rest point. The positions of those rest points are found by nullifying the right-hand sides of Eqs. (5.14)-(5.15). For the remainder of this chapter, we examine the interior rest-point equations in detail.

5.2 Analysis of Interior Rest Points

5.2.1 Symmetric Equilibria

First, we consider the case of symmetric equilibria, $x = y$ and $T_X = T_Y = T$, in which case the interior rest-point equation is

$$a x + b = \ln\frac{x}{1-x}. \qquad (5.18)$$

A graphical representation of Eq. (5.18) is given in Fig. 5.1, where we plot both sides of the equation as functions of $x$. First of all, note that the RHS of Eq. (5.18) is a monotonically increasing function, assuming values in $(-\infty, \infty)$ as $x$ changes over $(0, 1)$. Thus, the equation is always guaranteed to have at least one solution. Further inspection shows that the number of possible rest points depends on the type of the game as well as the temperature $T$.

Figure 5.1: Graphical illustration of the rest-point equation for the symmetric case, Eq. (5.18). The solid curve corresponds to the RHS, and the three lines correspond to the LHS for subcritical, critical and supercritical temperature values, respectively.
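A quick way to see the structure sketched in Fig. 5.1 is to scan Eq. (5.18) for sign changes of the residual $ax + b - \ln[x/(1-x)]$ on a fine grid. The payoff combination below ($a_0 = 5$, $b_0 = -2$, so that $a = a_0/T$ and $b = b_0/T$) is an illustrative symmetric coordination game chosen for the example.

import numpy as np

def rest_point_residual(x, a, b):
    # Residual of the symmetric rest-point equation (5.18): a*x + b - ln(x/(1-x))
    return a * x + b - np.log(x / (1.0 - x))

def count_rest_points(a, b, n=200_001):
    x = np.linspace(1e-9, 1.0 - 1e-9, n)
    f = rest_point_residual(x, a, b)
    return int(np.sum(np.sign(f[:-1]) != np.sign(f[1:])))

# Illustrative symmetric game with a0 = 5, b0 = -2; scan the temperature
for T in (0.2, 0.5, 1.0, 2.0):
    print(T, count_rest_points(5.0 / T, -2.0 / T))
# low T: three rest points; high T: a single rest point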
For instance, there is a single solution whenever $a \le 0$, for which the LHS is a non-increasing function of $x$.

Next, we examine the condition for having more than one rest point, which is possible when $a > 0$. Consulting Fig. 5.1: for sufficiently large temperature, there is only a single solution. When decreasing $T$, however, a second solution appears exactly at the point where the LHS becomes tangential to the RHS. Thus, in addition to Eq. (5.18), at the critical temperature we should have

$$a = \frac{1}{x(1-x)}, \qquad (5.19)$$

or, alternatively,

$$x = \frac{1}{2}\left(1 \pm \sqrt{1 - \frac{4}{a}}\right). \qquad (5.20)$$

Note that the above solution exists only when $a \ge 4$. Plugging Eq. (5.20) into Eq. (5.18), we find

$$b = \ln\frac{a \mp \alpha}{a \pm \alpha} - \frac{1}{2}(a \mp \alpha), \qquad \alpha = \sqrt{a^2 - 4a}. \qquad (5.21)$$

Thus, for any given $a \ge 4$, the rest-point equation has three solutions whenever $b_c^- < b < b_c^+$, where

$$b_c^+ = \ln\frac{a - \alpha}{a + \alpha} - \frac{a - \alpha}{2}, \qquad b_c^- = \ln\frac{a + \alpha}{a - \alpha} - \frac{a + \alpha}{2}. \qquad (5.22)$$

For small values of $T$, when $a$ is sufficiently large (and positive), the two branches $b_c^-$ and $b_c^+$ are well separated. When one increases $T$, however, at some critical value those two branches meet and a cusp bifurcation occurs [51]. The point where the two bifurcation curves meet can be shown to be $(a, b) = (4, -2)$, and is called a cusp point. A saddle-node bifurcation occurs all along the boundary of the region, except at the cusp point, where one has a codimension-2 bifurcation, i.e., two parameters have to be tuned for this type of bifurcation to take place [51]. This boundary in the parameter space is shown in Fig. 5.2.

Figure 5.2: Demonstration of the cusp bifurcation in the space of parameters $a$ and $b$ for symmetric equilibria.
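The boundary shown in Fig. 5.2 can be reproduced directly from Eq. (5.22); a short sketch follows, where the grid of $a$ values is an arbitrary illustrative choice.

import numpy as np

def bistable_boundary(a):
    # Critical intercepts b_c^- and b_c^+ of Eq. (5.22), valid for a >= 4
    alpha = np.sqrt(a * a - 4.0 * a)
    b_plus = np.log((a - alpha) / (a + alpha)) - 0.5 * (a - alpha)
    b_minus = np.log((a + alpha) / (a - alpha)) - 0.5 * (a + alpha)
    return b_minus, b_plus

for a in (4.0, 5.0, 8.0, 10.0):
    b_lo, b_hi = bistable_boundary(a)
    print(f"a = {a:4.1f}: three rest points for {b_lo:7.3f} < b < {b_hi:7.3f}")
# At the cusp point a = 4 the two branches meet at b = -2.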
5.2.2 General Case

We now examine the most general case. We find it useful to introduce the variables $u = \ln\frac{x}{1-x}$, $v = \ln\frac{y}{1-y}$. Then the interior rest-point equations can be rewritten as

$$u = b + \frac{a}{1 + e^{-v}}, \qquad v = d + \frac{c}{1 + e^{-u}}, \qquad (5.23)$$

where $a$, $b$, $c$, and $d$ have already been defined in Eqs. (5.16)-(5.17). Eliminating $v$, we obtain

$$\frac{u}{a} - \frac{b}{a} = \left[\,1 + e^{-d - \frac{c}{1 + e^{-u}}}\,\right]^{-1} \equiv g(u). \qquad (5.24)$$

The solutions of Eq. (5.24) are the rest point(s) of the dynamics. Its graphical representation is shown in Fig. 5.3. It is easy to see that $0 < g(u) < 1$. Furthermore, we have from Eq. (5.24)

$$g'(u) = c\, g(1-g)\, \frac{1}{4\cosh^2(u/2)}. \qquad (5.25)$$

Thus, $g(u)$ is a monotonically increasing (decreasing) function whenever $c > 0$ ($c < 0$).

Figure 5.3: (Color online) Graphical representation of the general rest-point equation for two different values of $c$; intersections represent rest points.

Next, we classify the games according to the number of rest points they allow. Let us consider two cases:

(i) $ac < 0$: Note that in Eq. (5.24) the LHS is a monotonically increasing (decreasing) function for $a > 0$ ($a < 0$). As stated above, the RHS is also monotonically increasing (decreasing) whenever $c > 0$ ($c < 0$). Consequently, whenever $a$ and $c$ have different signs, i.e., $ac < 0$, one of the sides is monotonically increasing while the other is monotonically decreasing; thus, there can be only one interior rest point, which, due to the dissipative nature of the dynamics, is globally stable. An example of this class of games is Matching Pennies, which will be discussed in Section 5.3.

(ii) $ac > 0$: In this case it is possible to have one, two or three interior rest points. For the sake of concreteness, we focus on $a > 0$, $c > 0$, so that both the LHS and the RHS of Eq. (5.24) are monotonically increasing functions. Recall that at the critical point where the second solution appears, the LHS of Eq. (5.24) must be tangential to $g(u)$. Consider now the set of all tangent lines to $g(u)$ in Eq. (5.24), and let $\delta_{\min}$ and $\delta_{\max}$ be the minimum and maximum values of the intercepts among those tangent lines over all $u$ and $T_Y$. The intercept of the line given by the LHS of Eq. (5.24), on the other hand, equals $-b/a$, and is independent of the temperature. It is straightforward to check that multiple rest points are possible only when $\delta_{\min} < -b/a < \delta_{\max}$.

A full analysis along those lines (see Appendix 8.1) reveals that the number of possible rest points depends on the ratios $-b/a$ and $-d/c$, as depicted in Fig. 5.4. First, consider the parameter range $0 < -b/a, -d/c < 1$ (shaded light-grey region in Fig. 5.4), which corresponds to so-called coordination games that have three NE. The learning dynamics in these games can have three rest points, which intuitively correspond to the perturbed NE. In particular, those rest points converge to the NE as the exploration rates vanish. When $a, c < 0$, the parameter range $0 < -b/a, -d/c < 1$ corresponds to so-called anti-coordination games. Those games also have three NE, so the learning dynamics can have three rest points.

Let us now focus on the light grey (non-shaded) regions in Fig. 5.4. The games in this parameter range have a single NE. At the same time, the learning dynamics might still have multiple rest points. Those additional rest points exist only for a range of exploration rates, and disappear when both exploration rates $T_X, T_Y$ are sufficiently low or sufficiently high; see Appendix 8.2 for details. An example of this type of game will be presented in Section 5.3.

Figure 5.4: (Color online) Characterization of different games in the parameter space with $a, c > 0$. The dark blue region corresponds to games that can have only a single rest point, whereas games in the light grey regions can have three rest points. The shaded grey square corresponds to games that have three Nash equilibria.

Note that Fig. 5.4 was obtained by assuming that $T_X$ and $T_Y$ are independent parameters. Assuming some type of functional dependence between those two parameters alters the above characterization. For instance, consider the case $T_X = T_Y = T$. At the critical point we have (in addition to Eq. (5.24)) $a\, g'(u) = 1$, which yields

$$ac = \frac{4\cosh^2(u/2)}{g(1-g)}. \qquad (5.26)$$

It can be shown² that when $T_X = T_Y = T$ the above conditions can be met only when $0 < -b/a < 1$ and $0 < -d/c < 1$ (shaded region in Fig. 5.4), which corresponds to the domain of multiple NE: coordination ($a, c > 0$) and anti-coordination ($a, c < 0$) games. It is illustrative to write Eq. (5.26) in terms of the original variables $x$ and $y$:

$$ac = \frac{1}{x(1-x)\, y(1-y)}. \qquad (5.27)$$

It can be seen that Eq. (5.19) is recovered when $a = c$ and $x = y$. Furthermore, since $0 < x, y < 1$, the above condition can be satisfied only when $ac \ge 16$.

Linear Stability Analysis. We conclude this section by briefly elaborating on the dynamic stability of the interior rest points. Note that whenever there is a single rest point, it is globally stable due to the dissipative nature of the dynamics. Thus, we focus on the case when there are multiple rest points. For the interior rest points, the eigenvalues of the Jacobian of the dynamical system Eqs. (5.14)-(5.15) are as follows:
$$\lambda_{1,2} = -1 \pm \sqrt{ac\, y(1-y)\, x(1-x)}. \qquad (5.28)$$

Let us focus on symmetric games and symmetric equilibria (i.e., $x = y$). From Eq. (5.28) we find the eigenvalues $\lambda_{1,2} = -1 \pm a x_0 (1 - x_0)$, so that the stability condition is $a x_0(1-x_0) < 1$. Recalling that at the critical point we have $a = \frac{1}{x_0(1-x_0)}$, it is straightforward to demonstrate that for the middle rest point the above condition is always violated, meaning that it is always unstable. Similar reasoning shows that the two other rest points are locally stable and, depending on the starting point of the learning trajectory, the system will converge to one of the two points. An example of the flows generated by the dynamics for below-critical and above-critical exploration rates is depicted in Figs. 5.5(a) and 5.5(b).

²Indeed, substituting $g(u)$ from Eq. (5.24) into Eq. (5.26) one formally obtains a quadratic equation for $T$, $AT^2 + BT + C = 0$, with $A = \frac{4\cosh^2(u/2)}{a_0 c_0} + \frac{u^2}{a_0^2}$, $B = -\left(1 + \frac{2b_0}{a_0}\right)\frac{u}{a_0}$, $C = \left(1 + \frac{b_0}{a_0}\right)\frac{b_0}{a_0}$, where $a_0 = a_{11} + a_{22} - a_{12} - a_{21}$, $c_0 = b_{11} + b_{22} - b_{12} - b_{21}$, and $b_0 = a_{12} - a_{22}$. Requiring that $T$ be a real positive number yields $AC < 0$, or $0 < -b/a < 1$. By similar reasoning, the domain of $-d/c$ with multiple intersections is $0 < -d/c < 1$.

Figure 5.5: (Color online) Illustration of the dynamical flow for a system with three (a) and a single (b) rest point. Note that the middle rest point in (a) is unstable.

5.3 Examples

We now illustrate the above findings on several games shown in Fig. 5.6. The row (column) number corresponds to the actions of the first (second) agent. Each cell contains a reward pair $(a_{ij}, b_{ji})$, where $a_{ij}$ and $b_{ji}$ are the corresponding elements of the reward matrices $A$ and $B$.

Figure 5.6: Examples of reward matrices for typical two-action games: Prisoner's Dilemma, Coordination game, Hawk-Dove game, and Matching Pennies.

Our first example is the Prisoner's Dilemma (PD), where each player must decide whether to Cooperate (C) or Defect (D). An example of a PD payoff matrix is shown in Fig. 5.6. In the PD, defection is a dominant strategy: it always yields a better reward regardless of the other player's choice. Thus, even though it is beneficial for the players to cooperate, the only Nash equilibrium of the game is when both players defect. For $T_X = T_Y = 0$, the dynamics always converges to the NE. In our PD example we have $b/a = d/c = -2$, so according to Fig. 5.4 there is a single interior rest point for any $T_X, T_Y > 0$. Furthermore, due to the dissipative nature of the dynamics, the system is guaranteed to converge to this rest point for any finite exploration rates. Note that this is in stark contrast to the behavior of $\epsilon$-greedy learning reported in [62], where the authors observed that, starting from some initial conditions, the dynamics might never converge, instead alternating between different strategy regimes. The lack of convergence and the chaotic behavior in their case can be attributed to the hybrid nature of the dynamics.

Next, we consider Matching Pennies (MP), a zero-sum game in which the first (second) player wins if both players select the same (different) actions; see Fig. 5.6. This game does not have any pure NE, but it has a mixed NE at $x^* = y^* = 1/2$.
This mixed NE is a rest point of the learning dynamics at T X = T Y = 0 which is a center point surrounded by periodic orbits [23]. For this game we have ac< 0. Thus, there can be only one interior rest point, which can be globally stable for any T X ;T Y > 0. Furthermore, a particular feature of this game is that finite T X ;T Y does not perturb the position of the rest-point (since the entropic term is zero for x= y= 1 2 ). 36 We now consider a coordination game (shaded area in Fig. 5.4) where players have an in- centive to select the same action. In the example shown in Fig. 5.6, the players should decide whether to hunt a stag (S) or a hare (H). This game has two pure NE, (S,S) and (H,H), as well as a mixed NE at (x ;y )=( b a ; d c ), which, for the particular coordination game shown in Fig. 5.6, yields x = y = 2=5. For sufficiently small exploration rates, the learning dynamics has three rest points that intuitively correspond to the three NE of the game. Furthermore, the rest points corresponding to the pure equilibria are stable, while the one corresponding to the mixed equilibrium is unstable. When increasing the exploration rates, there is a critical line (T c X ;T c Y ) so that for any T X > T c X ;T Y > T c Y only one of the rest points survives. In Fig. 5.7 we show the bifurcation di- agram on the plane T X = T Y . 3 We find that most coordination games are characterized by a discontinuous pitch-fork bifurcation (see Fig. 5.7(a)), where above the critical line the surviving rest point correspond to the risk-dominant NE 4 . There is an exception, however, for games with b a + d c =1. This condition describes games where none of the pure NE are strictly risk dominant, and where the mixed NE satisfies x + y = 1. The rest point structure undergoes a continuous pitchfork bifurcation as shown Fig. 5.7(b) whenever a= c and b a + d c =1. One can show that when the above condition is met, the critical point u 0 that satisfies g 0 (u 0 )= 1 a , 1 a u 0 b a = g(u 0 ), is also the inflection point of g(u), g 00 (u 0 )= 0. The other class of two-action games with multiple NE are so-called anti–coordination games where it is mutually beneficial for the players to select different actions. In anti–coordination games, one has a;c< 0 whereas 0< b a < 1, 0< d c < 1. A popular example is the so called 3 We find that the bifurcation structure is qualitatively similar for the more general case T X 6= T Y . 4 In a general coordination game, the strategy profile (1,1) is risk dominant if (a 12 a 22 )(b 12 b 22 )(a 21 a 11 )(b 21 b 11 ). In symmetric coordination games (i.e., as shown in Fig. 5.4) the strategy profile is risk-dominant if it yields a better payoff against an opponent that plays a uniformly mixed strategy. 37 0 0.5 1 1.5 0 0.2 0.4 0.6 0.8 1 T x T c Unstable Stable (a) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 T x T c Stable Unstable (b) Figure 5.7: (Color online) Bifurcation diagram of the rest points for T X = T Y = T : (a) Discon- nected pitchfork, with mixed NE:(x ;y )=(2=5;2=5) (b) Continuous pitchfork, with mixed NE: (x ;y )=(1=3;2=3). Hawk-Dove game where players should choose between an aggressive (H) or peaceful (D) be- havior. This game has two pure NE, (H,D), (D,H), and a mixed NE at(x ;y )=( b a ; d c ). An example is shown in Fig. 5.6 with a mixed NE at x = y = 1=3. Anti-coordination games have similar bifurcation structure compared to the coordination games. 
Namely, there is a critical line(T c X ;T c Y ) so that for any T X > T c X ;T Y > T c Y only a single rest 38 point survives. As in the coordination games, the bifurcation is discontinuous for most parame- ter values. The condition for continuous pitch-fork bifurcation in the anti-coordination games is given by a= c and b a = d c . Thus, those games have a symmetric NE x = y . Furthrmore, the critical point where the second solution appears is also the inflection point of g(u), g 00 (u 0 )= 0. Finally, let us consider the games with a single NE, for which the learning dynamics can still have multiple rest points. To be specific, we focus on the case a;c> 0, for which the possible regimes are outlined in Fig. 5.4. In Fig. 5.8(a) , we show the dependence of the rest point structure on the parameter T X , for two different values of T Y , for b a = 0:1, d c =0:8. It can be seen that for sufficiently small T X , the learning dynamics allows a single rest point (that corresponds to the NE of the game). Similarly, there is single rest points whenever T Y is sufficiently hight. However, there is a critical exploration rate for agent Y , T c Y , so that for any 0< T Y < T c Y , there is a range T c X (T Y )< T X < T c + X (T Y ), for which the dynamics allows three rest points. In contrast to coordination and anti–coordination games considered above, those additional rest points do not correspond to any NE of the game. In particular, they disappear when T X ;T Y are sufficiently small. We elaborate more on the appearance of those rest points in Appendix 8.2. Fig. 5.8(b) shows the bifurcation diagram for the same game but plotted against T Y . Note that the two diagrams are asymmetric. In particular, in contrast to Fig. 5.8(a), here multiple solutions are possible even when T Y is arbitrarily small (provided that T c X (T Y ) < T X < T c + X (T Y )). This asymmetry is due to the fact that the agents’ payoff matrices represent different games. In this particular case, the first player’s payoff matrix corresponds to a dominant action game, whereas the second player’s payoff matrix corresponds to a coordination game. Clearly, when T X is very small, the first player will mostly select the dominant action, so there can be only a single rest 39 0 0.6 1.2 0.5 1 T X x T Y <T Y c T Y >T Y c T X c " T X c + (a) 0.15 0.3 0.5 1 T Y x T X c " <T X <T X c + T Y c (b) Figure 5.8: (Color online) Bifurcation in the domain of the games with a;c> 0, b a > 0, 1 2 > d c > 1. In this example we have: d c =0:8; b a = 0:1: a) Rest point structure plotted against T X for T Y < T c Y and T Y > T c Y . b) The rest point structure plotted against T Y for T c X < T X < T c + X (T Y ). In both graphs, the red dot-dashed lines correspond to the unstable rest points. point at small T X . Increasing T X will make the entropic term more important, until at a certain point, multiple rest points will emerge. 40 The same picture is preserved for the parameter range b a <1; 1 2 < d c < 0 (the other light grey horizontal stripe). On the other hand, the players effectively exchange roles in the parameter ranges corresponding to the vertical stripes: d c > 0;1< b a < 1 2 and d c <1; 1 2 < b a < 0. In this case, there is a critical exploration rate T c X , so that for any 0 < T X < T c X , there is a range T c Y (T X )< T Y < T c + Y (T X ), for which the dynamics allows three rest points. 
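To make the preceding discussion concrete, the following minimal numerical sketch (not part of the thesis) counts the interior rest points of the two-player Boltzmann Q-learning dynamics by solving the coupled fixed-point conditions T_X ln(x/(1−x)) = a y + b and T_Y ln(y/(1−y)) = c x + d. Here a, b, c, d are an assumed standard reduction of the payoff matrices (a = a_11 − a_12 − a_21 + a_22, b = a_12 − a_22 for the row player, and analogously for the column player); the numerical payoff values are hypothetical, chosen only so that the ratios mimic the single-NE example of Fig. 5.8 (b/a = 0.1 and d/c = −0.8, with the sign convention that places the game in the single-NE region).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def count_interior_rest_points(a, b, c, d, Tx, Ty, n_grid=400001):
    """Count interior rest points of the two-player Boltzmann Q-learning
    dynamics.  They satisfy  Tx*ln(x/(1-x)) = a*y + b  and
    Ty*ln(y/(1-y)) = c*x + d,  i.e. x is a fixed point of
    x -> sigmoid((a*sigmoid((c*x + d)/Ty) + b)/Tx)."""
    eps = 1e-12
    x = np.linspace(eps, 1.0 - eps, n_grid)
    y = sigmoid((c * x + d) / Ty)
    f = x - sigmoid((a * y + b) / Tx)
    return int(np.sum(np.sign(f[:-1]) != np.sign(f[1:])))

# Hypothetical payoff combinations with b/a = 0.1 and d/c = -0.8 (cf. Fig. 5.8):
a, b, c, d = 2.0, 0.2, 2.0, -1.6
for Tx in (0.1, 0.4, 2.0):
    n = count_interior_rest_points(a, b, c, d, Tx, Ty=0.05)
    print(f"T_X = {Tx:4.2f}, T_Y = 0.05  ->  {n} rest point(s)")
```

For small T_Y, this scan reproduces the pattern described above: a single rest point at small and at large T_X, and three rest points in an intermediate window of T_X.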
Finally, we note that the rest point behavior is different in the light grey regions where the parameters are also confined to b a > 0; d c <1 and b a <1; d c > 0. In those regions, multiple rest points are available only when both T X and T Y are strictly positive, i.e., T c X > 0, T c Y > 0. 41 Chapter 6 LEARNING WITH NAIVE PLAYERS FOR TWO-ACTION TWO-PLAYER GAMES 6.1 Model We 1 aim at generalizing the replicator equations (1.1) to situations in which agents are replaced by inexperienced individuals at a rate p. Let a large population of players engage in a game with strategies labeled i=f1;2;3;:::g. We divide the players in experiences classes: let n i (t;t) be, at time t, the number of players having been in the game for a time t and playing strategy i. The normalization condition for the n’s reads: å i ¥ å t=0 n i (t;t)= N 8t; (6.1) where N is a fixed total population size. We also define the total fraction of players adopting strategy i at time t, x i (t) =å ¥ t=0 n i (t;t)=N. As in the standard replicator equation (1.1), we 1 This section was accomplished by collaborating with Jeppe Juul, at Niels Bohr institute; Sebastian Bernhardsson in Swedish Defense Research Agency, and Simone Pigolotti in Dept. de Fisica i Eng. Nuclear, Universitat Politecnica de Catalunya. 42 introduce the average payoff of strategy i,p i (x(t)), and the average payoff across strategies ¯ p = å i x i p i . For the sake of simplicity, we consider a simple adaptive dynamics in which individuals learn of alternative strategies and their average payoff at a rate equal to the fraction of the population that plays by this strategy. Further, an individual playing strategy i learning of a strategy j with a higher average payoff, changes to strategy j with a probability proportional to the payoff differ- ence. Combining these assumptions gives an overall rate of change from strategy i to strategy j equal to n i x j (p j p i ), ifp j >p i . Furthermore, agents leave the population at rate p and are replaced by inexperienced agents. The new agents play each strategy i with a probability proportional to a given distribution x 0 i . We wish to study the time development of the n i (t;t). The learning dynamics encoded in the rules above reads: n i (t+1;t+1) i n i (t;t)=(1 p) å j:p j <p i n j (t;t)x i (t)(p i p j ) å j:p j >p i n i (t;t)x j (t)(p j p i ) ! pn i (t;t); (6.2) where the first sum represents players changing to strategy i from strategies with lower payoffs, and the second sum represents players changing from strategy i to strategies with higher payoffs. Performing a continuous time limit, the left hand side of the above equation becomes¶ t n i (t;t)+ ¶ t n i (t;t). We can now integrate the continuous time version of Eq. (6.2) over t and divide by the population size N, to obtain a closed evolution equation for the strategies x i (t). Let us recall that n i (0;t)=sNx 0 i , where the proportionality constant s can be determined by imposing that 43 the final equation preserves the normalization condition. We also assume that, due to the effect of turnover, one has lim t!¥ n i (t;t)= 0 8i;t. After combining the two sums, this results in d dt x i (t)=(1 p) å j x j (t)x i (t)(p i p j )+sx 0 i px i : (6.3) Imposing the normalization results ins = p. This also implies that the density of players having a given experience level at steady state is exponentially distributed,å i n i (t;¥)= N pexp(pt). 
If we furthermore use the normalization condition å j x j = 1 and the definition of the average payoff ¯ p, we get d dt x i =(1 p)x i (p i ¯ p)+ p(x 0 i x i ): (6.4) Upon rescaling time by 1 p and define a rescaled turnover ratec= p=(1 p), we finally obtain d dt x i = x i (p i ¯ p)+c(x 0 i x i ): (6.5) Eq. (6.5) constitutes the starting point of our analysis. It should be clear that our specific choice of adaptive dynamics is not crucial for deriving Eq. (6.5), and that the same macroscopic limit could be obtained for other microscopic adaptation rules leading to the replicator equation (1.1) in the absence of turnover. Eq. (6.5) can be of course also derived (or justified) heuristically, for example by making an analogy with a (damped) driven dynamical system, where the prior distribution x 0 i in the right hand side acts as a forcing term. Note that Eq. (6.5) can be formally recast in replicator form: d dt x i = x i (p i ¯ p)+c(x 0 i x i ) x i ( ˜ p i ¯ p); (6.6) 44 where the “effective payoff” of strategy i, ˜ p i , is defined as ˜ p i =p i +c x 0 i x i 1 : (6.7) In Eq. (6.6) the average payoff ¯ p does not contain a contribution from the second term in Eq. (6.7) as its average is zero, so thatå i x i p i =å i x i ˜ p i . The mapping in Eq. (6.6) is valid only in the interior of the simplex x i > 08i;å i x i = 1, while at the boundary the effective payoffs diverge. This situation has some resemblance to the case of evolutionary dynamics in the presence of mutations [22, 53]. The main difference is that in our case the “mutants” are not characterized by new, pure strategies or random strategies, but the mixed distribution x 0 i of the existing strategies. The fact that the replicator equation with turnover (6.5) can be rewritten in replicator form (6.6) implies that the two equations share several mathematical properties. Among these is that the mean effective payoff for the population can not decrease along any trajectory. If the pay- off function is continuous and bounded, it therefore serves as a Lyapunov function [50], which guarantees that all game dynamics either evolve to an equilibrium point or a closed orbit. For the replicator equation without turnover, c = 0, all Nash equilibria of a game are equilibrium points of the dynamics [26]. For positivec, the equilibrium points are generally different from the Nash equilibria, and we call these the turnover equilibria of the game. In the limit c!¥, agents are insensitive to the rewards and the initial strategy x 0 i is the only equilibrium point. 6.2 Two-agent two-action games We now turn to the broad class of games, where two populations can both choose between two strategies. Let us define the mixed strategies as(x;1 x) for the first population and(y;1 y) for 45 the second population. The payoffs obtained by the two populations depend on their joint actions [26] and are traditionally encoded in payoff matrices A and B: 0 B B @ p x 1 p x 2 1 C C A = 0 B B @ a 11 a 12 a 21 a 22 1 C C A 0 B B @ y 1 y 1 C C A (6.8) 0 B B @ p y 1 p y 2 1 C C A = 0 B B @ b 11 b 12 b 21 b 22 1 C C A 0 B B @ x 1 x 1 C C A ; (6.9) where p x 1 is the payoff of the first strategy for the first population, and so forth. The two popu- lations are learning concurrently and their turnover rates c x and c y can in principle be different. The learning dynamics can be obtained from Eq. (6.5). 
d dt x= x(1 x)(p x 1 p x 2 )+c x (x 0 x) (6.10) d dt y= y(1 y)(p y 1 p y 2 )+c y (y 0 y) (6.11) where the equations for the second strategies can be simply obtained from the normalization conditions. In the absence of turnover, this system is known as a bi-matrix replicator equation [23, 26]. In the turnover equilibrium, which we denote as (x;y)=(x ;y ), the time derivatives are equal to zero. Upon substituting the expressions (6.8) and (6.9) for the payoffs leads to the equations ay + a 12 a 22 =c x x x 0 x (1 x ) ; (6.12) bx + b 12 b 22 =c y y y 0 y (1 y ) ; (6.13) 46 where we have introduced a = a 11 + a 22 a 21 a 12 ; (6.14) b = b 11 + b 22 b 21 b 12 : (6.15) The number of solutions to equations (6.12) and (6.13) depends on the sign of the product ab. When ab < 0, there is always one mixed turnover equilibrium. When ab > 0, the number of turnover equilibria can increase at critical values of the turnover parameter. The derivation of these result can be found in appendix 9.1. In the following, we provide examples for two well-known two-action games belonging to the two categoriesab < 0 andab > 0: the game of matching pennies and a coordination game, respectively. 6.2.1 Matching pennies A simple example of ab < 0 is the zero sum game of matching pennies. Two players secretly turn a penny each to heads or tails and reveal the coins simultaneously. If the pennies match, player one receives a reward r from player two. If they do not, player one pays the reward to player two. This game is characterized by the payoff matrices A= 0 B B @ r r r r 1 C C A B= 0 B B @ r r r r 1 C C A : (6.16) The learning dynamics (6.10) and (6.11) with no turnover leads to a marginally stable Nash equilibrium surrounded by concentric closed orbits (see Fig. 6.1A). When a positive turnover of 47 0 1 2 3 4 5 0 1 2 3 4 5 χ x -0.2 -0.1 0 0.1 0.2 χ y Turnover equilibrium (x*, y*) Initial strategy (x , y ) Flow of dynamics Flow from initial strategy 1 0 1 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 x y A 0 0.2 0.4 0.6 0.8 1 x B C Figure 6.1: (Color online) Dynamics of the matching pennies game with r= 2 and initial strategy x 0 = y 0 = 0:3. A For c = 0, the average strategy oscillates at a fixed distance from the Nash equilibrium. B When a turnover is introduced, the average strategy converges to the turnover equilibrium. Herec x =c y = 1. C Expected payoff for population one in the turnover equilibrium for varying turnovers. Above a critical value ofc y = 2, the first population will win most games. players is introduced, the Nash equilibrium is perturbed to a stable turnover equilibrium(x ;y ), as displayed in Fig. 6.1B. Inserting the payoff matrices into the conditions for turnover equilib- rium (6.12) and (6.13) gives y = 1 2 + c x 4r x x 0 x (1 x ) ; (6.17) x = 1 2 + c y 4r y y 0 y (1 y ) ; (6.18) 48 Matching pennies is a zero sum game. The expected payoff of population one is positive if the two populations have a tendency of both playing heads or both playing tails, while it is zero if and only if one of the populations chooses their strategy randomly. From Eq. (6.17) and (6.18) we see that this happens in the turnover equilibrium if(x ;y )=(1=2;y 0 ) or(x ;y )=(x 0 ;1=2). For a given set of initial strategies of the populations, this corresponds to the critical turnover rates c x;c =r 1 2y 0 1 2x 0 (6.19) c y;c = r 1 2x 0 1 2y 0 : (6.20) Equation (6.19) can only be satisfied if population two wins most games when the initial strate- gies are employed. 
In this case, the expected payoff of population one is negative if it has a high turnover, since its equilibrium strategy is close to its initial strategy. If the turnover is decreased below the critical value (6.19), the first population starts winning more games than it loses, regardless of the turnover of population two. Even if population two has an even lower turnover, and thus has an initially superior strategy and more time to gain experience, the equilibrium state remains advantageous for population one. Likewise, equation (6.20) only gives a positive critical turnover if population one wins most games when the initial strategies are employed. Here, the payoff for population two goes from positive to negative when its turnover is increased past the critical value given by (6.20), regardless of the turnover of the first population. In Fig. 6.1C, the average payoff for population one is shown for varying turnovers of both populations and the initial strategies x_0 = y_0 = 0.3. In this case, if the turnover of population two is larger than χ_{y,c} = 2, population one wins most games.

6.2.2 Coordination game

When the payoff matrices of the two populations equal each other, A = B, one must necessarily have ab ≥ 0. One example is a coordination game, where the two players strive to play the same strategy. We consider the payoff matrices

A = B = ( 6  0
          3  2 )   (6.21)

where a = b = 5. Without player turnover, this game has two pure Nash equilibria where the populations employ the same strategy. In addition, there is a mixed Nash equilibrium at (x*, y*) = (2/5, 2/5) with one stable and one unstable manifold. The stable manifold constitutes the boundary between the basins of attraction of the two pure Nash equilibria (see Fig. 6.2A). For sufficiently small turnovers, there are three turnover equilibria close to these three Nash equilibria. The initial strategies (x_0, y_0) determine which equilibrium state the system goes to.

Let us consider a situation of strong initial disagreement, (x_0, y_0) = (0.9, 0.1), where inexperienced players of population one and two tend to employ the first and second strategy, respectively. We fix χ_y = 2, while varying χ_x as a control parameter. In this case, the point (x, y) = (0.9, 0.4) is always an equilibrium of the dynamics. At a critical value, χ_x ≈ 0.71, the boundary between the basins of attraction passes through the initial strategy (x_0, y_0) (see Fig. 6.2B-C), and the steady state changes discontinuously from being dominated by strategy two to being dominated by strategy one. At the same point, the payoff of both populations increases drastically.

Upon increasing the turnover of the first population even further, the basin of attraction of strategy two suddenly disappears at χ_x ≈ 0.91. For this value, the saddle-point equilibrium annihilates with the turnover equilibrium of strategy two (see Fig. 6.2D-E). For larger values of χ_x, there is only one turnover equilibrium. In this case the number of turnover equilibria changes through a saddle-node bifurcation, but for other payoff matrices A, B a pitchfork bifurcation could occur (see appendix 9.1). A full bifurcation diagram is shown in Fig. 6.2F, and the expected payoff of population one is shown in Fig. 6.2G as a function of the turnover rates. The abrupt change in payoff at a critical set of turnover rates can clearly be seen. The expected payoff of population two follows a similar pattern.
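The discontinuous change just described is easy to check numerically. The sketch below (not from the thesis; a simple Euler integration with step sizes of my own choosing) integrates the turnover dynamics (6.10)-(6.11) for the coordination payoff matrix (6.21), starting from the initial strategy (x_0, y_0) = (0.9, 0.1) with χ_y = 2 and two values of χ_x straddling the reported critical value χ_x ≈ 0.71.

```python
import numpy as np

def turnover_equilibrium(chi_x, chi_y, x0=0.9, y0=0.1, dt=0.01, t_max=3000.0):
    """Integrate the turnover dynamics (6.10)-(6.11) for A = B = [[6,0],[3,2]]
    and return the steady state.  Payoff differences for this matrix:
    pi_x1 - pi_x2 = 5y - 2  and  pi_y1 - pi_y2 = 5x - 2."""
    x, y = x0, y0                       # trajectory started at the prior
    for _ in range(int(t_max / dt)):
        dx = x * (1 - x) * (5 * y - 2) + chi_x * (x0 - x)
        dy = y * (1 - y) * (5 * x - 2) + chi_y * (y0 - y)
        x, y = x + dt * dx, y + dt * dy
    return x, y

for chi_x in (0.70, 0.72):              # values straddling chi_x ~ 0.71
    x_star, y_star = turnover_equilibrium(chi_x, chi_y=2.0)
    print(f"chi_x = {chi_x}:  (x*, y*) = ({x_star:.3f}, {y_star:.3f})")
```

The two runs should land in different turnover equilibria (strategy-two dominated versus strategy-one dominated), reproducing the discontinuous jump illustrated in Fig. 6.2B-C.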
In general, a low turnover of either population results in higher payoffs of both populations, since a larger proportion of the players has enough experience to mainly bid on the dominating strategy. However, increasing the turnover may change which equilibrium the system goes to, leading to an increase in the payoff of both players. 51 0 1 2 3 4 5 0 1 2 3 4 5 χ x 2 3 4 5 6 χ y 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 x y 0 0.2 0.4 0.6 0.8 1 x 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 x y 0 0.2 0.4 0.6 0.8 1 x 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 x y Turnover equilibrium (x*, y*) Initial strategy (x , y ) Basin of equilibrium 1 Basin of equilibrium 2 Flow of dynamics Flow from initial strategy Stable / unstable manifold 1 0 1 0 A B C D E F panel C panel B Figure 6.2: (Color online) Dynamics of a coordination game with initial strategy (x 0 ;y 0 ) = (0:9;0:1). A Without turnover, the game has one saddle point Nash equilibrium with stable manifolds that separate the basins of attraction of two pure Nash equilibria. B-C When the initial strategy goes from one basin of attraction to another, the resulting equilibrium state changes dis- continuously. Here c y = 2 and c x = 0:7 and 0:72, respectively. D-E At a critical set of turnover rates two turnover equilibria annihilates in a saddle node bifurcation. Herec y = 2 andc x = 0:91 and 0:92, respectively. F Bifurcation diagram for turnover equilibria. The point(x;y)=(0:4;0:9) is always an equilibrium. We observe a transcritical bifurcation at c x 0:4 and the saddle node bifurcation from panels D-E at c x 0:9. G Expected payoff of population one in the turnover equilibrium for varying turnovers. The dramatic change in payoffs between panel B and C can clearly be seen. 52 Part IV Co-evolution of networks and players strategy 53 Chapter 7 ADAPTING AGENTS ON EVOLVING NETWORKS 7.1 Co-Evolving Networks via Reinforcement Learning Let us consider a set of agents that play repeated games with each other. We differentiate agents by indices x;y;z;:::. At each round of the game, an agent has to choose another agent to play with, and an action from the pool of available actions. Thus, time–dependent mixed strategies of agents are characterized by a joint probability distribution over the choice of the neighbors and the actions. We assume that the agents adapt to their environment through a simple reinforcement mech- anism. Among different reinforcement schemes, here we focus on (stateless) Q-learning [59]. Within this scheme, the agents’ strategies are parameterized through, so called Q–functions that characterize relative utility of a particular strategy. After each round of game, the Q functions are updated according to the following rule, Q i xy (t+ 1)= Q i xy (t)+a[R i x;y (t) Q i xy (t)] (7.1) 54 where R i x;y (Q i x;y ) is the expected reward (Q value) of agent x for playing action i with agent y, anda is a parameter that determines the learning rate (which can set toa = 1 without a loss of generality). Next, we have to specify how agents choose a neighbor and an action based on their Q- function. Here we use the Boltzmann exploration mechanism where the probability of a particular choice is given as [52] p i xy = e bQ i xy å ˜ y; j e bQ j x ˜ y (7.2) Where, p i xy is the probability that the agent x will play with agent y and choose action i. Here the inverse temperatureb 1=T > 0 controls the tradeoff between exploration and exploitation; for T! 
0 the agents always choose the action corresponding to the maximum Q–value, while for T!¥ the agents’ choices are completely random. We now assume that the agents interact with each other many times between two consecutive updates of their strategies. In this case, the reward of the i–th agent in Equation 7.1 should be understood in terms of the average reward, where the average is taken over the strategies of other agents, R i x;y =å j A i j xy p j yx , where A i j xy is the reward (payoff) of agent x playing strategy i against the agent y who plays strategy j. Note that generally speaking, the payoff might be asymmetric. We are interested in the continuous approximation to the learning dynamics. Thus, we replace t+ 1! t+dt,a!adt, and take the limitdt! 0 in (7.1) to obtain ˙ Q i xy =a[R i x;y Q i xy (t)] (7.3) 55 Differentiating 7.2, using Eqs. 7.2, 7.3, and scaling the time t!abt we obtain the following replicator equation [46]: ˙ p i xy p i xy = å j A i j xy p j yx å i; j; ˜ y A i j x ˜ y p i x ˜ y p j ˜ yx + T å ˜ y; j p j x ˜ y ln p j x ˜ y p i xy (7.4) Equations 7.4 describe the collective adaptation of the Q–learning agents through repeated game– dynamical interactions. The first two terms indicate that the probability of playing a particular pure strategy increases with a rate proportional to the overall goodness of that strategy, which mimics fitness-based selection mechanisms in population biology [25]. The second term, which has an entropic meaning, does not have a direct analogue in population biology [46]. This term is due to the Boltzmann selection mechanism that describes the agents’ tendency to randomize over their strategies. Note that for T = 0 this term disappears, so the equations reduce to the conventional replicator system [25]. So far, we discussed learning dynamics over a general strategy space. We now make the assumption that the agents’ strategies factorize as follows, p i xy = c xy p i x ; å y c xy = 1; å i p i x = 1: (7.5) Here c xy is the probability that the agent x will initiate a game with the agent y, whereas p i x is the probability that he will choose action i. Thus, the assumption behind this factorization is that 56 the probability that the agent will perform action i is independent of whom the game is played against. Substituting 7.5 in 7.4 yields, ˙ c xy p i x + c xy ˙ p i x = c xy p i x å j a i j xy c yx p j y å i;y; j a i j x;y c xy c yx p i x p j y T lnc xy + ln p i x å y c xy lnc xy å j p j x ln p j x (7.6) Next, we sum both sides in Equation 7.6, once over y and then over i, and make use of the normalization conditions in Eq. 7.5 to obtain the following co-evolutioanry dynamics of actions and connections probabilities: ˙ p i x p i x = å ˜ y; j A i j x ˜ y c x ˜ y c ˜ yx p j ˜ y å i; j; ˜ y A i j x ˜ y c x ˜ y c ˜ yx p i x p j ˜ y + T å j p j x ln(p j x =p i x ) (7.7) ˙ c xy c xy = c yxå i; j A i j xy p i x p j y å i; j; ˜ y A i j x ˜ y c x ˜ y c ˜ yx p i x p j ˜ y + T å ˜ y c x ˜ y ln(c x ˜ y =c xy ) (7.8) Equations 7.7 and 7.8 are the replicator equations that describe the collective evolution of both the agents’ strategies and the network structure. The following remark is due: Generally, the replicator dynamics in matrix games are invariant with respect to adding any column vector to the payoff matrix. However, this invariance does not hold in the present networked game. The reason for this is the following: if an agent does not have any incoming links (i.e., no other agent plays with him/her), then he always gets a zero reward. 
Thus, the zero reward of an isolated agent serves as a reference point. This poses a certain problem. For instance, consider a trivial game with a constant reward matrix a i j = P. If 57 P> 0, then the agents will tend to play with each other, whereas for P< 0 they will try to avoid the game by isolating themselves (i.e., linking to agents that do not reciprocate). To address this issue, we introduce an isolation payoff C I that an isolated agent receives at each round of the game. It can be shown that the introduction of this payoff merely subtracts C I from the reward matrix in the replicator learning dynamics. Thus, we parameterize the game matrix as follows: a i j = b i j C I (7.9) where matrix B defines a specific game. An interesting question is what the reasonable values for the parameter C I are. In fact, what is important is the value of C I relative to the reward at the corresponding Nash equilibria, i.e., whether not playing at all is better than playing and receiving a potentially negative reward. We believe that different values of C I describe different situations. In particular, one can argue that certain social interactions are apparently characterized by large C I , where not participating in a game is seen as a worse outcome than participating and getting negative rewards. In the following, we treat C I as an additional parameter that changes in a certain range, and examine its impact on the learning dynamics. 7.1.1 Two-action games We focus on symmetric games where the reward matrix is the same for all pairs(x;y), A xy = A: A= 0 B B @ a 11 a 12 a 21 a 22 1 C C A (7.10) 58 Let p a , a2fx;y;:::;g, denote the probability for agent a to play action 1 and c xy is the prob- ability that the agent x will initiate a game with the agent y. For two action games, the learning dynamics Eqs. (7.7) , and (7.8) becomes: ˙ p x p x (1 p x ) = å ˜ y (ap ˜ y + b)c x ˜ y c ˜ yx + T log 1 p x p x (7.11) ˙ c xy c xy = r xy R x + T å ˜ y c x ˜ y ln c x ˜ y c xy (7.12) where r xy = c yx (ap x p y + bp x + d p y + a 22 ) (7.13) R x = å ˜ y (ap x p ˜ y + bp x + d p ˜ y + a 22 )c x ˜ y c ˜ yx (7.14) Here we have defined the following parameters: a = a 11 a 21 a 12 + a 22 (7.15) b = a 12 a 22 (7.16) d = a 21 a 22 (7.17) The parameters a and b allow a categorization of two action games as follows: Dominant action game: b a > 1 or b a < 0 Coordination game: a> 0;b< 0 and 1 b a Anti-Coordination (Chicken) game: a< 0;b> 0 and 1 b a 59 b a Dominant action Dominant action Coordination game Chicken game b = -a Figure 7.1: Categorization of 2-action games base on the reward matrix structure —(a;b) space. Before proceeding further, we elaborate on the connection between the rest points of the replicator system for T = 0, and the game-theoretic notion of Nash Equilibrium (NE) 1 . For T = 0 (no exploration) in the conventional replicator equations, all NE are necessarily the rest points of the learning dynamic. The inverse is not true - not all rest points correspond to NE - and only the stable ones do. Note also that in the present model the first statement does necessarily hold. This is because we have assumed the strategy factorization Eq. 7.5, due to which equilibria where the agents adopt different strategy with different players is not allowed. Thus, any NE that do not have the factorized form simply cannot be described in this framework. The second statement, however, remains true, and stable rest points do correspond to NE. 
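To make the coupled dynamics of Eqs. 7.11-7.12 concrete before analyzing them, the sketch below (not part of the thesis) performs a straightforward Euler integration of the strategy and link probabilities for N agents playing a symmetric two-action game. The parameters a, b, d of Eqs. 7.15-7.17 are computed from a 2x2 payoff matrix that is assumed to already include the isolation offset of Eq. 7.9; the particular payoff values in the example are hypothetical, and the integration details (step size, clipping, renormalization) are my own numerical choices rather than part of the model.

```python
import numpy as np

def coevolution(A, N=3, T=0.05, dt=0.01, steps=100000, seed=0):
    """Euler integration of the coupled replicator equations (7.11)-(7.12).
    A is a 2x2 payoff matrix with the isolation offset of Eq. (7.9) already
    absorbed.  Returns strategies p (prob. of action 1) and link matrix C."""
    a = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]      # Eq. (7.15)
    b = A[0, 1] - A[1, 1]                          # Eq. (7.16)
    d = A[1, 0] - A[1, 1]                          # Eq. (7.17)
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.2, 0.8, N)
    C = rng.uniform(0.1, 1.0, (N, N))
    np.fill_diagonal(C, 0.0)
    C /= C.sum(axis=1, keepdims=True)              # row-stochastic link weights
    eps = 1e-9
    for _ in range(steps):
        payoff = a * np.outer(p, p) + b * p[:, None] + d * p[None, :] + A[1, 1]
        r = C.T * payoff                           # r[x, y], Eq. (7.13)
        R = (C * r).sum(axis=1)                    # R_x,     Eq. (7.14)
        drive = ((a * p[None, :] + b) * C * C.T).sum(axis=1)
        dp = p * (1 - p) * (drive + T * np.log((1 - p) / p))
        ent = (C * np.log(C + eps)).sum(axis=1, keepdims=True) - np.log(C + eps)
        dC = C * (r - R[:, None] + T * ent)
        np.fill_diagonal(dC, 0.0)
        p = np.clip(p + dt * dp, eps, 1 - eps)
        C = np.clip(C + dt * dC, eps, None)
        np.fill_diagonal(C, 0.0)
        C /= C.sum(axis=1, keepdims=True)          # guard against numerical drift
    return p, C

# Hypothetical Prisoner's-Dilemma-type payoffs, isolation offset absorbed:
p, C = coevolution(np.array([[3.0, 0.0], [4.0, 2.0]]))
print("strategies:", np.round(p, 3))
print("links:\n", np.round(C, 3))
```

For small T and a Prisoner's-Dilemma-type matrix, the strategies should approach defection and the link matrix should develop a strongly reciprocated, star-like structure, which can be compared against the analysis in the following sections.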
1 Recall that a joint strategy profile is called a Nash equilibrium if no agent can increase his expected reward by unilaterally deviating from the equilibrium.

Figure 7.2: Examples of reward matrices for typical two-action games (Prisoner's Dilemma and a coordination game).

7.2 Learning without exploration

For T = 0, the learning dynamics Eqs. 7.11, 7.12 attain the following form,

ṗ_x / [p_x(1−p_x)] = Σ_ỹ (a p_ỹ + b) c_xỹ c_ỹx   (7.18)

ċ_xy / c_xy = r_xy − R_x   (7.19)

Consider the dynamics of strategies given by Equation 7.18. Clearly, the vertices of the simplex, p_x ∈ {0, 1}, are rest points of the dynamics. Furthermore, in case the game allows a mixed NE, the configuration where all the agents play the mixed NE p_x = −b/a is also a rest point of the dynamics. As will be shown below, however, this configuration is not stable, and for T = 0 the only stable configurations correspond to the agents playing pure strategies.

7.2.1 3-player games

We now consider the case of three players in a two-action game. This scenario is simple enough to study comprehensively, yet it still has non-trivial structural dynamics, as we will demonstrate below.

Figure 7.3: 3-player Nash equilibria for the Prisoner's Dilemma and the coordination game (panels a-d; non-isolated agents play either a pure strategy or the mixed strategy p ∈ {1, 0, −b/a}); see the text for more details.

7.2.1.1 Nash Equilibria

We start by examining the Nash equilibria for two classes of two-action games, the Prisoner's Dilemma (PD) and a coordination game. In PD, the players have to choose between Cooperation and Defection, and the payoff matrix elements satisfy b_21 > b_11 > b_22 > b_12; see Fig. 7.2. In the two-player PD, defection is a dominant strategy (it always yields a better reward regardless of the other player's choice), so the only Nash equilibrium is mutual defection. In a coordination game, the players have an incentive to select the same action. This game has two pure Nash equilibria, where the agents choose the same action, as well as a mixed Nash equilibrium. In a general coordination game the reward elements satisfy b_11 > b_21 and b_22 > b_12 (see Fig. 7.2).

In the 3-agent scenario, a simple analysis yields four possible network topologies corresponding to NE, depicted in Fig. 7.3. In all of those configurations, the agents that are not isolated select strategies that correspond to two-agent NE. Thus, in the case of PD, non-isolated agents always defect, whereas for the coordination game they can select one of three possible NE. We now examine those configurations in more detail.

Configuration I: In this configuration the agents x and y play only with each other, whereas the agent z is isolated: c_xy = c_yx = 1. Note that for this to be a NE, the agents x and y should not be "tempted" to switch and play with the agent z. For instance, in the case of PD, this yields p_z b_21 < b_22; otherwise players x, y would be better off linking with the isolated agent z and exploiting his cooperative behavior 2.

Configuration II: In the second configuration there is one central agent who plays with the other two: c_xz = c_yz = 1, c_zx + c_zy = 1. Note that this configuration is continuously degenerate, as the central agent can distribute his link weight arbitrarily among the two players. Additionally, the isolation payoff should be smaller than the reward at the equilibrium (e.g., smaller than b_22 for PD).
Indeed, if the latter condition is reversed, then one of the agents, say x, is better off linking with y instead of z, thus "avoiding" the game altogether.

Configuration III: The third configuration corresponds to a uniformly connected network where all the links have the same weight, c_xy = c_yz = c_zx = 1/2. It is easy to see that when all three agents play NE strategies, there is no incentive to deviate from the uniform network structure.

Configuration IV: Finally, in the last configuration none of the links are reciprocated, so that the players do not play with each other: c_xy c_yx = c_xz c_zx = c_yz c_zy = 0. This cyclic network is a Nash equilibrium when the isolation payoff C_I is greater than the expected reward of playing NE in the respective game.

2 Note that the dynamics will eventually lead to a different rest point where z now plays defect against both x and y.

7.2.1.2 Stable rest points of the learning dynamics

The Nash equilibria discussed in the previous section are rest points of the replicator dynamics. However, not all of those rest points are stable, so not all the equilibria can be achieved via learning. We now discuss the stability properties of the rest points.

One of the main outcomes of our stability analysis is that at T = 0 the symmetric network configuration is not stable. This is in fact a more general result that applies to n-agent networks, as shown in Appendix 10.2. As we will demonstrate below, the symmetric network can be stabilized when one allows exploration. The second important observation is that even when the game allows a mixed NE, such as the coordination game, any network configuration where the agents play a mixed strategy is unstable for T = 0 (see Appendix 10.2). Thus, the only outcome of the learning is a configuration where the agents play pure strategies.

The surviving (stable) configurations are listed in Fig. 7.4. Their stability can be established by analyzing the eigenvalues of the corresponding Jacobian. Consider, for instance, the configuration with one isolated player. The corresponding eigenvalues are

λ_1 = r_xz − r_xy,  λ_2 = r_yz − r_yx,  λ_3 = 0,
λ_4 = (1 − 2p_x)(r_x^1 − r_x^2) < 0,  λ_5 = (1 − 2p_y)(r_y^1 − r_y^2) < 0,  λ_6 = 0.

Figure 7.4: Stable rest points of the 3-player learning dynamics for (a) the Prisoner's Dilemma, with the two parameter regions b_22 ≥ −C_I and b_22 < −C_I, and (b) the coordination game, with the regions −C_I > b_11, b_11 ≥ −C_I > b_22, and −C_I ≤ b_22 (node shading indicates Action 1, Action 2, or a mixed strategy).

For the Prisoner's Dilemma, this configuration is marginally stable when agents x and y play defect and r_xy > 0, r_yx > 0. This happens only when b_22 ≥ −C_I, which means that the isolation payoff should be less than the expected reward for defection. Furthermore, one should also have r_xz < r_xy and r_yz < r_yx, which indicates that neither x nor y would get a better expected reward by switching to play with z (i.e., the NE condition). For the coordination game, assuming that b_11 > b_22, this configuration is stable when b_11 ≥ −C_I > b_22 or b_22 ≥ −C_I, depending on whether the agents coordinate on the first or the second action.

Similar reasoning can be used for the other configurations shown in Fig. 7.4. Note, finally, that multiple equilibria coexist over a range of parameters, except when the isolation payoff is sufficiently large, in which case the cyclic (non-reciprocated) network is the only stable configuration.

Figure 7.5: Observed stable configurations of the co-evolutionary dynamics for T = 0: star motifs S_2, S_3, S_4, ..., S_n.
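When examining converged outcomes of a numerical integration of Eqs. 7.11-7.12 (such as the sketch in Section 7.1 above), it is convenient to classify the resulting link matrix into the configurations of Figs. 7.3-7.5. The helper below is a heuristic sketch that is not part of the thesis: the reciprocity threshold is an arbitrary choice, and the example link matrix is made up for illustration.

```python
import numpy as np

def classify_three_player_motif(C, thresh=0.25):
    """Classify a converged 3x3 link matrix C (entry [i, j] ~ c_ij) into the
    T = 0 configurations discussed in the text.  A pair (i, j) counts as
    'playing together' when the reciprocity c_ij * c_ji exceeds a threshold."""
    W = C * C.T                                    # reciprocity weights
    pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
             if W[i, j] > thresh]
    if len(pairs) == 0:
        return "cyclic (no reciprocated links)"
    if len(pairs) == 1:
        k = ({0, 1, 2} - set(pairs[0])).pop()
        return f"reciprocated pair {pairs[0]}, agent {k} isolated"
    if len(pairs) == 2:
        hub = (set(pairs[0]) & set(pairs[1])).pop()
        return f"star with central agent {hub}"
    return "uniformly connected (all pairs reciprocated)"

# Made-up example: agents 0 and 1 play with each other, agent 2 is isolated.
C = np.array([[0.00, 0.95, 0.05],
              [0.95, 0.00, 0.05],
              [0.60, 0.40, 0.00]])
print(classify_three_player_motif(C))
```

Applied to simulated trajectories, such a classifier gives a quick empirical tally of how often each of the stable configurations of Fig. 7.4 is reached from random initial conditions.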
7.2.2 N-player games In addition to the three agent scenario, we also examined the co-evolutionary dynamics of general N-agent systems, using both simulations and analytical methods. We observed in our simulations that the stable outcomes of the learning dynamic consist of star motifs S n , where a central node of degree n 1 connects to n 1 nodes of degree 1 3 . Furthermore, we observed that the basin of attraction of motifs shrinks as motif size grows, so that smaller motifs are more frequent. We now demonstrate the stability of the star motif S n in n player two action games. Let player x be the central player, so that all other players are only connected to x, c ax = 1. Recall that the Jacobian of the system is a block diagonal matrix with blocks J 11 = ¶ ˙ c i j ¶c mn and J 22 = ¶ ˙ p m ¶ p n ; see Appendix 10.2. When all players play a pure strategy p i = 0;1 in a star shape motif, it can be shown that J 22 is diagonal matrix with diagonal elements of form (1 2p x )å ˜ y (ap ˜ y + b)c x ˜ y c ˜ yx , whereas J 11 is an upper triangular matrix, and its diagonal elements are either zero have the form (ap x p y + bp x + d p y + a 22 )c xy where x is the central player. For the Prisoner’s Dilemma, the Nash Equilibrium corresponds to choosing the second action (defection) , i.e. p a = 0. Then the diagonal elements of J 22 , and thus its eigenvalues, equal bc x ˜ y . J 11 , on the other hand, has n 2 2n eigenvalues , (n 1) of them are zero and the rest have the 3 This is true when the isolation payoff is smaller compared to the NE payoff. In the opposite case the dynamics settles into a configuration without reciprocated links. 66 form ofl =a 22 c x ˜ y . Since for the Prisoner’s Dilemma one has b< 0 then the start structure is stable as long as b 22 > C I . A similar reasoning can be used for the Coordination game, for which one has b < 0 and a+b> 0. In this case, the star structure is stable when either b 11 >C I or b 22 >C I , depending on whether the agents coordinate on the first or second actions, respectively. We conclude this section by elaborating on the (in)stability of the N-agent symmetric network configuration, where each agent is connected to all the other agents with the same connectivity 1 N1 . As shown in Appendix 10.5, this configuration can be a rest point of the learning dynamics Eq. (7.18) only when all agents play the same strategy, which is either 0;1 orb=a. Consider now the first block of the Jacobian in Eq. 10.5, J 11 = ¶ ˙ c i j ¶c mn . It can be shown that the diagonal elements of J 11 are identically zero, so that Tr(J 11 )= 0. Thus, either all the eigenvalues of J 11 are zero, or there is at least one eigenvalue that is positive, thus making the symmetric network configuration unstable at T = 0. 7.3 Learning with Exploration In this section we consider the replicator dynamics for non-vanishing exploration rate T > 0. For two agent games, the effect of the exploration has been previously examined in Ref. [30], where it was established that for a class of games with multiple Nash equilibria the asymptotic behavior of learning dynamics undergoes a drastic changes at critical exploration rates. Below we study the impact of the exploration in the current networked version of the learning dynamics. 67 Perturbed pure NE 0<p<1 Strong connection Weak connection Uniform connection (a) C xy T 0.0 (b) Figure 7.6: a) Possible network configurations for three-player PD (Fig. 7.2); (b) Bifurcation diagram for varying temperature. 
Two blue solid lines correspond to the configurations with one isolated agent and one central agent. The symmetric network configuration is unstable at low temperature (red line) and becomes globally stable above a critical temperature.

For 3-player, 2-action games we have six independent variables p_x, p_y, p_z, c_xy, c_yz, c_zx. The strategy variables evolve according to the following equations:

ṗ_x / [p_x(1−p_x)] = (a p_y + b) w_xy + (a p_z + b) w_xz + T log[(1−p_x)/p_x]
ṗ_y / [p_y(1−p_y)] = (a p_z + b) w_yz + (a p_x + b) w_xy + T log[(1−p_y)/p_y]
ṗ_z / [p_z(1−p_z)] = (a p_x + b) w_xz + (a p_y + b) w_yz + T log[(1−p_z)/p_z]

and the link variables evolve according to

ċ_xy / [c_xy(1−c_xy)] = r_xy − r_xz + T log[(1−c_xy)/c_xy]
ċ_yz / [c_yz(1−c_yz)] = r_yz − r_yx + T log[(1−c_yz)/c_yz]
ċ_zx / [c_zx(1−c_zx)] = r_zx − r_zy + T log[(1−c_zx)/c_zx]

Here we have defined w_xy = c_xy(1−c_yz), w_xz = (1−c_xy)c_zx, and w_yz = c_yz(1−c_zx), and a, b, d are defined in Eqs. 7.15, 7.16 and 7.17.

Fig. 7.6(a) shows three possible network configurations. The first two configurations are perturbed versions of a star motif (the stable solutions for T = 0), whereas the third one corresponds to a symmetric network where all players connect to the other players with equal link weights. Furthermore, in Fig. 7.6(b) we show the behavior of the learning outcomes for a PD game as one varies the temperature. For sufficiently small T, the only stable configurations are the perturbed star motifs, and the symmetric network is unstable. However, there is a critical value T_c above which the symmetric network becomes globally stable. We note that this type of behavior is general, and a similar transition is observed for N-player games as well.

Next, we consider the stability of the symmetric networks. As shown in Appendices 10.5 and 10.6, the only possible solution in this configuration is when all the agents play the same strategy, which can be found from the following equation:

(a p + b) = 2T log[p/(1−p)]   (7.20)

The behavior of this equation was analyzed in detail in [30]. In particular, for games with a single NE, this equation allows a single solution that corresponds to the perturbed NE. For games with multiple equilibria, on the other hand, there is a critical temperature T_c: for T < T_c there are two stable and one unstable solution, whereas for T ≥ T_c there is a single globally stable solution.

We use these insights to examine the stability of the symmetric network configuration for the coordination game, depending on the parameters T and C_I; see Appendix 10.4. In this example a = 5, b = −2, d = 1 for all three agents. The top panel of Fig. 7.7 shows the bifurcation diagram of p plotted versus T. Below the critical temperature three values of p exist; the two that correspond to the perturbed pure Nash equilibria are stable. The bottom panel of Fig. 7.7 shows the domain of T and C_I for which the homogeneous equilibrium is stable. When T → 0 this domain of C_I shrinks until, at T = 0, it reduces to a single point at which the isolation payoff equals the pure Nash equilibrium payoff (Fig. 7.7).

Figure 7.7: Domain of stable homogeneous equilibria (dark grey) for the coordination game of Fig. 7.2. The top panel shows the bifurcation of the strategy p versus T for the three-player game. The bottom panel shows the stable homogeneous domain for players choosing the first action (smaller grey area) and the second action (larger grey area), with markers at −b_22 and −b_11 on the C_I axis. Here the critical temperature is T_c = 0.36.
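The bifurcation just described is easy to reproduce numerically. The minimal root-counting sketch below (not from the thesis) locates the solutions of Eq. 7.20 for the coordination parameters a = 5, b = −2 used above, on either side of the reported critical temperature T_c ≈ 0.36.

```python
import numpy as np

def homogeneous_solutions(a, b, T, n_players=3, n_grid=200001):
    """Solve (a*p + b) = (n-1)*T*log(p/(1-p))  -- Eq. (7.20) / (10.20) --
    by locating sign changes of the residual on a fine grid in (0, 1)."""
    p = np.linspace(1e-9, 1 - 1e-9, n_grid)
    f = a * p + b - (n_players - 1) * T * np.log(p / (1 - p))
    idx = np.where(np.sign(f[:-1]) != np.sign(f[1:]))[0]
    return p[idx]                       # approximate solution locations

for T in (0.25, 0.45):                  # below and above T_c ~ 0.36
    roots = homogeneous_solutions(a=5.0, b=-2.0, T=T)
    print(f"T = {T:4.2f}:  {len(roots)} homogeneous solution(s) at "
          + ", ".join(f"{r:.3f}" for r in roots))
```

Below the critical temperature three homogeneous strategies coexist (the two outer ones stable), while above it only a single, globally stable solution remains, consistent with the bifurcation diagram of Fig. 7.7.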
70 Part V Conclusion 71 We have presented a comprehensive analysis of two agent Q–learning dynamics with Boltz- mann action selection mechanism, where the agents exploration rates are governed by tempera- tures T X ;T Y . For any two action game at finite exploration rate the dynamics is dissipative and thus guaranteed to reach a rest point asymptotically. We demonstrated that, depending on the game and the exploration rates, the rest point structure of the learning dynamics is different. When T X = T Y , for games with a single NE (either pure or mixed) there is a single globally stable rest point for any positive exploration rate. Furthermore, we analytically examined the impact of exploration/noise on the asymptotic behavior, and showed that in games with multiple NE the rest–point structure undergoes a bifurcation so that above a critical exploration rate only one globally stable solution persists. Previously, a similar observation for certain games was observed numerically in Ref. [60], where the authors studied Quantal Response Equilibrium (QRE) among agents with bounded rationality. In fact, it can be shown that QRE corresponds to the rest–point of the Boltzmann Q-learning dynamics. A similar bifurcation pictures was also demonstrated for certain continuous action games [15]. In general, we observed that for T X 6= T Y , the learning dynamics is qualitatively similar for games with multiple NE. Namely, there is a bifurcation at critical exploration rates T c X and T c Y , so that the learning dynamics allows three (single) rest points below (above) those critical values. In particular, the rest points converge to the NE of the game when T X ;T Y ! 0. What is perhaps more interesting is that for certain games with a single NE, it is possible to have multiple rest points in the learning dynamics when T X 6= T Y . Those additional rest points persist only for a finite range of exploration rates, and disappear when the exploration rates T X and T Y tend to zero. We suggest that the sensitivity of the learning dynamics on exploration rate can be useful for validating various hypotheses about possible learning mechanisms in experiments. Indeed, most 72 empirical studies so far have been limited to games with a single equilibrium, such as matching pennies, where the dynamics is rather insensitive to the exploration rate. We believe that for different games (such as coordination or anti-coordination game), the fine–grained nature of the rest point structure, and specifically, its sensitivity to the exploration rate, can provide much richer information about learning mechanisms employed by the agents. A very recent work reporting similar results [28], which studies convergence properties and bifurcation in the solution structure using local stability analysis. For games with a single rest point such a Prisoner’s Dilemma, local stability is subsumed by the global stability demonstrated here. The bifurcation results are similar, even though [28] studies only coordination games and does not differentiate between continuous and discontinuous pitchfork bifurcation. Finally, the analytical form of the phase diagram Eq. 10.25 for the symmetric case, as well as the possibility of multiple rest points for games with a single NE demonstrated here, are complementary to the results presented in [28]. In this work we have extended the theoretical framework of learning dynamics by considering the implications of agent turnover. 
Our model captures the differences in experience within a population, which will be present in most game theoretical contexts. One competitive setting, where turnover of agents is especially important, is online games, where experienced players may exploit the na¨ ıve strategies of new players. We therefore studied a large dataset of bidding distributions in online lowest unique bid auctions and showed that our model is able to describe the empirical steady state distributions far better than the Nash equilibrium. We expect that similar results can be obtained by studying data from other games. When a turnover of players is introduced, the game dynamics become richer. We have shown that it is possible to encounter bifurcations of turnover equilibria, that the expected equilibrium 73 can change discontinuously with the turnover rate, and that players with intermediate amounts of experience may get higher payoffs than very experienced players. In conclusion, a gradual turnover of agents, which is present in most real-life games, may change the game dynamics considerably and improve the ability to theoretically describe learning dynamics of a population of competing individuals. Finally, we have studied the co-evolutionary dynamics of strategies and link structure in a net- work of reinforcement learning agents. By assuming that the agents’ strategies allow appropriate factorization, we derived a system of a coupled replicator equations that describe the mutual evo- lution of agent behavior and network topology. We used these equations to fully characterize the stable learning outcomes in the case of three agents and two action games. We demonstrated that in the absence of any strategy exploration (zero temperature limit) learning leads to network composed o star-like motifs. Furthermore, the agents on those networks play only pure NE, even when the game allows a mixed NE. Also, even though the learning dy- namics allows rest points with a uniform network (e.g., an agent plays with all the other agents with the same probability) , those equilibria are not stable at T = 0. The situation changes when the agents explore their strategy space. In this case, the stable network structures undergo bifur- cation as one changes the exploration rate. In particular, there is a critical exploration rate above which the uniform network becomes a globally stable outcome of the learning dynamics. We note that the main premise behind the strategy factorization use here is that the agents use the same strategy profile irrespective of whom they play against. While this assumption is perhaps valid under certain circumstances, it certainly has its limitations that need to be studied further through analytical results and empirical data. Furthermore, the other extreme where the agent employs unique strategy profiles for each of his partners does not seem very realistic either, as it 74 would impose considerable cognitive load on the agent. A more realistic assumption is that the agents have a few strategy profile that roughly correspond to the type of the agent he is interacting with. The approach presented here can be in principle generalized to the latter scenario. 75 Part VI Appendices 76 Chapter 8 BOLTZMANN Q-LEARNING APPENDICES 8.1 Classification of games according to the number of allowable rest-points Here we derive the conditions for multiple rest-points. We assume a;c> 0 for the sake of con- creteness. Consider the set of all the tangential lines to g(u) (see Eq. 
5.24), and letd T Y (u) be the intercept of the tangential line that passes through point u, d T Y (u) = g(u) g 0 (u)u: Here the subscript indicates that the intercept depends on the exploration rate T Y via coefficients c and d. The extremum of functiond T Y (u) happens at dd T Y du =g 00 u= 0 where: g 00 (u)= cg(1 g) 16cosh 4 u 2 ctanh d 2 + c=2 1+ e u + 2sinhu (8.1) 77 Let u 0 be the point where g 00 (u 0 )= 0. A simple analysis yields that u 0 > 0 whenever d <c=2, and u 0 < 0 otherwise. Next, letd min = min u;T Y d T Y (u) andd max = max u;T Y d T Y (u), where minimiza- tion and maximization is over both u and T Y . It can be shown that there can be multiple solutions only whend min < b a <d max . We now consider different possibilities depending on the ratio d c . Due to symmetry, it is sufficient to consider d c < 1 2 . We differentiate the following cases: ı) 1< d c < 1 2 : In this case one hasd min =¥,d max = 1. Thus, there will be one rest point whenever b a <1. ıı) d c <1: In this case one hasd max = 1 2 , thus, there will be single rest point whenever b a < 1 2 . Furthermore, although an analytical expression for d min is not available, the corresponding boundary can be found by numerically solving a transcendental equation b a =d min for different d c . Repeating the same reasoning for d c > 1 2 yields the different regions depicted in Fig. 5.4. 8.2 Appearance of multiple rest points in games with single NE We now elaborate on games with single NE for which the learning dynamics still can have mul- tiple rest points. For the sake of concreteness, let us consider one of the regions in Fig. 5.4 that corresponds to b a > 0,1< d c <1=2. The graphical representation of the rest point equation is shown in Fig. 8.1. For a given T Y , the two lines correspond to the critical values of T c X (T y ) and T c + X (T Y ). Let us consider the case T Y = 0. It is easy to see that in this limit g(u) becomes a step function, g(u)= q(u ˜ u), where ˜ u is found by requiring d c + 1 1+e u = 0, which yields 78 −1 0 1 2 3 −0.5 0 0.5 1 1.5 u − b a T X = T c + X T X = T c − X g(u) Figure 8.1: Graphical illustration of the multi-rest point equation for a game with a single NE. Here a;c> 0, b a = 1 2 , d c = 3 4 . ˜ u= ln d d+c . Simple calculations yield T c X (T Y = 0)= ˜ a ˜ u b a and T c X (T Y = 0)= ˜ a ˜ u ( b a + 1), where ˜ a = a 21 + a 12 a 11 a 22 aT X . For general T Y > 0, the corresponding values T c X (T Y ) and T c + X (T Y ) can be found numerically. Finally, note that when increasing T Y , there is a critical ex- ploration rate T Y = T c Y so that for T Y > T c Y the multiple solutions will disappear. It is easy to see that T c Y corresponds to the point when the maximum value of the intercept to g(u) for a given T Y equals b a . 79 Chapter 9 LEARNING WITH NAIVE PLAYERS APPENDICES 9.1 Number of turnover equilibria in two-agent two-action games In this appendix, we show that the number of possible turnover equilibria in two-agent two-action games is dependent on the sign of the productab, wherea andb are given by (6.14) and (6.15). The conditions for turnover equilibrium are given by (6.12) and (6.13). Expressing y as a function of x , this can be written y (x )= c x a x x 0 x (1 x ) a 12 + a 22 a ; (9.1) bx + b 12 b 22 =c y y (x ) y 0 y (x )(1 y (x )) g(y ((x )); (9.2) where we have defined the composite function g(y ((x )). We see that g(1)= g(0)= 0 and that g(x ) has two singularity at y = 0 and y = 1. 
Since y is the probability of the second population to play strategy one in equilibrium, it must be between 0 and 1. Thus, only the interval of g(x ) between the two singularities are interesting. 80 Let us investigate the derivative ¶g ¶x . From Eq. (9.1) and (9.2) we get ¶y ¶x = c x a (x x 0 ) 2 +x 0 x 0 2 a(1x ) 2 x 2 (9.3) ¶g ¶x =c y (y y 0 ) 2 +y 0 y 0 2 (1y ) 2 y 2 ¶y ¶x : (9.4) Since x 0 and y 0 are probabilities, we always have x 0 > x 0 2 and y 0 > y 0 2 . The fractions in Eq. (9.4)and (9.4) are therefore positive, so the derivative ¶g ¶x must have the same sign as a. This means that g(x ) is either increasing or decreasing monotonically, depending on the sign ofa. If ab < 0, the line bx + b 12 b 22 and the function g(x ) have opposite slopes, and will therefore intersect exactly once. From (9.2) we see that there will then always be exactly one turnover equilibrium. In Fig. 9.1A this is illustrated for the matching pennies game with the same parameters as Fig. 6.1B. Ifab > 0, the linebx + b 12 b 22 and the function g(x ) are either both increasing or both decreasing, and they can therefore in principle intersect any odd number of times. Hence, it is possible to have multiple turnover equilibria, and these can appear or annihilate in pairs through either saddle point bifurcations or pitchfork bifurcations. Fig. 9.1B-C shows a saddle node bifurcation in the coordination game with the same parameters as Fig. 6.2D-E, except we have setc x equal to 0.8 and 1, respectively, to make the bifurcation clearer. 81 0 0 . 2 0 . 4 0 . 6 0 . 8 1 - 4 - 2 0 2 4 - 4 - 2 0 2 4 x * - 4 - 2 0 2 4 g ( y * ( x * ) ) β x * + b - b 1 2 2 2 0 < y * < 1 A B C Figure 9.1: The number of turnover equilibria in a two-action two-agent game can be determined by the number of intersections of left and right hand sides of equation 9.2. A When ab < 0 the two functions have opposite slopes and hence intersect exactly once in the relevant interval 0 < y < 1. Here the matching pennies game is illustrated with the parameters of Fig. 6.1B. B-C When ab > 0 the slope of the functions have the same sign, and new turnover equilibria can appear through bifurcations. Here the bifurcation in the coordination game of Fig. 9.1B-C is shown, except thatc x goes from to 0.8 to 1 between panels B-C. 82 Chapter 10 CO-EVOLUTIONARY LEARNING APPENDICES 10.1 Isolation cost (C p ) effect on learning dynamics We assume a punishment C p holds for a player if the player does not have a playmate. Let us for example consider a game with 3-player each with n-action. Here a player strategy is a distribution over n actions and two other players, so each agent has 2n choices. Let us say for agent x, x i represents i th choice, where all i n is for playing with agent y and all i > n is for playing with agent z. For agent y, j th choices is represented as y j , all j n is for playing with z and all j> n is for playing with agent x. For agent z, k th choices is represented z k , all k n is for playing with x and all k> n is for playing with agent y. Here we need to use step function q i , where q (i>n) means q i = 1 for all i larger than n and zero otherwise. To calculate the expected reward for agent x when it choose action i, r i x , r i x = 2n å j=1 y j q (in) [a i j q ( j>n) C p (1q ( j>n) )] + 2n å k=1 z k q (in) [a ik q (kn) C p (1q (kn) )] (10.1) 83 Substituting the 10.1 in the original replicator Eq. 
Chapter 10

CO-EVOLUTIONARY LEARNING APPENDICES

10.1 Isolation cost (C_p) effect on learning dynamics

We assume that a punishment $C_p$ applies to a player whenever that player has no playmate. As an example, consider a game with three players, each with $n$ actions. Here a player's strategy is a distribution over its $n$ actions and the two other players, so each agent has $2n$ choices. For agent $x$, let $x_i$ denote the $i$-th choice, where every $i\le n$ corresponds to playing with agent $y$ and every $i>n$ corresponds to playing with agent $z$. For agent $y$, the $j$-th choice is denoted $y_j$, where $j\le n$ corresponds to playing with $z$ and $j>n$ to playing with $x$. For agent $z$, the $k$-th choice is denoted $z_k$, where $k\le n$ corresponds to playing with $x$ and $k>n$ to playing with $y$. We also use the step function $\theta$, where $\theta_{(i>n)}$ equals 1 for all $i$ larger than $n$ and zero otherwise. The expected reward of agent $x$ when it chooses action $i$, $r_x^i$, is

r_x^i = \sum_{j=1}^{2n} y_j\,\theta_{(i\le n)}\big[a_{ij}\,\theta_{(j>n)} - C_p\,(1-\theta_{(j>n)})\big] + \sum_{k=1}^{2n} z_k\,\theta_{(i>n)}\big[a_{ik}\,\theta_{(k\le n)} - C_p\,(1-\theta_{(k\le n)})\big]   (10.1)

Substituting (10.1) into the original replicator equation (7.19), we obtain the general learning equation

\frac{\dot p_x^{\,y,i}}{p_x^{\,y,i}} = r_x^{\,y,i} - r_x   (10.2)

where

r_x^{\,y,i} = \sum_j (b_{ij}+C_p)\,p_y^{\,x,j}   (10.3)

r_x = \sum_{i,j,a} (b_{ij}+C_p)\,p_x^{\,a,i}\,p_a^{\,x,j}   (10.4)

We can repeat the same procedure for Eq. (7.18).

10.2 Rest points and local stability

As noted earlier, two sets of dynamical equations describe the learning of agents that select both their strategies and their playmates simultaneously. The rest points of Eq. (7.18) are the outcome of learning in strategy space, while the rest points of Eq. (7.19) describe the network topology resulting from learning. To study the local stability properties of those rest points, we need to analyze the eigenvalues of the Jacobian matrix of the dynamical system. A rest point is locally stable if all the eigenvalues $\lambda_i$ of the Jacobian at that rest point are non-positive.

For an $n$-player two-action game, we have $n$ action variables and $l=n(n-2)$ link variables, so that the total number of independent dynamical variables is $n+l=n(n-1)$. We can represent the Jacobian as

J=\begin{pmatrix} \partial\dot c_{ij}/\partial c_{mn} & \partial\dot c_{ij}/\partial p_m \\ \partial\dot p_m/\partial c_{ij} & \partial\dot p_m/\partial p_n \end{pmatrix} = \begin{pmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{pmatrix}   (10.5)

Here the diagonal blocks $J_{11}$ and $J_{22}$ are $l\times l$ and $n\times n$ square matrices, respectively, while $J_{12}$ and $J_{21}$ are $l\times n$ and $n\times l$ matrices.

For certain cases, some of the blocks of the Jacobian matrix vanish, which simplifies the calculations considerably. For example, we can calculate

J_{21} = \frac{\partial\dot p_i}{\partial c_{ij}} = p_i(1-p_i)\,c_{ji}\,(a p_i + b) = 0   (10.6)

That is, if all players have a playmate, then one can show that the rest-point condition of Eq. (7.18) requires $p_i=0$, $p_i=1$, or $p_i=-b/a$, each of which yields $J_{21}=0$. We use the block matrix determinant identity (see Section 10.3) to calculate the characteristic polynomial, which then takes the factorized form

p(\lambda) = \det(J_{11}-\lambda I)\,\det(J_{22}-\lambda I) = 0   (10.7)

Later, Eq. (10.7) is applied to several fixed points in the local stability analysis.

10.3 Block matrix identity

Let us recall the block matrix determinant identity: if $A\in\mathbb{R}^{n\times n}$ is invertible and $D\in\mathbb{R}^{m\times m}$, then

\det\begin{pmatrix} A & B \\ C & D\end{pmatrix} = \det A\;\det\!\big(D - C A^{-1} B\big)   (10.8)

This identity helps to calculate the eigenvalues of $J$. If

J=\begin{pmatrix} A & B \\ C & D \end{pmatrix}   (10.9)

then the roots of the characteristic polynomial $p(\lambda)=\det(J-\lambda I)$ give the Jacobian eigenvalues. In particular, when either $B$ or $C$ is a zero matrix, the characteristic polynomial simplifies to

p(\lambda) = \det(A-\lambda I)\,\det(D-\lambda I)=0.   (10.10)

10.4 Jacobian eigenvalues in two-action three-player dynamics for the homogeneous solution

The Jacobian of the learning dynamics for the three-player two-action game at the homogeneous equilibria has the blocks

J_{11}=\begin{pmatrix} -T & -\nu & -\nu \\ -\nu & -T & -\nu \\ -\nu & -\nu & -T \end{pmatrix},\qquad
J_{12}=\begin{pmatrix} 0 & \mu & -\mu \\ -\mu & 0 & \mu \\ \mu & -\mu & 0 \end{pmatrix},

J_{21}=\begin{pmatrix} 0 & -\gamma & \gamma \\ \gamma & 0 & -\gamma \\ -\gamma & \gamma & 0 \end{pmatrix},\qquad
J_{22}=\begin{pmatrix} -T & \kappa & \kappa \\ \kappa & -T & \kappa \\ \kappa & \kappa & -T \end{pmatrix},

where

\nu = \frac{a p^2 + b p + d p + b_{22} + C_p}{4}   (10.11)

\mu = \frac{a p + d}{8}   (10.12)

\gamma = \frac{p(1-p)(a p + b)}{2}   (10.13)

\kappa = \frac{a p (1-p)}{4}.   (10.14)

The six eigenvalues of the dynamics that determine the stability of the homogeneous equilibrium are

\lambda_1 = 2\kappa - T

\lambda_2 = -T - 2\nu

\lambda_{3,4} = \tfrac{1}{2}\Big(-\kappa - 2T + \nu - \sqrt{12\gamma\mu + (\kappa+\nu)^2}\Big)

\lambda_{5,6} = \tfrac{1}{2}\Big(-\kappa - 2T + \nu + \sqrt{12\gamma\mu + (\kappa+\nu)^2}\Big).
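To make the role of Eqs. (10.8)-(10.10) concrete, the following small check (not from the thesis) verifies the block-determinant identity numerically and confirms that when one off-diagonal block vanishes, as happens when $J_{21}=0$ at the rest points discussed above, the spectrum of the full matrix is just the union of the spectra of its diagonal blocks. The matrices are random placeholders; only the identity itself is being illustrated.

# Numerical check of the block-determinant identity (10.8) and of the
# factorized characteristic polynomial (10.10), using random placeholder blocks.
import numpy as np

rng = np.random.default_rng(0)
l, n = 3, 3                            # block sizes (l = n = 3 for three players)

A = rng.normal(size=(l, l))
B = rng.normal(size=(l, n))
C = rng.normal(size=(n, l))
D = rng.normal(size=(n, n))

# Eq. (10.8): det([[A, B], [C, D]]) = det(A) det(D - C A^{-1} B), A invertible
J = np.block([[A, B], [C, D]])
lhs = np.linalg.det(J)
rhs = np.linalg.det(A) * np.linalg.det(D - C @ np.linalg.inv(A) @ B)
print("block determinant identity holds:", bool(np.isclose(lhs, rhs)))

# Eq. (10.10): if C = 0 (the situation J_21 = 0), the characteristic polynomial
# factorizes, so eig(J) is the union of eig(A) and eig(D)
J0 = np.block([[A, B], [np.zeros((n, l)), D]])
eig_full = np.sort_complex(np.linalg.eigvals(J0))
eig_blocks = np.sort_complex(np.concatenate([np.linalg.eigvals(A),
                                             np.linalg.eigvals(D)]))
print("spectrum splits into block spectra:", bool(np.allclose(eig_full, eig_blocks)))

This is the simplification invoked in Eq. (10.7) for the rest points at which $J_{21}$ vanishes.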
10.5 Homogeneous network implies symmetric strategy

Let us consider a two-action game with $n$ players, where each player $i$ chooses action one with probability $p_i$. Here we prove that players $n$ and $n-1$ in a homogeneous network have the same strategy, i.e., $p_n=p_{n-1}$. Consider Eq. (10.19) written for players $n$ and $n-1$:

p_1 + p_2 + \dots + p_{n-2} + p_{n-1} = k\,\log\frac{p_n}{1-p_n} - c   (10.15)

p_1 + p_2 + \dots + p_{n-2} + p_{n} = k\,\log\frac{p_{n-1}}{1-p_{n-1}} - c   (10.16)

where

k = \frac{T(n-1)^2}{a}, \qquad c = \frac{b(n-1)}{a}.   (10.17)

We also define the function $g$ as

g(p_n) = p_n + k\,\log\frac{p_n}{1-p_n}.   (10.18)

Now, by subtracting Eq. (10.16) from Eq. (10.15), we obtain $g(p_n)=g(p_{n-1})$. Since $0<p_i<1$, the function $g$ is monotonic, so $g(p_n)=g(p_{n-1})$ if and only if $p_n=p_{n-1}$. By repeating the same reasoning for the remaining $p_i$ one can prove that $p_1=p_2=\dots=p_n$.

10.6 N-player homogeneous network solutions

Let us consider the case of a homogeneous network, in which each of the $n$ agents connects to each of the other $n-1$ agents with equal probability $\frac{1}{n-1}$; this is a rest point of Eqs. (7.11) and (7.12). For any positive exploration rate $T>0$ the pure-action solutions of the strategy dynamics, i.e., $p=0$ or $p=1$, are unstable rest points.¹ A mixed strategy is a rest point of Eq. (7.11); since we are interested in the homogeneous network, all the connection probabilities equal $\frac{1}{n-1}$, and the mixed strategy is obtained by solving the equation

\sum_{\tilde y}\,(a\,p_{\tilde y} + b) = T(n-1)^2\,\log\frac{p_x}{1-p_x}.   (10.19)

In a homogeneous network, all players have the same strategy (see Appendix 10.5). Thus Eq. (10.19) simplifies further with $p_x=p_y=\dots=p$, where $p$ is the solution of the equation

(a p + b) = T(n-1)\,\log\frac{p}{1-p}.   (10.20)

¹ As described in the appendix, in Eq. (10.5) let $A=\partial\dot p_x/\partial p_x$. The eigenvalues of $A$ are always positive at $p=0$ and $p=1$, which is a sufficient condition for instability at the homogeneous rest point of Eqs. (7.11) and (7.12).

As we already mentioned, this equation (aside from the factor $n-1$) was studied in [30]. The number of solutions of Eq. (10.20) is sensitive to the value of $T$, so there is a bifurcation of the solutions of Eq. (10.20) with respect to $T$: the equation can have one, two, or three solutions. At the critical temperature two solutions exist and we have

a = T(n-1)\,\frac{1}{p(1-p)},   (10.21)

or, alternatively,

p = \frac{1}{2}\left[1 \pm \sqrt{1 - \frac{4T(n-1)}{a}}\right].   (10.22)

Note that the above solution exists only when $\frac{a}{T(n-1)} \ge 4$. Plugging (10.22) into (10.20), we find

b = T(n-1)\,\ln\frac{a\mp\alpha}{a\pm\alpha} - \frac{1}{2}(a\mp\alpha), \qquad \alpha = \sqrt{a^2 - 4T(n-1)\,a}.   (10.23)

Thus, for any given $\frac{a}{T(n-1)} \ge 4$, the equation has three solutions whenever $b_c^- < b < b_c^+$, where

b_c^+ = T(n-1)\,\ln\frac{a-\alpha}{a+\alpha} - \frac{a-\alpha}{2},   (10.24)

b_c^- = T(n-1)\,\ln\frac{a+\alpha}{a-\alpha} - \frac{a+\alpha}{2}.   (10.25)

In the grey area of the bifurcation graph (Fig. 10.1), the multiple-solutions domain, players in homogeneous networks can pick out three strategies, while outside this domain players can only adopt one strategy. In the remainder of this work we will study the homogeneous network of three players for different types of two-action games.

Figure 10.1: Demonstration of the strategy bifurcation in the space of the parameters $a$, $b$, $T$, and $n$ for homogeneous equilibria. (The plot shows the curves $b_c^-(a)$ and $b_c^+(a)$ in the plane of $a/[T(n-1)]$ versus $b/[T(n-1)]$; the region between them is the multiple-solutions domain.)
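As a quick numerical illustration of Eq. (10.20) and of the boundaries (10.24)-(10.25), the sketch below (not code from the thesis) counts the homogeneous solutions on a fine grid of $p$ and checks whether the chosen $b$ lies inside the multiple-solutions window; the values of $a$, $b$, and $T$ are assumptions picked inside the $a/[T(n-1)]\ge 4$ regime.

# Illustrative sketch: count solutions of Eq. (10.20) and compare b with the
# critical boundaries b_c^- and b_c^+ of Eqs. (10.24)-(10.25).
# The parameter values a, b, T are assumptions.
import numpy as np

n, T = 3, 0.5           # three players, assumed exploration rate
a, b = 6.0, -2.9        # assumed payoff combinations, a/[T(n-1)] = 6 >= 4

def f(p):
    """Difference between the two sides of Eq. (10.20)."""
    return a * p + b - T * (n - 1) * np.log(p / (1.0 - p))

p = np.linspace(1e-6, 1.0 - 1e-6, 200001)
fp = f(p)
count = int(np.sum(np.sign(fp[:-1]) != np.sign(fp[1:])))
print("number of homogeneous solutions of Eq. (10.20):", count)

alpha = np.sqrt(a**2 - 4.0 * T * (n - 1) * a)   # as in Eq. (10.23)
b_plus = T * (n - 1) * np.log((a - alpha) / (a + alpha)) - (a - alpha) / 2.0
b_minus = T * (n - 1) * np.log((a + alpha) / (a - alpha)) - (a + alpha) / 2.0
print("b_c^- = %.3f, b_c^+ = %.3f" % (b_minus, b_plus))
print("inside the multiple-solutions region:", b_minus < b < b_plus)

For these numbers the scan reports three solutions and confirms that $b$ sits between $b_c^-$ and $b_c^+$; pushing $b$ outside that window, or lowering $a/[T(n-1)]$ below 4, leaves a single solution, reproducing the boundary of the grey multiple-solutions domain in Fig. 10.1.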
Bibliography

[1] T. Borgers and R. Sarin. Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77(1):1–14, 1997.
[2] M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In IJCAI, volume 17, pages 1021–1026, 2001.
[3] C.J. Burke, P.N. Tobler, M. Baddeley, and W. Schultz. Neural mechanisms of observational learning. PNAS, 107(32):14431, 2010.
[4] R.R. Bush and F. Mosteller. A mathematical model for simple learning. Psychological Review, 58(5):313–323, 1951.
[5] C.F. Camerer, T.-H. Ho, and J.-K. Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, pages 861–898, 2004.
[6] C. Castellano, S. Fortunato, and V. Loreto. Statistical physics of social dynamics. Reviews of Modern Physics, 81(2):591, 2009.
[7] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proc. of AAAI-1998/IAAI-1998, 1998.
[8] M. Costa-Gomes, V.P. Crawford, and B. Broseta. Cognition and behavior in normal-form games: An experimental study. Econometrica, 68(5):1193–1235, 2001.
[9] I.D. Couzin, C.C. Ioannou, G. Demirel, T. Gross, C.J. Torney, A. Hartnett, L. Conradt, S.A. Levin, and N.E. Leonard. Uninformed individuals promote democratic consensus in animal groups. Science, 334(6062):1578–1580, 2011.
[10] G. Demange. Group formation in economics: networks, clubs and coalitions. Cambridge University Press, 2005.
[11] R.A. Fisher. The Genetical Theory of Natural Selection. Oxford University Press, London, 1930.
[12] K. Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[13] D. Fudenberg and D.K. Levine. The Theory of Learning in Games. MIT Press, 1998.
[14] Tobias Galla. Intrinsic noise in game dynamical learning. Physical Review Letters, 103(19):198702, 2009.
[15] Aram Galstyan. Continuous strategy replicator dynamics for multi-agent Q-learning. Autonomous Agents and Multi-Agent Systems, pages 1–17, 2009.
[16] I. Gilboa and A. Matsui. Social stability and equilibrium. Econometrica: Journal of the Econometric Society, pages 859–867, 1991.
[17] J.K. Goeree and C.A. Holt. Stochastic game theory: For playing games, not just for doing theory. Proceedings of the National Academy of Sciences, 96(19):10564–10567, 1999.
[18] E. Gomes and R. Kowalczyk. Dynamic analysis of multiagent Q-learning with ε-greedy exploration. In Proc. of ICML-2009, 2009.
[19] S. Goyal. Connections: An Introduction to the Economics of Networks. Princeton University Press, 2009.
[20] T. Gross and B. Blasius. Adaptive coevolutionary networks: a review. Journal of the Royal Society Interface, 5(20):259, 2008.
[21] J.J. Halpern and R.N. Stern. Debating Rationality: Non Rational Aspects of Organizational Decision Making. Frank W. Pierce Memorial Lectureship and Conference Series. ILR Press/Cornell, 1998.
[22] J. Hofbauer. The selection mutation equation. Journal of Mathematical Biology, 23(1):41–53, 1985.
[23] J. Hofbauer. Evolutionary dynamics for bimatrix games: A Hamiltonian system? Journal of Mathematical Biology, 34:675–688, 1996.
[24] J. Hofbauer and E. Hopkins. Learning in perturbed asymmetric games. Games and Economic Behavior, 52(1):133–152, 2005.
[25] J. Hofbauer and K. Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.
[26] J. Hofbauer and K. Sigmund. Evolutionary game dynamics. Bulletin of the American Mathematical Society, 40(4):479, 2003.
[27] E. Hopkins. Two competing models of how people learn in games. Econometrica, 70(6):2141–2166, 2002.
[28] Michael Kaisers and Karl Tuyls. FAQ-learning in matrix games: Demonstrating convergence near Nash equilibria, and bifurcation of attractors in the battle of sexes. In AAAI IDGT11 Workshop, 2011.
[29] A. Kianercy and A. Galstyan. Dynamics of Boltzmann Q-learning in two-player two-action games. Physical Review E, 85(4):041145, 2012.
[30] A. Kianercy and A. Galstyan. Dynamics of Boltzmann Q-learning in two-player two-action games. Physical Review E, 85(4):041145, 2012.
[31] S. Kim, J. Hwang, H. Seo, and D. Lee. Valuation of uncertain and delayed rewards in primate prefrontal cortex. Neural Networks, 22(3):294–304, 2009.
[32] D. Lee, M.L. Conroy, B.P. McGreevy, and D.J. Barraclough. Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research, 22(1):45–58, 2004.
[33] D.S. Leslie and E.J. Collins. Individual Q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2):495–514, 2006.
[34] Panos L. Lorentziadis. Optimal bidding in auctions of mixed populations of bidders. European Journal of Operational Research, 217(3):653–663, 2012.
[35] J. Maynard Smith. Evolution and the Theory of Games. Cambridge University Press, 1982.
[36] John Maynard Smith. The theory of games and the evolution of animal conflicts. Journal of Theoretical Biology, 47(1):209–221, 1974.
[37] Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.
[38] J. Nash. Equilibrium points in n-person games. Proc. Natl. Acad. Sci., 36(1):48–49, 1950.
[39] John Nash. Non-cooperative games. The Annals of Mathematics, 54(2), 1951.
[40] M.A. Nowak. Evolutionary Dynamics: Exploring the Equations of Life. Belknap Press, 2006.
[41] M.A. Nowak, K.M. Page, and K. Sigmund. Fairness versus reason in the ultimatum game. Science, 289(5485):1773–1775, 2000.
[42] Jorge M. Pacheco, Arne Traulsen, and Martin A. Nowak. Coevolution of strategy and structure in complex networks with dynamical linking. Physical Review Letters, 97(25):258103, 2006.
[43] M. Perc and A. Szolnoki. Coevolutionary games – a mini review. BioSystems, 99(2):109–125, 2010.
[44] Y. Sato, E. Akiyama, and J.P. Crutchfield. Stability and diversity in collective adaptation. Physica D: Nonlinear Phenomena, 210(1-2):21–57, 2005.
[45] Y. Sato, E. Akiyama, and J.D. Farmer. Chaos in learning a simple two-person game. PNAS, 99(7):4748–4751, 2002.
[46] Y. Sato and J.P. Crutchfield. Coupled replicator equations for the dynamics of learning in multiagent systems. Physical Review E, 67(1), 2003.
[47] Peter Schuster, Karl Sigmund, Josef Hofbauer, Ramon Gottlieb, and Philip Merz. Self-regulation of behaviour in animal societies. Biological Cybernetics, 40(1):1–25, 1981.
[48] G. Silverberg. Evolutionary modeling in economics: recent history and immediate prospects. Citeseer, 1997.
[49] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proc. of Uncertainty in AI-2000, 2000.
[50] S. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engineering. Perseus Books Group, 2001.
[51] S.H. Strogatz. Nonlinear Dynamics and Chaos. Westview Press, 2001.
[52] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2000.
[53] Corina E. Tarnita, Tibor Antal, and Martin A. Nowak. Mutation-selection equilibrium in games with mixed strategies. Journal of Theoretical Biology, 261(1):50, 2009.
[54] A. Traulsen, J.C. Claussen, and C. Hauert. Coevolutionary dynamics: From finite to infinite populations. Phys. Rev. Lett., 95:238701, 2005.
[55] K. Tuyls, P.J.T. Hoen, and B. Vanschoenwinkel. An evolutionary dynamical analysis of multi-agent learning. JAAMAS, 12(1):115–153, 2006.
[56] K. Tuyls, K. Verbeeck, and T. Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In Proc. of AAMAS-2003, pages 693–700, 2003.
[57] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
[58] C.J.C.H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
[59] C.J.C.H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8(3):279–292, 1992.
[60] David H. Wolpert, Michael Harré, Eckehard Olbrich, Nils Bertschinger, and Juergen Jost. Hysteresis effects of changing the parameters of noncooperative games. Physical Review E, 85(3):036102, 2012.
[61] J.R. Wright and K. Leyton-Brown. Beyond equilibrium: Predicting human behavior in normal-form games. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 901–907, 2010.
[62] M. Wunder, M. Littman, and M. Babes. Classes of multiagent Q-learning dynamics with ε-greedy exploration. In Proc. of ICML-2010, 2010.
[63] E. Zeeman. Population dynamics from game theory. Global Theory of Dynamical Systems, pages 471–497, 1980.
[64] Gerd Zschaler. Adaptive-network models of collective dynamics. The European Physical Journal - Special Topics, 211(1):1–101, 2012.