REINFORCEMENT LEARNING BASED DESIGN OF CHEMOTHERAPY SCHEDULES FOR
AVOIDING CHEMO-RESISTANCE
by
Matthew Levi Giles
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF ARTS
(APPLIED MATHEMATICS)
May 2024
Copyright 2024 Matthew Levi Giles
Acknowledgements
I’d like to acknowledge Dr. Paul K. Newton, who first introduced me to the topic of evolutionary games
and encouraged me to pursue this degree. Without his continuous support and insightful feedback, this
work would be far less interesting, and I would be a far less effective analyst and mathematician.
Table of Contents
Acknowledgements
Abstract
Chapter 1: Introduction
    1.1 Cancer, Chemotherapy, and Chemo-Resistance (Motivation)
        1.1.1 Evolution of Cancer
        1.1.2 Chemo-Resistance and Cost of Resistance
        1.1.3 Competitive Release
        1.1.4 Evolutionary Game Theory and the Prisoner's Dilemma
        1.1.5 Replicator Equation
    1.2 Reinforcement Learning
        1.2.1 General Setup and Background for Reinforcement Learning
        1.2.2 Q-Learning
Chapter 2: Methods
    2.1 Model Formulation
        2.1.1 Chemotherapy
    2.2 Q-Learning Implementation
        2.2.1 Entropy
        2.2.2 Reward Structures
        2.2.3 Training
Chapter 3: Results
    3.1 Learned Policies
    3.2 Most Probable Trajectories
        3.2.1 Robustness
    3.3 Stochastic Trials
        3.3.1 Simplified Policies
        3.3.2 Simulation Data
Chapter 4: Conclusions
Bibliography
Abstract
While chemotherapy is an effective tool for combating cancer, it has also been known to promote the evolution of chemo-resistance within a tumor by the mechanism of competitive release of chemo-resistant cancer phenotypes. In this work, a reinforcement-learning approach is used to develop optimized chemotherapy schedules to control the coevolution of three cancer cell phenotypes. A stochastic, discrete-time evolutionary game is established as a model upon which to train the reinforcement learning algorithm. This
model includes two chemotoxins, C1 and C2, and three coevolving tumor cell populations: S, sensitive
to both C1 and C2; R1, resistant to C2 but sensitive to C1; and R2, resistant to C1 but sensitive to C2. In
the absence of chemotherapy, a prisoner’s dilemma game is played between the S population and each of
the resistant populations, establishing a "cost to resistance." Learned policies are evaluated by examination
of the evolutionary trajectories they promote, as well as by probabilistic experiments to determine their
effectiveness in avoiding species extinction (which would ultimately result in fixation of a chemo-resistant
phenotype). It was found that the learned policies were able to achieve species diversity by leveraging the
natural dominance of the S subpopulation to out-compete the resistant populations, while administering
chemotherapy when necessary to prevent fixation of any individual cell type.
Chapter 1
Introduction
1.1 Cancer, Chemotherapy, and Chemo-Resistance (Motivation)
1.1.1 Evolution of Cancer
Cancer is a disease in which bodily cells begin to reproduce and spread uncontrollably, to the detriment
of the host. Typically, cell growth is a very regulated process: cells receive signals to undergo cell division
(to multiply) or trigger apoptosis (programmed cell death), and perform these functions only when such
signals are present. [30][1] Occasionally however, by a series of random mutations an individual cell may
develop a way to avoid these regulating mechanisms, granting it the ability to ignore apoptosis signalling,
undergo cell division even in the absence of the appropriate signals, or otherwise gain some benefit that
allows it to proliferate freely. This is a cancerous cell. The cell may then multiply to form a tumor, invading
nearby tissues or spreading to distant sites (metastasis). Subsequent mutations may ensue, providing the
growing cancer population with further advantages to their reproductive ability or longevity.
When viewed from an ecological perspective, this process is a form of somatic evolution: a series of
coincidental mutations lead to phenotypical changes in a cell that provide it with a competitive advantage
over its neighbors. The mutated (cancer) cell population is then able to invade the healthy cell population.
Though individual cancer cells demonstrate higher reproductive prowess than individual healthy cells,
unregulated cancer growth will eventually lead to death of both the patient and the tumor. Thus, cancer development is a form of clonal evolution marked by Darwinian selection for proliferation, survival,
and a disposition toward further evolution at the cellular level. This starkly contrasts with selection at the system
(multicellular) level, where cell cooperation and differentiation are valued. [2][21]
1.1.2 Chemo-Resistance and Cost of Resistance
According to the most recent data from the Centers for Disease Control and Prevention (CDC), "Malignant Neoplasms"
(i.e. cancerous tumors) are the second leading cause of death across all age groups in the United States.
[10] To combat cancer mortality, a wide suite of treatments for this disease have been developed and implemented. Chemotherapy broadly describes a category of these cancer treatments that utilize chemical
agents (also called chemotoxins, or chemotherapeutic agents) to eliminate cancerous cells. Chemotherapeutic approaches to cancer treatment are common and can often result in significant reduction of tumor
mass, particularly in the early stages of treatment. However, chemotherapy is also known to promote
chemo-resistance in the evolving cancer population, resulting in failure of subsequent treatment. The exact mechanisms for resistance may vary, and depend on the administered drug. As one example, consider
methotrexate, a drug commonly involved in the treatment of patients afflicted with acute lymphoblastic leukemia (ALL), among other types of cancer. For a cancer drug to exert any effect, it must
first successfully enter the targeted cancer cells, sometimes in very specific intracellular concentrations.
Solute carrier (SLC) transporters are a class of proteins which cells use to facilitate the influx of nutrients,
vitamins, or drugs. One specific SLC transporter, the reduced folate carrier (RFC) glycoprotein, is important
in moving methotrexate across the cell membrane. It has been found that a short series
of mutations can lead to an expressed reduction in the tendency for methotrexate to bind RFC in ALL
patients, limiting drug influx and resulting in drug resistance. [29]
The mechanism outlined above has been shown to grant resistance not only to treatment by methotrexate, but also by pralatrexate, another cancer drug that similarly relies on RFC transporters to enter the cell.
On the other hand, drugs that do not rely on these transporters will not have their performance impacted
by this mutation. For example, the chemotoxin cisplatin has been found to interact with the transporters
Ctr1, Ctr2, ATP7A, ATP7B, OCT2, and MATE1, [5] and therefore a phenotypic change in RFC transporters
would not be expected to grant a cell any resistance to cisplatin. Thus while a cell may develop resistance
to one (or several) drugs, it may still exhibit vulnerability to another drug.
Moreover, reduced activity of RFC transporters may simultaneously reduce influx of nutrients and
other important resources to the cell. In the context of evolutionary selection, this implies that in the
absence of chemotherapy, the resistant cell would be expected to exhibit a lower reproductive fitness than a
chemo-sensitive cancer cell that does not express any phenotypic change in its RFC transporters. This is the
principle of a "cost to resistance:" the idea that the development of resistance to a drug often comes at the
cost of a decline in fitness when the drug is absent. [15] The principle has been used to describe observed
phenomena in a variety of ecological contexts, including bacterial evolution in response to antibiotics,
[22][6] resistance of pests to insecticide [12][17], and resistance of cancer cells to chemotherapy [15][36].
Note however that while the principle of a cost to resistance has been directly or indirectly measured
in several particular cases, this is not a universal observation, and there are instances where observed
developments of drug resistance have not conferred any noticeable fitness cost. [22] Nevertheless, it is
hoped that these imposed fitness tradeoffs are prevalent enough that they can be leveraged to avoid the
evolution of resistance.
1.1.3 Competitive Release
If we take this principle as an assumption, chemotherapy can be understood to bring about chemo-resistance by the mechanism of competitive release. Competitive release is an ecological phenomenon in which
the removal of one species from an ecosystem relieves another species of competition, allowing it to thrive.
This is unsurprising: if a species’ growth is inhibited by competition with its neighbor (e.g. for food, or
space), it seems obvious that removing the neighbor from the environment would enable the species to
proliferate freely. A classic study demonstrating this phenomenon in nature can be found in [8]. In this
work, the author identified two competing species of barnacle residing along a stretch of the Scottish coastline: Chthamalus stellatus and Balanus balanoides. The adults of each species resided in different zones of
the shoreline. Chthamalus adults were almost exclusively found in the upper zone, and Balanus adults in
the lower. However, some Chthamalus young were identified in the lower zone, indicating that Chthamalus
was able to settle, but not survive, in this lower zone. When Connell experimentally removed the Balanus
barnacles from regions settled by young Chthamalus, the Chthamalus were found to exhibit much greater
rates of survival than in comparable regions where Balanus were not disturbed. This implies that the poor
survival rate of Chthamalus in these lower regions was not due to the species being poorly suited to these conditions, but
instead due to competition from Balanus for space and resources. By removing Balanus, the Chthamalus
were then able to thrive.
It is easy to draw an analogy between the dynamics observed in this study and the invasion of resistant
cancer cells in a tumor. Consider a heterogeneous neoplasm composed mostly of chemo-sensitive cells,
but also featuring a small subpopulation of cells that have mutated a resistance to a particular drug. If
this developed resistance also bears a fitness cost, then prior to the start of treatment it is expected that
the sensitive cells will outcompete the resistant cells, which will have difficulty proliferating and invading
the tumor. However, once chemotherapy is administered the sensitive population will begin to die out,
eliminating the primary competitor for the chemo-resistant population. This resistant subpopulation can
then invade the tumor, resulting in failure of subsequent chemotherapy treatment.
Having introduced the necessary background information, the goal of this work can be stated more
clearly: to develop chemotherapy schedules that can leverage fitness costs associated with chemo-resistance to maintain an extant population of chemo-sensitive cells and avoid competitive release of a
chemo-resistant population.
1.1.4 Evolutionary Game Theory and the Prisoner’s Dilemma
Constructing chemotherapy schedules to control the trajectory of cancer evolution requires an acute understanding of the associated dynamics. As discussed in section 1.1.1, cancer can be seen as a clonal
evolutionary system, to which much of the existing theory of ecological evolutionary dynamics can be
applied. In particular, evolutionary game theory provides a promising framework for modeling cancer
development. [27][7][42][45][35] In this approach, individual cells are modeled as players in a dynamic
game, who receive payoffs for their interactions with other players. The payoff received by an individual is
a function of its strategy as well as the strategies chosen by each other player in the game, and corresponds
to reproductive prowess. In the context of cell evolution, a strategy refers to a set of inheritable traits that
are fixed for any given individual, rather than a chosen tactic that may change over time. Individuals with
successful strategies reproduce while others are outcompeted and eliminated. The theory of this field was
first developed by mathematical biologists as early as the 1980s [34] and has since proven to be a powerful
quantitative tool in understanding the dynamics of evolutionary selection. [42]
To apply evolutionary game theory in practice requires an accompanying fitness landscape which maps
the genotype of individuals in the game to their fitness. Notably, fitness is a function of the frequency of
each strategy in the population. Recall that individuals in an evolutionary game receive payoffs based on
their interactions with other individuals in the game. This payoff is quantified by a payoff matrix, which
has the following general structure (assuming n allowable strategies):
M =
\begin{array}{c|ccc}
       & 1       & \cdots & n       \\
\hline
1      & m_{1,1} & \cdots & m_{1,n} \\
\vdots & \vdots  & \ddots & \vdots  \\
n      & m_{n,1} & \cdots & m_{n,n}
\end{array}
\qquad (1.1)

where m_{i,j} is the payoff received by an individual of strategy i when interacting with an individual of
strategy j. Fitness can then be computed as the expected payoff received by an individual (see section 2.1
for further details).
Of particular relevance to cancer evolution is the prisoner’s dilemma game. This is a two-player game
traditionally described in terms of a cooperating strategy (C) and a defecting strategy (D). Its payoff matrix
can be given by:
M =
\begin{array}{c|cc}
  & C & D \\
\hline
C & a & b \\
D & c & d
\end{array}
=
\begin{array}{c|cc}
  & C & D \\
\hline
C & 3 & 0 \\
D & 5 & 1
\end{array}
\qquad (1.2)
The defining aspect of a prisoner’s dilemma is the structure c > a > d > b, while the exact values
(5, 3, 1, 0) are frequently encountered in literature but do not hold any special relevance. Notice that this
game has a Nash equilibrium fixed at mutual defection. That is, for each player, choosing to defect will
yield greater personal payoff, regardless of the opponent’s strategy. On the other hand, mutual defection
awards each player the relatively small payoff d = 1 (though still preferable to receiving payoff
b = 0), whereas mutual cooperation provides each player the payoff a = 3. Thus even though cooperation
yields the greatest total reward for the whole system, rational players are locked into mutual defection.
The prisoner's dilemma is well studied due to its relevance in various fields, including biology, economics, and political science. [27][28] It is also often leveraged when discussing cancer evolution, where
healthy cells are labelled cooperators and cancer cells are labelled defectors. The cooperators behave in
a cohesive manner, responding as intended to signals triggering cell division and apoptosis so that the
population remains balanced. Mutual cooperation yields the greatest total reward for the system and it is
unsurprising that this cellular behavior would be selected for on a multicellular level. On the other hand,
defectors (cancer cells) thrive in a population of cooperators (healthy cells) but once defection saturates
the system, the total payoff of the multicellular system is minimized, leading to death of the entire cell
population. Moreover, once defection is established it becomes difficult for cooperation to re-emerge. It is
this analogy that makes the prisoner’s dilemma so common in cancer modelling. [42][14][16]
In this work, the interactions between sensitive and resistant cancer cells are considered rather than
between cancer cells and healthy cells. Nevertheless, a prisoner’s dilemma game is still imposed between
these species, with reasoning as follows. In the absence of chemotherapy, the sensitive
cells (defectors) are expected to outcompete the resistant cells (cooperators) owing to the "cost of resistance" discussed in
section 1.1.2. This implies the payoff structure c > a and d > b. However, when there is no inter-species
competition, the benefit of resistance provides a fully-resistant population with a competitive edge over a
fully-sensitive population, hence the relation a > d. In combination, these inequalities yield the familiar
prisoner's dilemma payoff structure c > a > d > b. The fitness landscape resulting from
this game and the corresponding model for cancer evolution are discussed further in section 2.1.
1.1.5 Replicator Equation
To see how evolutionary games are used to predict population dynamics, consider an infinite population
of individuals, each of a specific phenotype that governs its interactions with others in its environment.
Suppose that the proportion (or frequency) of the population with phenotype i is given by x_i, and that
the interactions between individuals can be quantified by the payoff matrix presented in Equation 1.1.
Assume that each individual in the population is equally likely to interact with every other individual.
Then the probability of interacting with an individual of phenotype j is simply xj . From this assumption,
it is possible to compute the expected payoff of an individual of phenotype i at any time as a function of
the current population frequency ⃗x = (x_1, x_2, \ldots, x_n):

f_i(\vec{x}) = \sum_j x_j m_{i,j} \qquad (1.3)
where m_{i,j} is the (i, j)th element of the payoff matrix. By equating expected payoff with reproductive
fitness, the average fitness of the population can be written:

\phi(\vec{x}) = \sum_i x_i f_i(\vec{x}) \qquad (1.4)
Naturally, a subpopulation whose phenotype expresses greater fitness than the population average should
be expected to reproduce and grow. Moreover, it is not unreasonable to assume that the rate of this growth
should be proportional to the incremental fitness advantage exhibited by the phenotype over its competitors. A similar argument can be made for a subpopulation expressing a lesser fitness; that its frequency in
the population should be expected to shrink, reasonably at a rate proportional to the fitness disadvantage
it exhibits. This reasoning gives rise to the dynamic equation:
\dot{x}_i = x_i \left( f_i(\vec{x}) - \phi(\vec{x}) \right) \qquad (1.5)
which is famously called the replicator equation, [37] and is a key result in the field of evolutionary dynamics. It describes how a population of many phenotypes evolves based on the frequency-dependent
fitness of individuals.
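For illustration only, the replicator dynamics of Equation 1.5 can be integrated numerically with a simple forward-Euler step. The sketch below is not part of this work's model: it uses the prisoner's dilemma payoff matrix of Equation 1.2, and the step size and iteration count are arbitrary choices for demonstration.

import numpy as np

# Prisoner's dilemma payoff matrix from Equation 1.2 (row/column order: C, D).
M = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def replicator_step(x, M, dt=0.01):
    # One forward-Euler step of the replicator equation (Equation 1.5).
    f = M @ x              # expected payoffs f_i(x) = sum_j x_j m_ij   (Equation 1.3)
    phi = x @ f            # average fitness phi(x) = sum_i x_i f_i     (Equation 1.4)
    return x + dt * x * (f - phi)

x = np.array([0.9, 0.1])   # start with 90% cooperators, 10% defectors
for _ in range(2000):
    x = replicator_step(x, M)
print(x)                   # defection fixates, as the game's Nash equilibrium predicts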
It should be noted that while the model employed to describe cancer evolution in this work (presented
later in Section 2.1) is based on the same principles as those introduced above, it does feature some key
dynamical differences. These differences arise from the fact that the model is a stochastic one, set in a
fixed, finite population of size N, rather than an infinite one, giving rise to interesting random phenomena
that are not present in the dynamics of Equation 1.5.
1.2 Reinforcement Learning
The stated task of designing chemotherapy schedules is ultimately a control problem: the dynamical system under study is the evolving cancer population, the objective function is a maximization of cell heterogeneity (discussed in section 2.2.1), and the control parameter is the administered chemotherapy concentration. In this work, a reinforcement learning-based approach is employed to develop optimal chemotherapy
policies. An introduction to this topic is provided below, adapted from [4].
Reinforcement Learning (RL) is a branch of machine learning that is particularly well-suited to optimization problems in the context of control policies. A reinforcement learning agent can sense the state
of its environment and choose a corresponding action to take. It is then rewarded or punished, based on
the result of its chosen action, and assigns values to the action accordingly. Through repeated trial and
error, the agent refines the value assigned to each action, at every state, to develop (i.e. “learn”) an optimal
control policy for maximizing its expected reward. This positive/negative reinforcement approach mimics
behavioral learning in animals, where repeated trial and error experimentation is met with positive or
negative feedback. Usually, RL algorithms involve an initial exploration phase where the agent performs
this trial and error learning sequence, followed by an exploitation phase, where the agent uses the policy
it has learned to maximize the benefit it receives from its environment.
An important feature of RL is that the agent is able to learn strategies that provide delayed rewards.
That is, the agent can consider and fairly evaluate actions that provide lesser immediate benefit but potentially greater long-term benefit. This enables RL to identify globally optimal policies in situations where
other machine learning approaches might only find local optima.
1.2.1 General Setup and Background for Reinforcement Learning
An underlying assumption in RL is that the RL agent operates in an environment that is describable by
a Markov decision process (MDP). An MDP is a discrete-time stochastic process defined by the 4-tuple
(S, A, T, R) where:
• S is the set of possible states s for the environment.
• A is the set of allowable actions a that the RL agent may take.
• T := T(s, s′, a) is the transition probability function, describing the probability that the state will transition to s′ given that the current state is s and the RL agent takes the action a.
• R := R(s, s′) is the immediate reward function describing the reward that the agent receives by transitioning from state s to s′.
By fixing the action a and writing T(s, s′, a) = T_a(s, s′), it is clear that the tuple (S, T_a) defines an
ordinary Markov process. Thus an MDP is simply an extension of a Markov process to include actions and rewards,
and importantly it shares the property that at any time, the probability of being in a future state depends
only on the current state of the system, and not on previous states. If the state space is discrete and finite,
then T_a := T_a(s, s′) is simply a transition matrix and the vector ⃗s ϵ R^n can be defined as the probability
of being in each of the n states s ϵ S. Then, for any fixed action, ⃗s′ = T_a ⃗s defines a Markov process.
In reinforcement learning, an RL agent is able to sense the state s ϵ S of its environment and take an
action a ϵ A of its choosing. The choice of a is decided by a policy π(s, a) = Pr[ a = a | s = s ]. That
is, π(s, a) assigns a probability to choosing each allowable action a, given that the current state is s. The
goal of RL is to optimize π(s, a) to maximize future rewards from any starting state. In the context of this
work, the state space S and action space A are discrete, and the learned policy π(s, a) takes the form of a
lookup table assigning a value to each pair (s, a) ϵ (S × A) such that, for all s ϵ S, \sum_{a \in A} \pi(s, a) = 1. The
equation describing the Markov state transition probabilities can then be extended to:

\vec{s}\,' = \sum_{a \in A} \pi(s, a)\, T_a\, \vec{s} \qquad (1.6)

to accommodate a non-fixed action a.
The notation s_k and a_k will now be adopted to reflect the system state and action chosen, respectively,
at time step k. Rewards factor into RL through a value function, which describes the desirability of being
in a given state. This function is always conditioned on a policy π. If r_k := R(s_{k-1}, s_k | π) is the reward
received at step k of the MDP, then the value function conditioned on π is given by:

V_\pi(s) = E\left( \sum_{k \ge 1} \gamma^{k} r_k \;\middle|\; s_0 = s \right) \qquad (1.7)
This is the sum of expected rewards over all future time steps k, discounted at a rate γ. The purpose of
the discount rate is to capture the well-established economic principle that delayed (i.e. future) rewards
are less valuable than immediate rewards. The previously stated goal of RL can then be formalized as a
determination of:
\arg\max_{\pi(s,a)} \; V_\pi(s) \qquad (1.8)
for all s ϵ S, which yields the optimal policy for maximizing rewards.
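As a hedged illustration of Equation 1.7 (not code used in this work), the value of a state under a fixed policy can be estimated by Monte Carlo rollouts against any simulator that exposes a one-step transition. The step and policy callables below are assumed interfaces, and the rollout counts are arbitrary.

import random

def estimate_value(step, policy, s0, gamma=0.9, n_rollouts=500, horizon=200, seed=0):
    # Monte Carlo estimate of V_pi(s0) = E[sum_{k>=1} gamma^k r_k | s_0 = s0] (Equation 1.7).
    # step(s, a, rng) -> (s_next, r) is an assumed stochastic environment interface;
    # policy(s, rng) -> a samples an action from pi(s, .).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, gamma   # discounting starts at gamma^1, matching the sum over k >= 1
        for _ in range(horizon):
            a = policy(s, rng)
            s, r = step(s, a, rng)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_rollouts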
1.2.2 Q-Learning
RL techniques can be subdivided into model-based and model-free techniques. When the MDP being
investigated evolves according to a known model, there exist RL methods that can leverage this information
to quickly learn effective policies. When such a model is not available, other strategies must be used. This
work focuses on cancer evolution, which is a complex system that is not comprehensively described by any
existing mathematical model. For this reason, a model-free approach is taken to developing chemotherapy
schedules. A cancer-chemo model (described in section 2.1) is still provided as a testbed and placeholder, to
emulate experimental trials and evaluate learned policies; however, the RL agent does not have direct access
to this model. The remainder of this discussion is therefore limited to model-free approaches, specifically
Q-Learning.
Most RL techniques rely (explicitly or implicitly) on a quality function Q(s, a), defined as:
Q(s_k, a_k) = E\left[\, r_{k+1} + \gamma V_\pi(s_{k+1}) \mid a_k \,\right] \qquad (1.9)
            = \sum_{s_{k+1} \in S} P(s_{k+1} \mid s_k, a_k)\left( r_{k+1} + \gamma V_\pi(s_{k+1}) \right) \qquad (1.10)
From which the optimal policy π(s, a) can be extracted as:
\pi(s, a) = \arg\max_{a} Q(s, a) \qquad (1.11)

V(s) = \max_{a} Q(s, a) \qquad (1.12)
Intuitively, the quality function captures the best (discounted) expected future rewards for all actions, given
the starting state. Notice in Equation 1.9 that a recursive form of the value function Vπ(s) is used. In fact,
by writing:
V(s) = E\left( r' + \gamma V(s') \right) \qquad (1.13)
it becomes clear that r' + \gamma V(s') is an unbiased estimator for V(s). This formulation leads directly into one
of the simplest model-free RL algorithms: temporal difference (TD) learning. In TD learning, an estimate
of Vπ is maintained for a given policy π. When an experiment is conducted from state s (resulting in
transition to s' and reward r), the estimate of V_\pi is updated as follows:
V_\pi^{new}(s) = V_\pi^{old}(s) + \alpha \left( r + \gamma V_\pi^{old}(s') - V_\pi^{old}(s) \right) \qquad (1.14)
where α ϵ [0, 1] is a pre-selected learning rate that weights the importance of recent experience in the
learning process. By defining the expression R_\Sigma := r + \gamma V_\pi^{old}(s') as the TD target, the parenthesized term
\left( r + \gamma V_\pi^{old}(s') - V_\pi^{old}(s) \right) = R_\Sigma - V_\pi^{old}(s) can be interpreted as the TD error. Then equation 1.14 is simply
a correction of the value function based on the TD error, and weighted by the learning rate α.
As a short aside, it should be mentioned that equation 1.14 is the 1-step look ahead TD (i.e. TD(0))
update equation, and that a more general n-step look ahead (TD(n)) equation also exists in which the TD
target is written:
R_\Sigma^{(n)} = r_k + \gamma r_{k+1} + \cdots + \gamma^{n} r_{k+n} + \gamma^{n+1} V_\pi(s_{k+n+1}) \qquad (1.15)
which (in the limit that n + 1 is the full length of the experiment) converges to a Monte Carlo learning
algorithm.
A natural progression from the 1-step look ahead TD algorithm is Q-learning, an RL strategy developed
in 1989 [39] that attempts to learn the quality function Q directly. The algorithm maintains an estimate
of Q(s, a), which it updates as the RL agent continues to explore through trial and error experimentation.
Specifically, if at time k the RL agent takes action ak from state sk, resulting in a state transition to state
sk+1 and a received reward of rk, then Q(sk, ak) is updated as follows:
Q^{new}(s_k, a_k) = Q^{old}(s_k, a_k) + \alpha \left( r_k + \gamma \max_{a} Q(s_{k+1}, a) - Q^{old}(s_k, a_k) \right) \qquad (1.16)
This is the Bellman equation, which corrects the existing estimate of Q(s, a) after each experimental trial
according to the TD error associated with the best available action from the new state sk+1. This is the same
idea that is leveraged in TD(0) learning; however, notice that in equation 1.14 the TD target is evaluated
according to a fixed policy π that is not yet optimized. In this way, Q-learning is described as an off-policy RL algorithm. Practically, this means that during training it is admissible to explore with suboptimal actions but still update Q using the optimal action. Perhaps more importantly, it means that it is
possible to learn from previous data and experiments, even those conducted with actions taken according to
another policy. In the context of a medical setting, a chemotherapy policy could be trained using Q-learning
by analyzing data on patient response to treatment without the need to implement experimental
treatment policies which may endanger the patient’s wellbeing. For this primary reason, the present
work examines the use of Q-learning for developing chemotherapy schedules.
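A minimal tabular sketch of the update in Equation 1.16 is given below for concreteness. It is illustrative only (the dictionary-based table and variable names are not taken from this work), though the default learning rate and discount factor match the values used later (α = 0.05, γ = 0.9).

from collections import defaultdict

Q = defaultdict(float)   # tabular estimate Q[(state, action)], initialized to zero

def q_update(s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    # Off-policy Q-learning update (Equation 1.16) after observing the transition (s, a, r, s_next).
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # max_a Q(s_{k+1}, a), the greedy target
    td_error = r + gamma * best_next - Q[(s, a)]         # TD error with respect to that target
    Q[(s, a)] += alpha * td_error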
Given sufficient exploration time and a partly random exploration policy, Q-learning has been proven
to converge to an optimal action policy. [38][23] Further details on the implementation of this algorithm
are presented in section 2.2.
Chapter 2
Methods
2.1 Model Formulation
In this section, a mathematical model of cancer evolution is presented for the purpose of training and
evaluating chemotherapy policies. Though adapted for the application at hand, this model is strongly
based on the model introduced by Martin Nowak in Chapter 7 of [27]. The developed model is a stochastic
evolutionary game set in discrete time. It takes the form of a birth-death process between three competing
cell subpopulations:
• S, the universally sensitive cell type.
• R1, resistant to C2 but sensitive to C1.
• R2, resistant to C1 but sensitive to C2.
As discussed in section 1.1, this work investigates the relationship between chemotherapy use and the
evolution of chemo-resistance. Here, C1 and C2 represent two different chemotherapeutic drugs that are
effective against the sensitive cell type S, but to each of which one of the cell types R1 or R2 express a
phenotypic resistance.
In this model, the system state is given by the vector ⃗x = (x0, x1, x2), where x0, x1, and x2 are the
current populations of S, R1, and R2 type cells respectively. The total population x0 + x1 + x2
is fixed and denoted N. Thus the state space for this system is given by SN = {(x0, x1, x2) ϵ N^3 : x0 + x1 + x2 = N}.
For ease of notation, S, R1, and R2 type cells will sometimes be called 0-, 1-, or 2-type
cells, or i-type in the general case. Note that the state space SN can be represented as a triangular simplex
in two dimensions: the vertices are the points ⃗x = (N, 0, 0), ⃗x = (0, N, 0), and ⃗x = (0, 0, N); the
edges are the states where exactly one species is extinct (i.e. the set {⃗x ϵ SN : xi = 0 for exactly one i});
and the interior of the simplex consists of the states where all cell types are extant, i.e. the set {⃗x ϵ SN : xi > 0 ∀ i}.
To quantify interactions between the three cell subpopulations, a payoff matrix is given by
M =
\begin{array}{c|ccc}
     & S & R_1 & R_2 \\
\hline
S    & a & b   & c   \\
R_1  & d & e   & f   \\
R_2  & g & h   & i
\end{array}
=
\begin{array}{c|ccc}
     & S   & R_1 & R_2 \\
\hline
S    & 2.0 & 2.8 & 2.8 \\
R_1  & 1.5 & 2.1 & 2.1 \\
R_2  & 1.5 & 2.1 & 2.1
\end{array}
\qquad (2.1)
When two cells in this model interact with one another, this matrix describes the payoff that each cell
receives, depending on its own type and its competitor’s type. Here, row labels reflect the cell type of the
individual receiving the reward, and columns reflect the cell type of the competitor. For example, an S type
cell interacting with an R1 type cell would receive a payoff of 2.8, while the R1 cell receives a payoff of 1.5.
The structure of this payoff matrix was chosen to establish a prisoner’s dilemma (PD) game between S and
R1 and between S and R2. That is, b > e > a > d, and c > i > a > g. A PD is a domination-class
game, characterized by the fact that regardless of the opponent’s strategy (i.e. cell type), the greatest payoff
is received by the strategy (cell type) S. At the same time, the payoff received by two interacting cells of
type R1 (or R2) is greater than that received by two interacting S type cells. Thus, intuitively, the game
naturally encourages evolution of S type cells even though a system saturated with S cells receives less
total payoff than a population saturated with resistant cells. Further details on the choice of a PD game
for this model can be found in section 1.1.4. In addition to these PD games, a neutral game is established
between R1 and R2, defined by the relation e = f = h = i, to impose a symmetry relation among these
species and avoid unnecessary complicating factors in understanding the evolution of this system.
It is assumed that every cell in the system has an equal probability of interacting with every other cell.
Given the payoff matrix M and population state ⃗x, the expected payoff Fi received by a cell of type i can
then be computed:
F_i = \sum_j \frac{x_j}{N} M_{i,j} \qquad (2.2)

where x_j/N is the probability that the focal cell interacts with a j-type cell and M_{i,j} is the (i, j)th entry of M,
corresponding to the reward the cell would receive against a cell of type j.
In nature, survival and reproduction are not always a pure function of selection: random, neutral factors
are an important component of evolution. To capture the intensity of selection in this model, a parameter
⃗w = (w0, w1, w2) ϵ [0, 1]^3 is introduced, with elements w_i corresponding to the selection pressure imposed
on cells of type i. The fitness f_i of a cell of type i is then defined as:

f_i = 1 - w_i + w_i F_i \qquad (2.3)
Thus as wi → 1, fi → Fi and selection strongly governs evolution. On the other hand, as wi → 0,
selection becomes nonexistent and the evolutionary dynamics of the system approach neutral drift.
Given the computations for fitness presented in equations 2.2 and 2.3, the birth-death process proceeds
as follows. At each time step, one cell is chosen for elimination and another for reproduction. Both choices
are random and independent; the cell to be eliminated is chosen with uniform probability, while the cell
to reproduce is chosen with probability weighted by fitness. Specifically:
p_{i,e} = \frac{x_i}{N} \qquad (2.4)

p_{i,r} = \frac{x_i f_i}{\sum_j x_j f_j} \qquad (2.5)
where pi,e and pi,r are the probabilities that a cell of type i is chosen for elimination or reproduction,
respectively. Once the cells have been chosen, the population of the eliminated cell is decremented by one,
and that of the reproducing cell is incremented by one. Thus the total population remains fixed. In the
case that the eliminated and reproducing cells are of the same type, the system state does not change.
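For concreteness, one step of this birth-death process might be simulated as in the following sketch. It is illustrative only and not the author's implementation; the payoff matrix is that of Equation 2.1, and the selection pressures w are passed in as an argument.

import numpy as np

# Payoff matrix from Equation 2.1; rows and columns ordered (S, R1, R2).
M = np.array([[2.0, 2.8, 2.8],
              [1.5, 2.1, 2.1],
              [1.5, 2.1, 2.1]])

def birth_death_step(x, w, rng):
    # One step of the stochastic birth-death process of section 2.1.
    # x : length-3 integer array (x0, x1, x2) summing to N; w : selection pressures w_i in [0, 1].
    N = x.sum()
    F = M @ (x / N)                   # expected payoffs (Equation 2.2)
    f = 1.0 - w + w * F               # fitness (Equation 2.3)
    p_elim = x / N                    # uniform elimination probabilities (Equation 2.4)
    p_repr = x * f / np.sum(x * f)    # fitness-weighted reproduction probabilities (Equation 2.5)
    i = rng.choice(3, p=p_elim)       # cell type chosen for elimination
    j = rng.choice(3, p=p_repr)       # cell type chosen for reproduction
    x_new = x.copy()
    x_new[i] -= 1
    x_new[j] += 1
    return x_new

rng = np.random.default_rng(0)
x = np.array([34, 33, 33])
x = birth_death_step(x, w=np.ones(3), rng=rng)   # one step with no chemotherapy, w = (1, 1, 1)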
A few remarks should be made here. Firstly, the subpopulation values can only increase/decrease by
at most one cell at each time step. Thus in the context of a Markov process, each state has at most seven
neighboring states (achieved for all interior states of the simplex representing SN ). Moreover, since the
population values are updated only after choosing both the cell to be eliminated and to reproduce, these
choices are independent. Therefore, transition probabilities in this Markov process can be computed as
follows: suppose that state ⃗x1 is on the interior of the simplex and state ⃗x2 is such that from ⃗x1 to ⃗x2, the
value of xi decreases by one and the value of xj increases by one. Then the transition probability T(⃗x1, ⃗x2)
from state ⃗x1 to ⃗x2 is given by:
T(\vec{x}_1, \vec{x}_2) = p_{i,e}\, p_{j,r} \qquad (2.6)
                        = \frac{x_i}{N} \cdot \frac{x_j f_j}{\sum_l x_l f_l} \qquad (2.7)
In the case that ⃗x1 = ⃗x2,
T(\vec{x}_1, \vec{x}_1) = \sum_i p_{i,e}\, p_{i,r} \qquad (2.8)
                        = \sum_i \frac{x_i}{N} \cdot \frac{x_i f_i}{\sum_j x_j f_j} \qquad (2.9)
Secondly, note that if xi = 0 for some i ϵ {0, 1, 2}, (i.e. the state is on an edge of the simplex) then
from equation 2.5 the probability that a cell of type i reproduces is zero. This matches physical intuition:
the cell type i has become extinct, and the system evolves according to a two-strategy evolutionary game
between the remaining cell types. Similarly, if xi = N for some i ϵ {0, 1, 2} then xj = 0 for all j ̸= i. From
equations 2.4 and 2.5, a cell of type i will be chosen for both reproduction and elimination with probability
1. Thus the system has three fixed points, which occur when the population becomes saturated by a single
cell type (i.e. at vertices of the simplex). As a consequence, in contrast to interior states of the simplex
which have seven neighboring states, vertex states have only one neighboring state (themselves, which
they transition to with probability 1), and edge states have three.
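The transition probabilities above can likewise be evaluated directly. The helper below is an illustrative sketch (not the author's code); it assumes the fitness values f_i at the current state are supplied, and returns zero for states not reachable in a single step.

def transition_prob(x, x_next, f):
    # Transition probability T(x, x_next) for the birth-death process (Equations 2.6-2.9).
    # x, x_next : length-3 tuples of subpopulation counts summing to N; f : fitness values f_i at state x.
    N = sum(x)
    total_fitness = sum(xi * fi for xi, fi in zip(x, f))
    if tuple(x_next) == tuple(x):                        # self-transition (Equations 2.8-2.9)
        return sum((xi / N) * (xi * fi / total_fitness) for xi, fi in zip(x, f))
    diff = [b - a for a, b in zip(x, x_next)]
    if sorted(diff) != [-1, 0, 1]:                       # not a neighboring state: unreachable in one step
        return 0.0
    i = diff.index(-1)                                   # type whose population decreases (eliminated)
    j = diff.index(1)                                    # type whose population increases (reproduces)
    return (x[i] / N) * (x[j] * f[j] / total_fitness)    # Equation 2.7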
2.1.1 Chemotherapy
Thus far, chemotherapy application does not factor into the evolution of the system. Define ⃗c(t) =
(c1, c2) ϵ [0, 1]^2
to be the chemotherapy dose administered to the system. Note that this is a function
of time, t, and that the elements c1 and c2 correspond to the doses of drugs C1 and C2. Recall that the
intensity of selection ⃗w = (w_i)_{i=0}^{2} is involved in the computation of fitness via equation 2.3. To introduce
chemotherapy into the model, write ⃗w = ⃗w(⃗c), so that chemotoxin concentrations can alter the selection
pressure imposed on each cell type. Specifically,
• w0 = w − w min(1, c1 + c2)
• w1 = w − w c1
• w2 = w − w c2
where w is a scalar representing a universal selection pressure in the absence of chemotherapy (for the
purpose of this work, w = 1). This formulation of ⃗w(⃗c) encodes the resistance profile of each of the cell
types S, R1, and R2, so that chemotoxin concentration inversely affects the selection pressure of only
non-resistant cell species. Since the expected payoff of any cell type, regardless of the population state, is
at least 1.5 (by the payoff matrix 2.1 and equation 2.2), any decrease in the selection pressure (as a result of
increase in effective chemotherapy concentration) corresponds to a decrease in fitness. In the context of an
MDP (discussed in section 1.2) the space of allowable chemotherapy concentrations ⃗c is the action space
A. By allowing ⃗c to be a function of the system state ⃗x, the control policy π(⃗x, a) is given by the indicator
function π(⃗x, a) = I{a = ⃗c(⃗x)}.
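A small sketch of the mapping ⃗w(⃗c) described above follows, with w = 1 as in this work; the function and variable names are illustrative only, and the capped dose min(1, c1 + c2) on the sensitive cell type follows the bullets above.

def selection_pressure(c1, c2, w=1.0):
    # Selection pressures (w0, w1, w2) as a function of the chemotherapy doses (section 2.1.1).
    w0 = w - w * min(1.0, c1 + c2)   # S is sensitive to both drugs; the combined dose is capped at 1
    w1 = w - w * c1                  # R1 is resistant to C2, so only C1 lowers its selection pressure
    w2 = w - w * c2                  # R2 is resistant to C1, so only C2 lowers its selection pressure
    return (w0, w1, w2)

# The three allowable actions A = {(0, 0), (0, 1), (1, 0)} used later in section 2.2.3:
for c in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]:
    print(c, selection_pressure(*c))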
2.2 Q-Learning Implementation
The Q-learning algorithm is introduced in section 1.2.2. The current section elaborates on this topic,
presenting details on the use of this algorithm to develop optimal chemotherapy schedules.
2.2.1 Entropy
Recall that the objective of this work is to use reinforcement learning to design chemotherapy schedules
that avoid the evolution of chemo-resistance. Recall from the payoff matrix 2.1 that S, the sensitive cell
population, receives the greatest payoff and will therefore outcompete R1 and R2 cells in the absence of
chemotherapy. This implies that simply setting ⃗c(t) ≡ (0, 0) for all t would allow the sensitive population to eventually saturate the system, preventing evolution of chemo-resistance. In a practical setting,
however, chemotherapy is a powerful tool in reducing the overall tumor mass, and it is not feasible to discontinue treatment altogether. At the same time, if chemotherapy is administered too aggressively, the S
population will become extinct and the resistant populations will invade the system by competitive release
(see section 1.1.3). Instead, it is desirable to apply chemotherapy in such a way that all three subpopulations
S, R1, and R2 remain extant.
In the context of the model presented in section 2.1, this objective translates to keeping the system state
on the interior of the simplex representing the state-space SN , since the edges and vertices of the simplex
are terminal, and represent at least one extinct species. Moreover, since the model is stochastic and there
is an element of random drift, it is also desirable to stay far from the edges, so that "unlucky" events do not
inadvertently lead to species extinction. For example, suppose the state is ⃗x1 = (1, N/2, N/2−1) for some
even integer N. Although the S population is still extant, there is a 1/N chance that the single remaining S
type cell is chosen for elimination on the next step. If the same cell is not also chosen for reproduction, the
S population will go extinct. By comparison, if the state is (for example) ⃗x2 = (4, N/2−2, N/2−2), then
four "unlucky" events (S eliminations) are necessary for the subpopulation to go extinct, which has a much
lower probability of occurrence. Regardless of chemotherapy policy, ⃗x1 is far more likely to transition to
the edge of the simplex in the near term than ⃗x2. In this way, as the state ⃗x approaches the edge of the
simplex, the chemotherapy policy loses agency in preventing species extinction.
To abate this, the control target for the designed chemotherapy policy is chosen to be maximization
of species diversity, which pushes the system state towards the very center of the simplex. Quantitatively,
this is captured by a maximization of entropy, given by:
H(\vec{x}) = -\sum_i \frac{x_i}{N} \log \frac{x_i}{N} \qquad (2.10)
This notion of entropy is analogous to that of Shannon entropy in information theory, [9] and has more
recently demonstrated relevance in describing tumor complexity and heterogeneity. [43][26][41] It is perhaps best understood in the sense of statistical mechanics: as a logarithmic measure of disorder or uncertainty, or (in the context of this work) as a measure of the number of system states reasonably likely to become occupied.
A plot of the entropy over the simplex representing S100 is provided in Figure 2.1.
In summary, using the measure of entropy provided in equation 2.10 to quantify cell heterogeneity
allows for a redefinition of the objective of this work: to design chemotherapy schedules that can achieve
and maintain maximal entropy states.
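A sketch of the entropy computation in Equation 2.10 is given below; it is illustrative only, and the natural logarithm is assumed, which is consistent with the maximum and minimum values of 1.10 ≈ log 3 and 0.11 quoted in section 2.2.2.

import math

def entropy(x):
    # Population entropy H(x) = -sum_i (x_i / N) log(x_i / N) (Equation 2.10); empty subpopulations contribute 0.
    N = sum(x)
    return -sum((xi / N) * math.log(xi / N) for xi in x if xi > 0)

print(entropy((34, 33, 33)))   # near the maximum value log(3) ~ 1.10
print(entropy((98, 1, 1)))     # near the minimum value ~ 0.11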
Figure 2.1: Plot of H(⃗x) on the simplex describing the state space S100
2.2.2 Reward Structures
Recall that an MDP is defined by the 4-tuple (S, A, T, R). The first three elements have already been introduced as the state space, action space, and transition probability function, and their importance has been
made clear. The last element, the reward function R, is necessary to quantify the value Vπ(⃗x) of each
state and the quality Q(⃗x, a) of each state-action pair.
Except in specific contexts, choosing a reward function R(s, s′, a) can be a tricky endeavor, and is not
a closed problem. In the context of this work, there doesn't seem to be any strong reason to distinguish
rewards based on the action a, nor the origin state s. The more important variable is the destination state
s′. This greatly simplifies the problem, allowing the reward function to take the form of a lookup table that
assigns a scalar reward to the Q-learning agent at each time step based solely on the system state achieved.
This table is called a reward structure and can be represented as a heatmap on the simplex.
Since the objective function for this work is defined as maximizing entropy (per equation 2.10), an
intuitive option might be to assign rewards at each state equal to the entropy of this state. However, notice
that the entropy gradient of the state space (Figure 2.1) is very shallow near the center of the simplex,
and only drops off near the vertices. In fact, states directly adjacent to the midpoint of each edge achieve
entropy values as high as 0.74, whereas the maximum and minimum entropy values in the simplex are 1.10
and 0.11. Such states are extremely precarious and liable to progress to species extinction, and it is obvious
that the reward structure should reflect this by assigning them strict penalties (i.e. negative rewards),
rather than a moderate positive reward. Thus, it is not viable to directly use entropy values as a reward
structure. However, rewards may still be strongly based on entropy. The candidate reward structures that
were ultimately chosen are presented in Figure 2.2:
(a) Two-Region (b) Three-Region (c) Four-Region
Figure 2.2: Three reward structures, differentiated by an increasing number of reward regions. The
legends indicate numeric reward values assigned to each region.
These each assign a strong reward (+1) to the states that achieve an entropy value within 0.05% of the
maximum value, and a punishment (reward of −1) to states near the exterior. The structures differ in that
those presented in figures 2.2b and 2.2c add small buffer regions to the reward structure in figure 2.2a: the
tradeoff here is that a structure with no buffer region more strictly incentivizes staying at high-entropy states,
while structures with a small surrounding gradient provide the learning agent with a target that is easier
to find. To evaluate the importance of reward structure, chemotherapy policies were trained using all three
candidate reward structures.
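As a hedged sketch of how such reward structures might be encoded, the functions below implement a two-region and a three-region variant. The inner tolerance mirrors the 0.05% criterion stated above, but the buffer value of 0 and the buffer threshold are assumptions for illustration, not the exact regions shown in Figure 2.2.

import math

H_MAX = math.log(3)   # maximal achievable entropy for three subpopulations

def entropy(x):
    N = sum(x)
    return -sum((xi / N) * math.log(xi / N) for xi in x if xi > 0)

def two_region_reward(x, inner_tol=0.0005):
    # +1 within 0.05% of maximal entropy, -1 everywhere else (the strictest structure).
    return 1.0 if entropy(x) >= H_MAX * (1.0 - inner_tol) else -1.0

def three_region_reward(x, inner_tol=0.0005, buffer_H=0.95):
    # Adds a neutral buffer band between the target and the penalized exterior;
    # the buffer reward of 0 and the threshold buffer_H are assumed values.
    H = entropy(x)
    if H >= H_MAX * (1.0 - inner_tol):
        return 1.0
    return 0.0 if H >= buffer_H else -1.0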
2.2.3 Training
In the context of reinforcement learning, training is the process of identifying the quality function Q(⃗x, a).
The form of this function is another lookup table, assigning a numeric value to each allowable action a ϵ A
at every state ⃗x ϵ S.
The total population size N was fixed at 100. This was not an arbitrary choice. Small values for N
result in a small state space SN, in which stochastic simulations of the model presented in section 2.1 are
highly susceptible to random drift. In contrast, when N is large, simulations can better exhibit selection
dynamics of the population. In fact, in the limit that N → ∞, the evolutionary dynamics of this system
converge to a form of the deterministic replicator equation presented in section 1.1.5. In this work, the
stochastic behavior of the model is a feature of interest and exceptionally large state spaces do not capture
this behavior at all. On the other hand, the vulnerability of small state spaces to random drift makes them
difficult to control. The value N = 100 was chosen as a middle ground, where stochastic phenomena
are still observable but do not overwhelmingly dominate the evolutionary dynamics. Importantly, this
value for N was also sufficiently small to allow for reasonably quick computational convergence of the Q
estimate (typically within 24 hours on the author's personal computer). It is easy to see that the size of the
state space is |SN| = \sum_{i=1}^{N+1} i = (N+1)(N+2)/2, so that |S100| = 5151.
The action space A is defined as the space of allowable actions, i.e. the space of allowable chemotherapy
concentrations ⃗c ϵ [0, 1]^2. Each additional action included in A linearly increases the computational load
associated with learning the quality function, Q. For this reason, the action space is defined as A :=
{(0, 0),(0, 1),(1, 0)}, for a total of just three admissible actions. Recall that ⃗c := (c1, c2), where c1 and c2
are the concentrations of the drugs labelled C1 and C2, respectively. The selected action space therefore
features one action to favor each of the three cell subtypes S, R1, and R2, and thus was determined to be
the simplest action space that still provides sufficient agency in controlling the population’s evolutionary
trajectory.
Training proceeded according to the pseudocode in algorithm 1, where in each step of the inner for
loop, ⃗x′ is the state that the system transitions to as a result of taking action ⃗c from state ⃗x. Importantly,
this transition is simulated using the model provided in section 2.1. Consequently, there is a degree of
randomness in this transition; the same action ⃗c from the same state ⃗x will not always result in the same
new state ⃗x′. The reward received for achieving this state is given by r. From these observed values, the
estimate for Q is updated in line 7 according to the Bellman equation (see equation 1.16).
Algorithm 1 Q-Learning
1: procedure Training
2:     Q ← all zeros
3:     while not converged do
4:         for each allowable state ⃗x ϵ S do
5:             for each allowable action ⃗c ϵ A do
6:                 observe ⃗x′, r
7:                 Q(⃗x, ⃗c) ← Q(⃗x, ⃗c) + α(r + γ max_{⃗c∗} Q(⃗x′, ⃗c∗) − Q(⃗x, ⃗c))
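A hedged Python sketch of the synchronous sweep in Algorithm 1 is given below. The model_step and reward callables stand in for the model of section 2.1 and a reward structure from section 2.2.2; they are assumed interfaces, not the author's code, while the default sweep count, learning rate, and discount factor match the values reported in this section.

import numpy as np

ACTIONS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]   # allowable chemotherapy doses (c1, c2)

def train(states, model_step, reward, n_sweeps=100_000, alpha=0.05, gamma=0.9, seed=0):
    # Synchronous tabular Q-learning: one simulated trial of every (state, action) pair per sweep.
    # model_step(x, c, rng) -> x_next simulates one transition of the model in section 2.1;
    # reward(x_next) assigns the scalar reward of the chosen reward structure (section 2.2.2).
    rng = np.random.default_rng(seed)
    Q = {(x, c): 0.0 for x in states for c in ACTIONS}   # neutral zero initialization (line 2 of Algorithm 1)
    for _ in range(n_sweeps):
        for x in states:
            for c in ACTIONS:
                x_next = model_step(x, c, rng)                            # stochastic transition
                r = reward(x_next)                                        # reward depends only on the new state
                best_next = max(Q.get((x_next, c2), 0.0) for c2 in ACTIONS)
                Q[(x, c)] += alpha * (r + gamma * best_next - Q[(x, c)])  # Bellman update (Equation 1.16)
    return Q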
This is a fairly standard implementation of Q-learning, but there are some details of algorithm 1 worth
noting. Firstly, in line 2, the estimate of Q(⃗x, ⃗c) is initialized with the value zero for all ⃗x, ⃗c. Considering
that the reward at each state ranges from −1 to +1, this is a neutral estimate that confers no initial bias
to the learning agent. Secondly, line 3 controls the duration of the training period. While absolute convergence of Q-learning is guaranteed in the limit of infinite training cycles, [38][23] it was found through
early experimentation that the incremental benefit of additional training time begins to diminish severely
after the first few tens of thousands of training iterations. This is exemplified in Figure 2.3, which shows
the absolute change in Q-value over all state-action pairs versus the number of training cycles.
Figure 2.3: Net change in Q for all state-action pairs as a function of training cycles.
In the interest of mitigating the computational burden of the training process, training was terminated after
100,000 iterations of the outer while loop, which was found to be sufficiently long for generating effective
chemotherapy schedules (see section 3.1). A more accurate pseudocode representation of algorithm 1
might therefore read "for 100,000 steps do" in place of line 3, but the original is retained for the sake of
technical rigor.
With regard to the learning rate α and discount factor γ, relatively standard values of α = 0.05 and
γ = 0.9 were chosen. Preliminary testing showed that these values resulted in stable and relatively
rapid convergence of successful chemotherapy policies, and further adjustment to these parameters was
not explored.
As a final comment, note that under algorithm 1, within each iteration of the while loop the algorithm
performs one experimental trial of each action at every state. This is a synchronous approach to training,
as opposed to an asynchronous approach such as that described in algorithm 2.
Algorithm 2 Asynchronous Q-Learning
1: procedure Training
2:     Q ← all zeros
3:     while not converged do
4:         ⃗x ← random state in S
5:         while ⃗x is on simplex interior do
6:             choose ⃗c
7:             observe ⃗x′, r
8:             Q(⃗x, ⃗c) ← Q(⃗x, ⃗c) + α(r + γ max_{⃗c∗} Q(⃗x′, ⃗c∗) − Q(⃗x, ⃗c))
9:             ⃗x ← ⃗x′
In the asynchronous version, in each iteration of the outer while loop, ⃗x is randomly initialized and the
training algorithm follows its evolution, updating values of Q(⃗x, ⃗c) for each state-action pair only when
that pair is encountered. There are different ways to choose the action ⃗c in line 6, but most commonly
this is done by a function which selects the best (highest quality) action most of the time, but has a small
probability of choosing a random (uniformly selected) action from A to force occasional exploration. Naturally, the asynchronous approach better mirrors many real-life scenarios where stochastic realizations of
the entire process can be observed. It also has the advantage that a greater proportion of the overall training trials will be conducted from states that are commonly accessed. The converse to this is also true: the
Q estimate of state-action pairs that are rarely accessed is rarely updated, resulting in poor convergence
of the estimate at these locations. In the context of this work, where the goal is to keep the state away
from the edge of the simplex, this means that the estimates of Q at the most precarious states (those near
the edge of the simplex) converge poorly. Arguably, these are the states where choosing a correct action is
most important, and consequently it was decided that a synchronous approach to training is better suited
to this problem.
Chapter 3
Results
3.1 Learned Policies
The reinforcement learning strategies discussed above were used to produce three different policies for
administering chemotherapy, each corresponding to one of the reward structures presented in Figure 2.2.
The policies are presented in Figure 3.1.
(a) Two-Region (b) Three-Region (c) Four-Region
Figure 3.1: Three learned policies, each trained using a different reward structure but otherwise identical
training parameters.
Each policy prescribes a chemotherapy dose (i.e. an "action") based solely on the current state (or subpopulation balance) of the system. These policies are constructed by choosing, at each state ⃗x ϵ S, the action ⃗c
that maximizes the learned quality function, Q(⃗x, ⃗c). In the notation of section 1.2,
\pi(s, a) :=
\begin{cases}
1 & \text{if } a = \arg\max_{a^*} Q(s, a^*) \\
0 & \text{otherwise}
\end{cases}
where s and a correspond to the state ⃗x and action ⃗c and π(s, a) is the action policy. For the remainder
of this work, these policies will be addressed by the reward structure used to train them. For example, the
policy presented in Figure 3.1a will be called the "two-region policy," or similar.
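A small sketch of how such a policy map is read off the learned quality table follows; it is illustrative only, and Q is assumed to be a lookup table keyed by state-action pairs, as in the training sketch of section 2.2.3.

def greedy_policy(Q, states, actions):
    # pi(x) = argmax_c Q(x, c): the chemotherapy dose of highest learned quality at every state.
    return {x: max(actions, key=lambda c: Q[(x, c)]) for x in states}

# Example usage (names assumed): policy = greedy_policy(Q, states, ACTIONS); policy[(34, 33, 33)]
# returns the dose the learned policy prescribes at that state.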
An immediate observation is that the three learned policies are qualitatively similar: they divide the
simplex on S into roughly six distinct zones, where each corner seems split between two of the three
available actions, and the remaining zones opposing the corners each prescribe a single action. Figure 3.2
shows the three-region optimized policy from Figure 3.1b with a dividing line drawn across the N/3 axis
for each subpopulation, which helps to distinguish these six zones.
Recall section 2.1 on the definition of cell fitness and notice that each of the available actions clearly
favors one of the three subpopulations. ⃗c = (0, 1) is favorable to the R1 population since both the S and
R2 populations are vulnerable to C2. By symmetry of the resistant populations, ⃗c = (1, 0) is favorable
to the R2 population. Finally, ⃗c = (0, 0) favors the S population since in the absence of chemo, S is the
dominant cell type (see the payoff matrix in Equation 2.1). Another way to view this fact is that each cell
type can be targeted by two of the three available actions (those favoring the remaining two cell types).
With this in mind, there is a clear intuition to the six-zone division highlighted in Figure 3.2. In the zones
opposing each corner, only one of the subpopulations is at risk of extinction, and the prescribed action
is the one that favors the at-risk cell type. On the other hand, in each corner zone there is one cell type
that is close to fixation, and the appropriate response is to strike a balance between the two actions that
target this cell type. In a sense, the three actions can be thought to "push" the system state in a particular
Figure 3.2: The optimized 3-region policy can be roughly divided into six regions. A similar pattern is
present in the optimized 2- and 4- region policies.
direction on the simplex (see the legend of Figure 3.2), and the learned policies prescribe actions to drive
the state towards the simplex center.
While the learned policies shown in Figure 3.1 have the same general six-zone structure, they differ
in the bounds of each zone. As the underlying reward structure used to train the policy becomes stricter
(i.e. reduction in the number of reward regions, with a two-region structure being strictest), the size of the
zone opposite each corner shrinks. The following sections will investigate the general performance of the
six-zone structure while also considering the impact of these differences.
3.2 Most Probable Trajectories
The stated goal of this work is to develop chemotherapy schedules that can achieve and maintain high-entropy states, per Equation 2.10 and Figure 2.2.1. The highest entropy states were identified when constructing the two-region reward structure (Figure 2.2a). Hereafter, these states (those assigned a reward of
1 in Figure 2.2) will be termed the "target" or "target region," and the ability of any chemotherapy policy
to drive the state ⃗x towards this target will be used as a metric of its performance.
In this section, the locally most probable state trajectories are computed for each policy, for a variety
of initial conditions ⃗x0. These trajectories are computed as follows: given the initial state ⃗x0, an action ⃗c is chosen according to the policy π(⃗x, ⃗c) under investigation. The transition probability T⃗c(⃗x0, ⃗x′) to each neighboring state ⃗x′ is calculated, and the system is taken to transition to ⃗x1 = argmax_{⃗x′} T⃗c(⃗x0, ⃗x′), the state corresponding to the highest transition probability. From here the process repeats to determine ⃗x2, ⃗x3, . . . until the trajectory reaches some predetermined length.
A few remarks are in order. The process identifies what is dubbed the "locally" most probable trajectory
as it only seeks to maximize the one-step transition probabilities between states. It is possible that another
trajectory is globally more probable, but this becomes difficult to compute considering the size of the state
space S and the far greater size of the trajectory space. Moreover, when identifying neighboring states ⃗x′ as candidates for transition from state ⃗xt at time step t, the states ⃗xt and ⃗xt−1 are excluded. Recall from Section 2.1 that T⃗c(⃗xt, ⃗xt) is always strictly positive, and T⃗c(⃗xt, ⃗xt−1) is strictly positive if ⃗xt is on the simplex interior (i.e. not on an edge or vertex). Thus these transitions are possible under the model; however, allowing them will occasionally yield infinite loops in the computed trajectory. Specifically, if the transition ⃗xt+1 = ⃗xt is allowed, then ⃗xt′+1 = ⃗xt′ = ⃗xt for all t′ ≥ t; and similarly, if ⃗xt+1 = ⃗xt−1 is allowed, then ⃗xt+2 = ⃗xt (since ⃗xt is the most likely transition from ⃗xt−1 = ⃗xt+1), and the system oscillates between these two states indefinitely. This infinite loop is unhelpful for analysis, and is furthermore not an interesting feature of the evolutionary process, since the stochastic nature of this system will eventually push the state out of any infinite loop. Consequently, these states are excluded from consideration when computing the most probable trajectory.
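To make the procedure concrete, the following sketch implements the locally most probable trajectory under the exclusions just described. It is illustrative only: policy is a dictionary of the kind sketched earlier, while neighbors(x) and transition_prob(x, c, x_next) are assumed placeholder helpers standing in for the one-step structure and transition probabilities T⃗c of section 2.1.

def most_probable_path(x0, policy, neighbors, transition_prob, length, target=None):
    # Greedily follow the highest one-step transition probability, excluding
    # the current and previous states to avoid the infinite loops noted above.
    path = [x0]
    prev = None
    for _ in range(length):
        x = path[-1]
        c = policy[x]
        candidates = [xp for xp in neighbors(x) if xp != x and xp != prev]
        if not candidates:
            break
        x_next = max(candidates, key=lambda xp: transition_prob(x, c, xp))
        prev = x
        path.append(x_next)
        if target is not None and x_next in target:
            break  # paths are truncated upon reaching the target region
    return path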
Trajectories are also sometimes called "paths." Figure 3.3 shows the locally most probable paths taken
by each of the learned chemotherapy policies from each of the initial conditions ⃗x0 ϵ {(80, 10, 10), (10, 80, 10), (10, 10, 80), (10, 45, 45), (45, 10, 45), (45, 45, 10)}. The initial conditions (ICs) were chosen to provide a representative set of somewhat unfavorable prognoses: each of the six zones identified in the previous section is represented by one IC, and the ICs all feature at least one subpopulation within ten members of extinction.
Figure 3.3: Locally most likely trajectories from six different initial conditions, for each of the learned chemotherapy policies: (a) Two-Region; (b) Three-Region; (c) Four-Region. The paths are superimposed on the policy maps, which have been recolored for clarity.
The computed trajectories are all able to guide the system state towards the desired target (in the
figure, paths are truncated upon reaching the target for clarity). This is one metric of success. Some other
metrics are also relevant, such as the number of steps it takes to reach the target, and the total amount
of chemotherapy used along the way. A quicker response time is of course favorable: in a stochastic setting, a policy that reaches the target more quickly spends less time at lower-entropy states, and is therefore less liable to see extinction via random drift. Additionally, while chemotherapy is effective at combating cancer, it is also known to have significant deleterious effects on patient health [32]. A policy
that manages to minimize the required dosage of chemotherapy while still avoiding evolution of resistance
would greatly improve patient health and quality of life. While this is not something that is captured by
the implementation of the Q-learning algorithm, it is still a relevant metric of performance. Figure 3.4
indicates the length, in steps, of each of the trajectories shown in Figure 3.3, and Figure 3.5 shows the total
amount of chemotherapy used by the trajectories.
Figure 3.4: Distance of the most likely path from each of six initial conditions, for each of the three
learned chemotherapy policies.
The different policies were ultimately found to perform very similarly in all metrics. Considering the
qualitative similarities in their structure, this does not come as a surprise.
Figure 3.5: Total chemotherapy dosage used to reach target along the most likely paths from each of six
different initial conditions, and for each of the three learned chemotherapy policies.
3.2.1 Robustness
Another relevant aspect of these policies is their sensitivity to uncertainty in initial condition. The previously presented results assume that perfect knowledge of the cell subpopulation balance at the start of
treatment is possible, but this is unrealistic in a clinical setting. In this section, the ability of a prescribed
action sequence (derived from a learned policy) to guide the system from an uncertain state to the target
region is quantified. No significant differences were identified between the three learned policies, and so
for clarity the data presented here is restricted to that generated using the two-region policy.
For a given initial condition ⃗x0 and chemotherapy policy π, a locally most probable trajectory is
computed using the method described in the previous section. This trajectory is a sequence of states,
⃗x0, ⃗x1, ⃗x2, . . . By following this trajectory alongside the policy π, it is possible to extract an action sequence ⃗c0, ⃗c1, ⃗c2, . . . , where ⃗ct := ⃗c such that π(⃗xt, ⃗c) = 1. In other words, the action sequence (denoted ⃗a) is the sequence of actions that would be prescribed by the policy when following the identified most probable trajectory. As a measure of robustness, this action sequence is then applied again from a different initial condition ⃗x0′, to see how well the policy fares when the IC cannot be accurately determined before treatment. Specifically, from this new IC, another most probable trajectory is determined by taking the most probable next steps from each state in the path, starting at ⃗x0′, given that actions are taken according to the action sequence ⃗a. Figure 3.6 shows a handful of these trajectories. Here, the action sequence is computed from the initial condition ⃗x0 = (10, 10, 80), towards the bottom-right of the simplex, and is applied to a range of initial conditions throughout the simplex.
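This robustness test can be sketched as follows, reusing the placeholder helpers from the earlier trajectory sketch. The action sequence is read off the most probable path from ⃗x0 and then replayed step by step from a perturbed initial condition; again, this is an illustration of the procedure rather than the thesis code.

def action_sequence(path, policy):
    # The action prescribed by the policy at every state along the path
    # (no action is needed at the final state).
    return [policy[x] for x in path[:-1]]

def replay_actions(x0_perturbed, actions, neighbors, transition_prob):
    # Follow the fixed action sequence from a different initial condition,
    # again taking the most probable transition at each step.
    path = [x0_perturbed]
    prev = None
    for c in actions:
        x = path[-1]
        candidates = [xp for xp in neighbors(x) if xp != x and xp != prev]
        if not candidates:
            break
        x_next = max(candidates, key=lambda xp: transition_prob(x, c, xp))
        prev = x
        path.append(x_next)
    return path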
Of course, the action sequence performs best when applied from initial conditions ⃗x0′ that are at or close to ⃗x0. The path from ⃗x0′ = (25, 10, 65) (orange) actually manages to reach the target region, while the path from ⃗x0′ = (25, 10, 65) (blue) narrowly misses. In contrast, paths from completely different zones of the simplex of course miss the target completely. The data presented in Figure 3.7 quantifies this effect.
Each curve on this plot represents a different action sequence, computed from a different initial condition
⃗x0 = (x0,0, x1,0, x2,0). At each data point, the identified action sequence is applied from every nearby initial condition ⃗x0′ = (x0,0′, x1,0′, x2,0′) within a specified distance of the IC ⃗x0. Here, distance is computed as:

d(⃗x0, ⃗x0′) = max_i |xi,0′ − xi,0|     (3.1)
In other words, it is the maximum difference in the size of any individual cell subpopulation between
the two states. For each ⃗x0, this produces a series of new trajectories. Figure 3.7 indicates what proportion
of these trajectories were able to reach the target. In general, it was found that small deviations (< 3)
Figure 3.6: The most likely paths from each of six initial conditions achieved by following the action
sequence associated with the path from ⃗x0 = (10, 10, 80).
in initial condition did not affect the ability of the extrapolated path to reach the target. It should be noted that the red curve corresponding to the IC ⃗x0 = (45, 45, 10) is hidden behind the purple curve corresponding to ⃗x0 = (45, 10, 45). This is due to the symmetry between the R1 and R2 cell types (see Equation 2.1 and section 2.1.1) and the fact that the action sequences generated from these ICs uniformly prescribe the action ⃗c = (1, 0) or ⃗c = (0, 1) (see Figure ??). Interestingly, the policy is more robust in certain zones of the simplex than in others. At the initial condition ⃗x0 = (80, 10, 10) near the S corner of the simplex, the policy exhibits its least robust behavior (green line in Figure 3.7). In the opposing zone directly below the target, the policy appears most robust. Here, every trajectory starting within six units of the initial condition ⃗x0 = (10, 45, 45) is able to reach the target.
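The distance of Equation 3.1 and the neighborhoods it defines translate directly to code; a small sketch, with placeholder container names, is given below.

def ic_distance(x, xp):
    # Equation 3.1: maximum per-subpopulation difference between two states.
    return max(abs(a - b) for a, b in zip(x, xp))

def neighborhood(x0, states, radius):
    # All states within the given distance of x0 (the perturbed ICs tested above).
    return [s for s in states if ic_distance(x0, s) <= radius]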
Figure 3.7: Proportion of paths with initial conditions ⃗x0′ within the specified uncertainty range from the initial condition ⃗x0 that are able to reach the target region, given that the path follows the action sequence associated with the path starting at ⃗x0. Each color corresponds to a different IC ⃗x0.
This behavior is explained as follows: near the S corner of the simplex, both of the resistant populations
are at risk of extinction and the only actions prescribed by the chemotherapy policy are to apply C1 or
apply C2, each of which targets one of the endangered populations. "Mistakes" (i.e. mismatches between
the actions prescribed by the policy and the action sequence) made in this zone are far more likely to result
in extinction than a similar mistake made elsewhere in the simplex. In contrast, in the zone below the
target, both resistant populations are healthy and the most probable trajectory from ⃗x0 = (10, 45, 45)
yields an action sequence where chemo is never administered (see Figure 3.3). Mistakes in action are less
likely here since the chemotherapy policy mostly prescribes ⃗c = (0, 0) in this zone, and furthermore any
mistakes that do occur are less likely to result in extinction.
Despite a clear discrepancy in robustness among the different ICs ⃗x0, even in the worst case 40% of paths originating from within a 10-distance neighborhood of the IC from which the action sequence was derived were able to reach the target region. This neighborhood of radius 10 corresponds to 270 unique states. By comparison, the target region itself contains 42 states, with a maximum distance of 7 between any two points in this region (i.e. a radius of less than 4). This is an encouraging result for the policy, demonstrating that
even with a large degree of uncertainty in initial condition, a sizeable portion of most probable trajectories
will still reach the target.
3.3 Stochastic Trials
In this section numerical experiments are performed to investigate evolutionary dynamics associated with
the learned chemotherapy policies. While computing probable trajectories is useful for analysis, these simulations are able to expose stochastic phenomena that may otherwise be difficult to identify. To introduce
the simulations, Figure 3.8 shows two sample paths generated using the learned two-region chemotherapy
policy, each from a different initial condition.
Figure 3.8: Two sample stochastic paths, with initial conditions ⃗x = (1, 57, 42), left, and ⃗x = (68, 4, 28),
right. The initial condition for each path is the dark blue tail, and the end point is the yellow head of the
path.
These trajectories are simulated using pseudo-random number generation to select, at each time step, one
cell for reproduction and another for elimination, per the model introduced in section 2.1. This process
advances the trajectory one step in time. Unlike in the previous section, where certain transitions were
forbidden, no such restrictions are imposed on this simulation — transitions are governed only by the
model.
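As an illustration of a single update, the sketch below assumes a Moran-type rule in which the reproducing cell is drawn with probability proportional to subpopulation size times fitness and the eliminated cell is drawn uniformly at random from all cells (i.e. weighted by subpopulation size); the exact weighting and fitness function follow the model of section 2.1 and are represented here only by the placeholder fitness(x, c).

import random

def moran_step(x, c, fitness):
    # One stochastic step: one cell reproduces and one is eliminated.
    f = fitness(x, c)                                  # fitness vector (f0, f1, f2)
    birth_weights = [xi * fi for xi, fi in zip(x, f)]  # reproduction ~ size x fitness
    i = random.choices(range(3), weights=birth_weights)[0]
    j = random.choices(range(3), weights=list(x))[0]   # elimination weighted by abundance
    x_new = list(x)
    x_new[i] += 1
    x_new[j] -= 1
    return tuple(x_new)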
The paths in Figure 3.8 each contain 5000 steps. Both begin near the edge of the simplex, and traverse
the state space to reach the target. The first path meanders somewhat on its way to the target, but remains tightly within the target region upon arrival. The second path beelines for the target, but later experiences
some amount of downward drift before slowly correcting itself and returning to the target region. While
these are only two stochastic realizations of a trajectory, they demonstrate a key behavior that is not
captured by the analysis of the previous section.
In the previous section it was discussed that the three allowable chemotherapy doses each favor one cell type over the others. While this is true, the degree to which each cell type benefits under its favorable dose is not equal. To demonstrate this, consider the following three situations: (1) ⃗x = (10, 45, 45), and the action ⃗c = (0, 0) is chosen; (2) ⃗x = (45, 10, 45), and the action ⃗c = (0, 1) is chosen; and (3) ⃗x = (45, 45, 10), and the action ⃗c = (1, 0) is chosen. In each case, one cell type is at risk of extinction and the chosen action is the one that favors the endangered cell type. The situation appears symmetric; however, the fitness vector ⃗f = (f0, f1, f2) for each case can be computed (via the model presented in section 2.1) as (1)
⃗f = (2.72, 2.04, 2.04); (2) ⃗f = (1.0, 1.83, 1.0); and (3) ⃗f = (1.0, 1.0, 1.83). Thus the ratio of the fitness
of the favored cell type to that of the targeted cell types in each of the cases is (1) 1.33; (2) 1.83; and (3)
1.83. In case (1), the S type population is favored over its competitors by a slimmer margin than in cases
(2) and (3). It is still more likely for an S type cell to reproduce than it is for either of the other species,
but the difference in probabilities is less than that encountered in cases (2) and (3). This means that in
a stochastic setting, it is more difficult to drive the system state towards the S corner using the action ⃗c = (0, 0) than to either of the other corners by using the actions ⃗c = (0, 1) or ⃗c = (1, 0). Intuitively, the effect
of chemotherapy against sensitive cells is far stronger than the effect of natural selection against resistant
cells. This is not necessarily a natural phenomenon, but rather an interesting feature of the model, based
on its implementation. The consequence of this feature is that trajectories originating in the bottom half
of the simplex are more prone to meander than those in the upper half, and random drift is more likely to
result in extinction of the S population than either of the resistant populations.
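A quick arithmetic check of the quoted ratios, using only the fitness vectors reported above:

cases = {
    "(1) c=(0,0) at x=(10,45,45)": (2.72, 2.04),  # favored S vs. targeted R1/R2
    "(2) c=(0,1) at x=(45,10,45)": (1.83, 1.00),  # favored R1 vs. targeted S/R2
    "(3) c=(1,0) at x=(45,45,10)": (1.83, 1.00),  # favored R2 vs. targeted S/R1
}
for label, (favored, targeted) in cases.items():
    print(label, round(favored / targeted, 2))    # prints 1.33, 1.83, 1.83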
It should also be noted that in reproductions of the experiments showcased in Figure 3.8, the system state did occasionally traverse towards the edge of the simplex, resulting in species extinction before
reaching the target. While results such as the ones presented are promising examples of the learned policy
achieving its goal, no policy can guarantee avoidance of resistance in a stochastic setting.
3.3.1 Simplified Policies
A possible flaw in the learned chemotherapy policies is that the actions they prescribe frequently alternate from step to step. Though this is a highly simplified model that is far abstracted from any clinical setting, it is certainly true that no clinical treatment plan could achieve the fast action switches that these policies call for. Moreover, the discussion of robustness presented in Section 3.2.1 suggested that policy performance is less affected by uncertainty in the initial condition when a policy features large zones prescribing uniform actions. To this end, it was desirable to investigate the possibility of constructing
simplified chemotherapy policies. This section presents two such options.
Recall from Figure 3.2 that the learned policies can be roughly divided into six zones. In the zones
opposite each corner of the simplex, the policies largely prescribe a single action, while in the corner zones
the policies prescribe a balance between a pair of actions. Figure 3.9 provides an alternative, simplified
chemotherapy policy that seeks to capitalize on this observation.
Figure 3.9: A coarse policy constructed by dividing the simplex into six regions and assigning uniform
chemotherapeutic actions throughout each region.
This policy will be called the "constructed" policy. Its six zones are created by dividing the simplex along
the N/3 axis of each subpopulation, and a uniform action is prescribed in each zone. Notably, in the
corner zones the prescribed action is a compromise between the two actions chosen by the learned policy.
For example, in the zone closest to the S corner, the learned policy prescribes the actions ⃗c = (1, 0) and
⃗c = (0, 1) in an approximately even ratio. In the constructed policy, this zone is therefore assigned the
action ⃗c = (0.5, 0.5).
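A sketch of this constructed policy is given below, assuming the zones are identified by which subpopulations lie above N/3. The (0.5, 0.5) assignment at the S corner is the one stated above; the half-dose assignments at the R1 and R2 corners are inferred by symmetry and should be read as assumptions rather than a description of Figure 3.9.

def constructed_policy(x, N=100):
    # x = (s, r1, r2); zones determined by comparison with N/3.
    s, r1, r2 = x
    t = N / 3
    if s > t and r1 <= t and r2 <= t:
        return (0.5, 0.5)   # S corner: split the two actions that target S
    if r1 > t and s <= t and r2 <= t:
        return (0.5, 0.0)   # R1 corner (assumed): mix of (1, 0) and (0, 0)
    if r2 > t and s <= t and r1 <= t:
        return (0.0, 0.5)   # R2 corner (assumed): mix of (0, 1) and (0, 0)
    if s <= t:
        return (0.0, 0.0)   # zone opposite the S corner: favor the endangered S
    if r1 <= t:
        return (0.0, 1.0)   # zone opposite the R1 corner: favor R1
    return (1.0, 0.0)       # zone opposite the R2 corner: favor R2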
Another approach taken to developing a simplified policy is to homogenize the policy, as follows. At
each state ⃗x ϵ S, the neighborhood of radius 3 (with distance determined by Equation 3.1) centered on the
state ⃗x is identified. The value of the quality function Q(⃗x, ⃗c) is then averaged over every other state in
the neighborhood, for each action. This average is denoted Q̄(⃗x, ⃗c). The action associated with the state
is then chosen by computing the ratios:

c1 = Q̄(⃗x, (1, 0)) / Σ_{⃗c ϵ A} Q̄(⃗x, ⃗c)     (3.2)

c2 = Q̄(⃗x, (0, 1)) / Σ_{⃗c ϵ A} Q̄(⃗x, ⃗c)     (3.3)
Each ratio is then rounded to the nearest multiple of 0.5, and the state is assigned the action ⃗c =
(c1, c2). Intuitively, this assigns to each state an action that is a rough average of the actions prescribed by
the learned policy in that state’s neighborhood. Because the neighborhood does not change significantly
between a state and its immediate neighbors, this process results in a homogenization of the chemotherapy
policy over large regions of the simplex, so that chosen actions at neighboring states match one another more
often. This policy will be called the "homogenized" policy, and is shown in Figure 3.10.
Figure 3.10: Another policy constructed from the 2-region fully optimized policy by averaging Q(⃗x, ⃗c)
among the neighbors of each point ⃗x ϵ S.
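The homogenization step can be sketched as below, reusing the placeholder containers from the earlier sketches (Q keyed by (state, action) pairs, the ic_distance of Equation 3.1, and the three-element action set). It assumes the averaged quality values are positive so that the ratios of Equations 3.2 and 3.3 are well defined.

def homogenized_policy(Q, states, radius=3, actions=((0, 0), (1, 0), (0, 1))):
    policy = {}
    for x in states:
        nbrs = [s for s in states if s != x and ic_distance(x, s) <= radius]
        # Neighborhood-averaged quality, written Q-bar in the text.
        qbar = {a: sum(Q[(s, a)] for s in nbrs) / len(nbrs) for a in actions}
        total = sum(qbar.values())
        c1 = round(2 * qbar[(1, 0)] / total) / 2   # Equation 3.2, rounded to nearest 0.5
        c2 = round(2 * qbar[(0, 1)] / total) / 2   # Equation 3.3, rounded to nearest 0.5
        policy[x] = (c1, c2)
    return policy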
Though this policy is not as neat as the constructed policy, and is more complicated to define and
compute, it is directly based on the results of reinforcement learning. In contrast, the constructed policy is built somewhat artificially, based only on qualitative observations of the learned policies. In
the following section, these policies will be evaluated via stochastic trials to determine their effectiveness
relative to the learned policies.
3.3.2 Simulation Data
Before presenting data, it is necessary to introduce a new metric for robustness that will be investigated in
these stochastic trials. The metric presented previously in Section 3.2.1 does not directly translate well to
experimental paths, where trajectories are simulated over thousands of steps. Stochastic trials also include
random drift, which introduces another opportunity for uncertainty to arise.
In a clinical setting, the state of a tumor is often determined by conducting a biopsy. Biopsies can
be costly [40], invasive, and pose risks of complication [46], and so the procedure can be conducted only
infrequently. This means that it is not possible to have perfect information of the system state at all times,
which makes it difficult to accurately follow any of the learned chemotherapy policies.
The ability of the policy to perform well even without continuous knowledge of the system state is a
form of robustness. To quantify this, a hidden state and visible state are defined. The hidden state is the true
state of the system and evolves at each step according to the system dynamics and choice of action, in the
same manner as the paths presented in Figure 3.8. However, knowledge of this hidden state is inaccessible
for the purpose of selecting a chemotherapeutic action. The visible state is a running estimate of the hidden
state, used to choose actions at each time step. It is assumed that a "biopsy" can be taken at the start of
treatment, and at every k steps thereafter. A biopsy taken at time t is assumed to reveal the hidden state
⃗xt at that time, and the visible state is updated to match it. At times t′ ϵ {t + 1, t + 2, . . . , t + k − 1}, the hidden state is unknown. To estimate it at these times, the visible state is computed by following the most probable trajectory with initial condition ⃗xt out to time t′. Actions can then be chosen based on this estimated (visible) system state, until time t + k, at which point another biopsy is taken and the estimated state can once again be corrected to match the hidden state.
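A sketch of one such trial is given below, reusing the placeholder helpers from the earlier sketches (moran_step, most_probable_path, a policy dictionary, and the neighbors/transition_prob functions). The max_steps guard and the target/edge containers are assumptions made for the illustration, not details taken from this work.

def run_hidden_state_trial(x0, policy, fitness, neighbors, transition_prob,
                           k, max_steps, target, edge):
    hidden = x0
    for t in range(max_steps):
        if t % k == 0:
            # "Biopsy": re-anchor the estimate to the true (hidden) state and
            # recompute the most probable trajectory from it.
            estimate = most_probable_path(hidden, policy, neighbors,
                                          transition_prob, length=k)
        visible = estimate[min(t % k, len(estimate) - 1)]
        action = policy[visible]          # actions are chosen from the visible state
        hidden = moran_step(hidden, action, fitness)
        if hidden in target:
            return True                   # success: target region reached
        if edge(hidden):
            return False                  # failure: a subpopulation went extinct
    return False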
Experimental trials were performed to evaluate the ability of various chemotherapy policies to guide
the system state towards the target. Moreover, this ability was evaluated over several different update
intervals k, to determine the robustness of the policy as information becomes more sparse. For each policy
and update interval, 100 sets of 1000 numerical trials were performed, for a total of 100,000 simulated trajectories. The initial condition for each trial was selected using pseudo-random number generation, with the only restrictions being that the initial condition is neither on an edge or vertex of the simplex nor in the target region. Experiments were terminated upon reaching the target region (which counts as a
success) or reaching an edge of the simplex (which counts as a failure). Within each set, the proportion of
successful trials is computed. This proportion is then averaged across all 100 sets, and a sample standard
deviation is computed as a measure of uncertainty. The results of this experiment are presented in Figure
3.11.
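The aggregation over trials can be sketched as follows, again with placeholder containers; interior_states stands for the admissible random initial conditions, excluding simplex edges and the target region.

import random
from statistics import mean, stdev

def success_rate(policy, fitness, neighbors, transition_prob, k,
                 interior_states, target, edge,
                 n_sets=100, trials_per_set=1000, max_steps=5000):
    # Mean success proportion over n_sets sets of trials, with the sample
    # standard deviation across sets as the uncertainty measure.
    proportions = []
    for _ in range(n_sets):
        wins = sum(
            run_hidden_state_trial(random.choice(interior_states), policy, fitness,
                                   neighbors, transition_prob, k, max_steps,
                                   target, edge)
            for _ in range(trials_per_set)
        )
        proportions.append(wins / trials_per_set)
    return mean(proportions), stdev(proportions)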
Here, the performance of the constructed policy (Figure 3.9), homogenized policy (Figure 3.10), and three
optimized policies (Figure 3.1) can be compared. Notice that four additional policies were also investigated:
(1) "c1 only," which takes the action ⃗c = (1, 0) at all times; (2) "c2 only," which takes the action ⃗c = (0, 1)
at all times; (3) "no chemo," which takes the action ⃗c = (0, 0) at all times; and (4) "random action," which chooses, at each time step, one of the three available actions at random. Notably, the policies "c1 only" and "c2 only" are perfectly symmetric, and the yellow curve corresponding to the former is hidden behind
the pink curve associated with the latter. Of course, none of these are remotely acceptable chemotherapy
policies for any setting, and are included here purely to serve as benchmarks for the remaining policies.
Figure 3.11: Experimental probability of reaching the target region for each presented policy, alongside
some sample benchmarks.
Notably, at lower update intervals k, the three learned policies perform very similarly, with all policies
successfully guiding the state towards the target in more than 90% of experimental trials. As the update
interval increases, the four-region policy seems to outperform the three-region policy, which in turn beats
out the two-region. Curiously, this is the only result in which significant differences in the performance of the three policies are identified. In any case, even the worst of these policies achieves a roughly 40% success
rate at an update interval of 400 steps, which is long enough to traverse from one side of the simplex to
another many times over.
The simplified policies do not perform as favorably. When no update interval is present (i.e. with
perfect knowledge of the system state at all times), the constructed policy achieves a moderately high rate
of success, but sees a sharp drop-off in performance as the update interval increases, eventually being
beaten by the "random action" benchmark at intervals exceeding 100 steps. Still, if information on the cell
subpopulation balance can be easily attained at high frequency, and if the action switch rates associated
with the optimized policies are too high, the constructed policy may provide a reasonable approach to
chemotherapy scheduling. In other words, this policy is effective, but not robust. On the other hand, the
homogenized policy performs poorly regardless of the availability of information. It barely beats out the
best of the benchmark policies when perfect information is available, and actually performs worse than
both the "random action" and "no chemo" policies with even small windows of missing information. A
possible explanation for this is that the homogenized policy prescribes some amount of chemo at almost every state except for a small sliver of the simplex below the target; such a policy is too effective against the S population and likely to result in S extinction. This would also explain why the only policies that
perform worse are the benchmark policies that prescribe a full dose of chemotherapy at every step of the
process.
Chapter 4
Conclusions
An evolutionary game-based model was presented for simulating the evolutionary dynamics of a mixed-strategy cancer population. The model was used to generate three optimized chemotherapy policies using Q-learning, each trained using a different reward structure. The goal of these policies was to prevent competitive release of chemo-resistant cancer cells by leveraging the greater natural fitness of chemo-sensitive cells. To achieve this, chemotherapy policies should seek to drive the system towards high-entropy regions of the state space, corresponding to regions of high cell diversity, where species extinction is less likely to occur via random drift. Specifically, a subset of the state space corresponding to particularly high entropy values was designated as the target for all policies.
The optimized policies were found to be highly effective at driving the system state towards these high-entropy states. First, the locally most probable trajectories for a range of initial conditions were computed,
and all were found to guide the system to the target. This remained true even with small uncertainties
in the initial condition, though larger uncertainties yielded a decline in the proportion of paths achieving
these high entropy states. Stochastic trials similarly identified the effectiveness of the optimized policies,
with all policies achieving over a 90% success rate in guiding the evolution of the system towards the
target. In these trials, policies were also assessed in their performance when information on the system
state is sparse. Though their performance declined as information became less available, the policies were
still able to achieve a success rate in excess of 40% even when information became extremely limited.
Two simplified policies were also constructed: one based on a structural observation of the optimized
policies, and another generated by a homogenization of an optimized policy. The former performed well
with abundant information, achieving a success rate of over 70%, but saw a sharp decline in performance
as information became sparse. The other simplified policy was far less effective, regardless of information
availability.
With the performance and weaknesses of the optimized policies identified, there is much room to improve on this approach in the future. One immediate consideration would be to expand the action space A. Currently, only three actions are allowed, which limits the agency of any optimized policy. Actions
utilizing smaller doses of chemotherapy, or combining different drugs at once, may expand the potential
of this work. It may also be expedient to employ a more complicated reward structure for training that is
allowed to vary based on the previous state and action taken, rather than exclusively on the destination
state. By basing rewards on the previous state, it is possible to encourage the agent to traverse in a desirable
direction. Specifically, moving to a state ⃗x from a state ⃗x ′
that is at lower entropy than ⃗x could yield higher
rewards than moving to ⃗x from a state ⃗x ′′ that is at higher entropy. On the other hand, rewards can be
adjusted based on the action chosen, with higher rewards being assigned to actions that prescribe less
chemotherapy. In this way, the agent may learn not only to guide the system towards desirable states, but
to do so while minimizing the total amount of chemotherapy required.
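As a sketch only, such a reward adjustment might look like the following, where entropy(), base_reward(), and the two weights are placeholders chosen for illustration rather than values used in this work.

def shaped_reward(x_prev, action, x_next, entropy, base_reward,
                  entropy_bonus=0.1, chemo_penalty=0.05):
    r = base_reward(x_next)                    # the region-based reward of Figure 2.2
    if entropy(x_next) > entropy(x_prev):
        r += entropy_bonus                     # reward moving up the entropy gradient
    return r - chemo_penalty * sum(action)     # penalize the total dose administered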
While the underlying model greatly simplifies tumor evolutionary dynamics, Q-learning is a model-free reinforcement learning algorithm that can reasonably be applied to more complex models as they
become available, or even to experimental data, provided that they can be translated into a Markov decision
process. Ultimately, reinforcement learning proved to be an effective technique in achieving policies that
avoid evolution of resistance, and there is still significant opportunity for improvement on these results.
Bibliography
[1] Strasser A., O’Connor L., and Dixit V. M. “Apoptosis signaling”. In: Annu Rev Biochem. 69 (2000),
pp. 217–45. doi: 10.1146/annurev.biochem.69.1.217.
[2] C. S. O. Attolini and F. Michor. “Evolutionary Theory of Cancer”. In: Annals of the New York
Academy of Sciences 1168 (1 2009), pp. vii, 3–228.
[3] Ivana Bozic and Martin A. Nowak. “Resisting Resistance”. In: Annual Review of Cancer Biology
1 (2017), pp. 203–221. issn: 2472-3428. doi:
https://doi.org/10.1146/annurev-cancerbio-042716-094839.
[4] S. L. Brunton and J. N. Kutz. Data Driven Science and Engineering: Machine Learning, Dynamical
Systems, and Control. Second. Cambridge University Press, 2022. Chap. 11.
[5] K. T. Bush, A. Camins, and M. Farina. “Membrane Transporters as Mediators of Cisplatin Effects
and Side Effects”. In: Scientifica (2012).
[6] Farhan R. Chowdhury and Brandon L. Findlay. “Fitness Costs of Antibiotic Resistance Impede the
Evolution of Resistance to Other Antibiotics”. In: ACS Infectious Diseases 9.10 (2023). PMID:
37726252, pp. 1834–1845. doi: 10.1021/acsinfecdis.3c00156. eprint:
https://doi.org/10.1021/acsinfecdis.3c00156.
[7] Helena Coggan and Karen M. Page. “The role of evolutionary game theory in spatial and
non-spatial models of the survival of cooperation in cancer: a review”. In: Journal of The Royal
Society Interface 19.193 (2022), p. 20220346. doi: 10.1098/rsif.2022.0346. eprint:
https://royalsocietypublishing.org/doi/pdf/10.1098/rsif.2022.0346.
[8] J. Connell. “The influence of interspecific competition and other factors on the distribution of the
barnacle Chthamalus stellatus”. In: Ecology 42 (1961), pp. 710–743.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Ltd, 2005. isbn:
9780471748823. doi: 10.1002/047174882X.
[10] Centers for Disease Control and Prevention. WISQARS Leading Causes of Death Visualization Tool.
https://wisqars.cdc.gov/lcd/?o=LCD&y1=2021&y2=2021&ct=10&cc=ALL&g=00&s=0&r=0&ry=0&e=0&ar=lcd1age&at=groups&ag=lcd1age&a1=0&a2=199.
Accessed: 04/22/2024. 2021.
[11] R. J. Gillies, D. Verduzco, and R. A. Gatenby. “Evolutionary Dynamics of Carcinogenesis and Why
Targetted Therapy Does Not Work”. In: Nature Reviews Cancer 12 (7 2012), pp. 487–493.
[12] Hina Gul et al. “Fitness costs of resistance to insecticides in insects”. In: Frontiers in Physiology 14
(2023). issn: 1664-042X. doi: 10.3389/fphys.2023.1238111.
[13] D. Hanahan and R. Weinberg. “The Hallmarks of Cancer”. In: Cell 100 (1 2000), pp. 57–70.
[14] Sabine Hummert et al. “Evolutionary game theory: cells as players”. In: Mol. BioSyst. 10 (12 2014),
pp. 3044–3065. doi: 10.1039/C3MB70602H.
[15] Irina Kareva. “Different costs of therapeutic resistance in cancer: Short- and long-term impact of
population heterogeneity”. In: Mathematical Biosciences 352 (2022), p. 108891. issn: 0025-5564. doi:
https://doi.org/10.1016/j.mbs.2022.108891.
[16] Irina Kareva. “Prisoner’s Dilemma in Cancer Metabolism”. In: PLOS ONE 6.12 (Dec. 2011), pp. 1–9.
doi: 10.1371/journal.pone.0028576.
[17] Adi Kliot and Murad Ghanim. “Fitness costs associated with insecticide resistance”. In: Pest
Management Science 68.11 (2012), pp. 1431–1437. doi: https://doi.org/10.1002/ps.3395. eprint:
https://onlinelibrary.wiley.com/doi/pdf/10.1002/ps.3395.
[18] Tomlinson I. P. M. et al. “The mutation rate and cancer”. In: PNAS 93.25 (Nov. 1996),
pp. 14800–14803.
[19] Y. Ma and P. K. Newton. “Role of synergy and antagonism in designing multidrug adaptive
chemotherapy schedules”. In: Phys. Rev. E 103 (3 Mar. 2021), p. 032408. doi:
10.1103/PhysRevE.103.032408.
[20] J. Maynard Smith. “The theory of games and the evolution of animal conflicts”. In: Journal of
Theoretical Biology 47.1 (1974), pp. 209–221. issn: 0022-5193. doi:
https://doi.org/10.1016/0022-5193(74)90110-6.
[21] A. Mazzocca. “The Systemic–Evolutionary Theory of the Origin of Cancer (SETOC): A New
Interpretative Model of Cancer as a Complex Biological System”. In: International Journal of
Molecular Sciences (2019).
[22] Anita H. Melnyk, Alex Wong, and Rees Kassen. “The fitness costs of antibiotic resistance
mutations”. In: Evolutionary Applications 8.3 (2015), pp. 273–283. doi:
https://doi.org/10.1111/eva.12196. eprint:
https://onlinelibrary.wiley.com/doi/pdf/10.1111/eva.12196.
[23] S. F. Melo. Convergence of Q-learning: a simple proof. Lisbon, Portugal.
[24] L. Merlo, J. Pepper, and B. Reid. “Cancer as an Evolutionary and Ecological Process”. In: Nature
Reviews Cancer 6 (12 2006), pp. 924–935.
[25] P. K. Newton and Y. Ma. “Nonlinear adaptive control of competitive release and chemotherapeutic
resistance”. In: Phys. Rev. E 99 (2 Feb. 2019), p. 022404. doi: 10.1103/PhysRevE.99.022404.
[26] P. K. Newton et al. “Entropy, complexity and Markov diagrams for random walk cancer models”.
In: Scientific reports 4 (1 2014).
[27] M. A. Nowak. Evolutionary Dynamics: Exploring the Equations of Life. Belknap Press: An Imprint of
Harvard University Press, 2006.
[28] Martin A. Nowak and Karl Sigmund. “Evolutionary Dynamics of Biological Games”. In: Science
303.5659 (2004), pp. 793–799. doi: 10.1126/science.1093411. eprint:
https://www.science.org/doi/pdf/10.1126/science.1093411.
[29] A. Ramos, S. Sadeghi, and Tabatabaeian H. “Battling Chemoresistance in Cancer: Root Causes and
Strategies to Uproot Them”. In: International Journal of Molecular Science 22.17 (2021).
[30] N. Rhind and P. Russel. “Signaling pathways that regulate cell division”. In: Cold Spring Harb
Perspect Biol. (Oct. 2012).
[31] J. H. Saltzer, D. P. Reed, and D. D. Clark. “End-to-End Arguments in System Design”. In: ACM
Trans. Comput. Syst. 2.4 (Nov. 1984), pp. 277–288. issn: 0734-2071. doi: 10.1145/357401.357402.
[32] S. Singh et al. “Pattern of Adverse Drug Reactions to Anticancer Drugs: A Quantitative and
Qualitative Analysis.” In: Indian Journal of Medical and Paediatrical Oncology 38 (2 2017).
[33] J. M. Smith and G. R. Price. “The Logic of Animal Conflict”. In: Nature 246 (1973), pp. 15–18.
[34] John Maynard Smith. “Evolutionary game theory”. In: Physica D: Nonlinear Phenomena 22.1 (1986).
Proceedings of the Fifth Annual International Conference, pp. 43–49. issn: 0167-2789. doi:
https://doi.org/10.1016/0167-2789(86)90232-0.
[35] Kateřina Staňková et al. “Optimizing Cancer Treatment Using Game Theory”. In: JAMA Oncology
5 (Aug. 2018). doi: 10.1001/jamaoncol.2018.3395.
[36] Gergely Szakács et al. “Targeting the Achilles Heel of Multidrug-Resistant Cancer by Exploiting
the Fitness Cost of Resistance”. In: Chemical Reviews 114.11 (2014). PMID: 24758331, pp. 5753–5774.
doi: 10.1021/cr4006236. eprint: https://doi.org/10.1021/cr4006236.
[37] Peter D. Taylor and Leo B. Jonker. “Evolutionary stable strategies and game dynamics”. In:
Mathematical Biosciences 40.1 (1978), pp. 145–156. issn: 0025-5564. doi:
https://doi.org/10.1016/0025-5564(78)90077-9.
[38] C. Watkins and P. Dayan. “Q-Learning”. In: Machine Learning 8 (3 1992), pp. 279–292.
[39] C. J. C. H. Watkins. “Learning from Delayed Rewards”. PhD thesis. University of Cambridge, 1989.
[40] A. B. Weiner et al. “The Cost of Prostate Biopsies and their Complications: A Summary of Data on
All Medicare Fee-for-Service Patients over 2 Years”. In: Urology Practice 7 (2 2019).
[41] J. West et al. “An evolutionary model of tumor cell kinetics and the emergence of molecular heterogeneity driving Gompertzian growth”. In: SIAM Review 58.4 (2016).
[42] J. West et al. “The prisoner’s dilemma as a cancer model”. In: Convergent science physical oncology
(2016).
[43] Jeffrey West and Paul K. Newton. “Chemotherapeutic Dose Scheduling Based on Tumor Growth
Rates Provides a Case for Low-Dose Metronomic High-Entropy Therapies”. In: Cancer Research
77.23 (Nov. 2017), pp. 6717–6728. issn: 0008-5472. doi: 10.1158/0008-5472.CAN-17-1120. eprint:
https://aacrjournals.org/cancerres/article-pdf/77/23/6717/2761043/6717.pdf.
[44] Jeffrey West et al. “Towards Multidrug Adaptive Therapy”. In: Cancer Research 80.7 (Apr. 2020),
pp. 1578–1589. issn: 0008-5472. doi: 10.1158/0008-5472.CAN-19-2669. eprint:
https://aacrjournals.org/cancerres/article-pdf/80/7/1578/2801677/1578.pdf.
[45] Benjamin Wölfl et al. “The Contribution of Evolutionary Game Theory to Understanding and
Treating Cancer”. In: Dynamic Games and Applications 12 (Aug. 2021). doi:
10.1007/s13235-021-00397-w.
[46] Y. Zhang et al. “Biopsy frequency and complications among lung cancer patients in the United
States”. In: Lung Cancer Management 9 (4 2020).
Abstract
While chemotherapy is an effective tool for combating cancer, it has also been known to promote the evolution of chemo-resistance within a tumor by the mechanism of competitive release of chemo-resistant cancer phenotypes. In this work, a reinforcement-learning approach is used to develop optimized chemotherapy schedules to control the coevolution of three cancer cell phenotypes. A stochastic, discrete-time evolutionary game is established as a model upon which to train the reinforcement learning algorithm. This model includes two chemotoxins, C₁ and C₂, and three coevolving tumor cell populations: S, sensitive to both C₁ and C₂; R₁, resistant to C₂ but sensitive to C₁; and R₂, resistant to C₁ but sensitive to C₂. In the absence of chemotherapy, a prisoner’s dilemma game is played between the S population and each of the resistant populations, establishing a "cost of resistance." Learned policies are evaluated by examination of the evolutionary trajectories they promote, as well as by probabilistic experiments to determine their effectiveness in avoiding species extinction (which would ultimately result in fixation of a chemo-resistant phenotype). It was found that the learned policies were able to achieve species diversity by leveraging the natural dominance of the S subpopulation to out-compete the resistant populations, while administering chemotherapy when necessary to prevent fixation of any individual cell type.