Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
The evolution of gene regulatory networks
(USC Thesis Other)
The evolution of gene regulatory networks
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
The Evolution of Gene Regulatory Networks
by
Joshua Seth Schiman
A dissertation presented to the
faculty of the usc graduate school
for the degree of
Doctor of Philosophy
in Molecular Biology
December 2018
Department of Molecular & Computational Biology
University of Southern California
Los Angeles, California
Acknowledgements
The research presented in this dissertation would not have been possible without the guidance and support of
my advisors, Professors Peter Ralph and Sergey Nuzhdin. I want to thank Peter for being so generous with
his time, advice, and technical help, and for his continuous encouragement and positive attitude. He fostered
a very relaxed, open-minded and intellectually stimulating environment, in which the pursuit of new ideas
and techniques was encouraged. I also want to thank Sergey for encouraging any and all interesting ideas
to be discussed and pursued, for sharing his deep knowledge of the literature, and for always encouraging
collaborations. One such collaboration lead me to Saint Petersburg, Russia and yielded my rst scientic
publication. I would like to thank Maria Samsonova's laboratory for hosting me in Saint Petersburg, as
well as my colleagues there, Aleksandra Chertkova, Konstantin Kozlov and Vitaly Gursky. I would like to
specically thank Vitaly for managing our project and for making me feel at home during my visit. I also
would like to thank the members of my committee, Matt Dean and Ian Ehrenreich. I have appreciated their
guidance and insight along the way, and their enthusiasm for science has been much appreciated.
During the course of my study I also enjoyed interacting with the other students here and would like to
thank my ocemate, Erik Lundgren, for all of the fun (mostly o topic) discussions, and for quickly catching
some of my mathematical errors. I also want to thank Hossein Asgharian for the fun discussions and debates.
Lastly, I would like to thank my family and friends. I want to thank my father for encouraging my interest
in science and intellectual pursuits from a very young age. Whenever we speak, either in person or on the
phone, our conversations frequently end up focusing on something scientic. I also want to thank my mother
and father for all the trips to the many museums, zoos, aquariums and the national parks, which in part,
sparked my interest in the natural world. I want to thank my sisters, Caryn, Julie and Rebecca, for their
constant support and encouragement. I want to specically thank my twin sister, Rebecca, and my friends:
Dan, Daniel, Justin, Erika, Adam, Renae and Evan; their support and friendship made my experience in
Los Angeles about as fun as possible.
1
Contents
1 Introduction 4
I The Gap Gene Regulatory Network 9
2 In silico evolution of the Drosophila gap gene regulatory sequence under elevated mu-
tational pressure 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Regulatory sequences and binding sites for gap genes . . . . . . . . . . . . . . . . . . . 13
2.2.2 Gene expression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Evolutionary algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Transcription factor binding site tracking . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Sequence simplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Variability of binding energy for dierent sets of sites and sequence annotations . . . . 18
2.3.3 Correlations with rms-scores of individuals . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Mutual correlations between sets of sites and sequence segments . . . . . . . . . . . . 21
2.3.5 Core TFBSs and their vicinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.A Model equation and parameter values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.B Supplementary gures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
II Evolutionary System Theory 38
3 System drift and speciation 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Equivalent gene networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Hybrid incompatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.3 Haldane's rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 System drift and the accumulation of incompatibilities . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 System drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.2 Genetic variation in empirical regulatory systems . . . . . . . . . . . . . . . . . . . . . 58
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.1 Fisher's geometric model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4.2 The origin of species not by means of natural selection? . . . . . . . . . . . . . . . . . 60
3.4.3 Nonlinearity and model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2
3.A Genetic drift with a multivariate trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.A.1 Gaussian load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.B Evolution of segregation covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Chance and necessity in the evolution of gene network complexity 68
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.1 System theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.2 Minimal system realizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.3 Random system generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.4 Evolutionary simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 Moonlighting among systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Ratchet Propensity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.3 Evolutionary dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Bibliography 86
3
Chapter 1
Introduction
Evolutionary biology emerged out of inquiries into geology, zoology, botany and taxonomy. In combining
these various and interrelated concepts and methods with extraordinarily detailed observations in natural
history, the theory of evolution by means of natural selection was articulated [Darwin, 1859]. This revolution,
however, occurred before major developments in genetics were well known, and long before the molecular
revolution in biology [Watson and Crick, 1953]. The types of questions and processes studied by evolutionary
biologists have changed in accordance with these developments. The Modern Synthesis merged concepts
from Mendelian Genetics with Darwinism, and decades later, the molecular revolution directed focus to
questions about sequence evolution, leading to the development of the Neutral Theory of Molecular Evolution
[Kimura et al., 1968, King and Jukes, 1969] { a process that either did not seem apparent or perhaps
relevant previously. Since then, the tools and concepts of biology have further expanded, changing the
discourse of evolutionary biology to include studies into development, whole genomes, as well as sophisticated
mathematical and computational methods.
Advances in molecular and developmental biology have lead to the understanding that many phenotypic
traits are the product of many interacting genes, often in the form of gene regulatory networks. As such,
much eort has been expended to understand both the function and evolution of such networks or otherwise
to elucidate the genotype-phenotype map. Asking about the function or evolution of a gene, removed from
its context, in the functioning of such a network, is a fruitful exercise, however, much information and
understanding is lost, as a gene's context can signicantly in
uence its behavior [Lewontin, 1974]. This view
motivates the discipline of systems biology, which intends to study genes in context, or biological systems
(including gene regulatory networks). This discipline also typically involves the application of quantitative
4
methods, sometimes absent or underdeveloped in other approaches, to model or simulate system dynamics.
In this dissertation I explore topics in the domain of evolutionary systems biology, in the rst part (I) by
studying the evolution of a computational model of a biological system, and in the second part (II) by
developing and applying a theoretical framework of system evolution to study species formation, and next
to study the evolution of system topology and complexity.
First, in Chapter 2: In silico evolution of the Drosophila gap gene regulatory sequence
under elevated mutational pressure, I study the evolution of a computational gene regulatory network
model. The biological mechanisms underlying pattern formation have fascinated biologists for decades.
Here, with my colleagues at the University of Southern California and at Saint Petersburg State Polytechnic
University, Aleksandra Chertkova, Sergey Nuzhdin, Konstantin Kozlov, Maria Samsonova and Vitaly Gursky,
I model the spatiotemporal mRNA and protein expression dynamics of the Drosophila melanogaster gap gene
regulatory network using a hybrid reaction-diusion/thermodynamic model. This model incorporates eight
genes (which function as transcription factors), Hunchback, Knirps, Kruppel, Giant, Caudal, Huckebein,
Tailless and Bicoid, as well as the experimentally determined regulatory sequences for the former four genes.
The four regulatory regions can harbor transcription factor binding sites for all eight modeled genes.
Here I simulated the evolution of a population, whose individuals' phenotypes were determined by the gap
gene network expression dynamics, under elevated mutational pressure { by randomly changing nucleotides
in the gap gene regulatory sequences. These mutations had the potential to disrupt, add or modify the tran-
scription factor binding dynamics, and thus could change the gap gene expression patterning and the course
of embryogenesis. The tness of individuals in this simulation was judged by comparing the spatiotemporal
gap gene expression dynamics with a pattern derived from experimental expression data. After 3; 500 gener-
ations of evolution, we observed that regulatory sequences, in response to the elevated mutational pressure,
reorganized such that on average, there were fewer, albeit stronger transcription factor binding sites present
when compared to the initial regulatory sequences. This was likely to reduce its mutational target size.
We also noted frequent transcription factor binding site turnover: fewer than 10% of the initial binding
sites were maintained throughout the entire simulation. These evolutionary constant, or core, sites demon-
strated an interesting characteristic: they have increased functional importance as assessed under wild-type
conditions and their binding energy distributions are highly conserved. In addition to this, transcription fac-
tor binding sites within close proximity of these core sites exhibited increased longevity, re
ecting functional
regulatory interactions with core sites.
In Chapter 3: System drift and speciation, Peter Ralph and I develop a theoretical framework to
5
study biological systems and apply it to study the formation of new species by a non-adaptive mechanism.
The question of speciation is perhaps the oldest question in evolutionary biology, and the basis for the title of
Darwin's On the Origin of Species. Despite this, the mechanisms involved, have been debated considerably
{ there is disagreement as to whether these mechanisms are predominantly neutral versus selective, and if
speciation itself is only a byproduct of adaptation [Orr et al., 2004].
Within the framework presented here, a biological system is viewed as a \black box" whose output is
its phenotype, which is the product of its internal state, or kryptotype, and an environmental input. I
study the evolution of an optimally adapted population under stable conditions, that is where phenotype
is under stabilizing selection, and selective pressures remain constant. Conservation of phenotype through
evolutionary time does not necessitate the conservation of the underlying mechanism, as many distinct
(and mutationally connected) molecular pathways can realize identical phenotypes. To describe this process
mathematically, I apply tools from system theory [Kalman et al., 1969, Zadeh and Deoser, 1976]: the
molecular machinery underlying the phenotype is described as a linear dynamical system, in which genes
can interact via mutationally pliable strengths. The environment provides a time-varying input to this
system, and the system's output is its phenotype { that is the dynamics upon which natural selection acts.
In this context, I give an analytical description of the set of all mechanisms: the set of all linear molecular
pathways that produce identical phenotypes, and in so doing, give an explicit model of phenogenetic [Weiss
and Fullerton, 2000] and developmental system [True and Haag, 2001] drift. In the minimal case { that is
the case in which there are no more and no fewer molecules interacting than necessary to produce a given
phenotype { this set is given by a simple change of coordinates. If gene networks are not minimal, as I
suspect is often the case (I further examine this in Chapter 4), the set of all possible pathway architectures
is elegantly given by the Kalman decomposition [Kalman, 1963], a classic result from system theory. These
results demonstrate that selectively neutral directions in genotype space are common.
Dierent pathway architectures, despite being phenotypically indistinguishable, may be incompatible with
one another when brought together in a hybrid organism. As such, system drift may result in reproductive
incompatibility between geographically isolated populations, even in the absence of any sort of adaptive,
selective, or environmental change. Using my model, I show that hybrid incompatibility can occur solely due
to system drift, and thatF
1
hybrid phenotypes diverge from their parents at a quadratic rate with respect to
the genetic distance separating their parents. Furthermore I show that F
2
phenotypes diverge much faster;
at a linear rate with respect to genetic distance (note that linear rates are initially larger than quadratic
rates). This nding is consistent with the phenomenon known as hybrid breakdown, often observed in the
6
early stages of speciation.
Next I show that Haldane's rule [Haldane, 1922] is a natural consequence of my model. Heterogametic
(e.g., Drosophila or mammalian males and female birds) F
1
hybrids, if a gene network is composed of genes
distributed among both the sex and autosomal chromosomes, behave eectively like F
2
crosses, and thus
will typically have much lower tnesses than their homogametic F
1
siblings. This mechanism appears to
be novel and distinct from the \dominance theory" [Turelli and Orr, 1995], which is frequently invoked to
explain Haldane's rule.
Motivated by our results from system theory, Peter Ralph and I next introduce a more general quantitative
genetics model to estimate the rate and therefore plausibility of speciation via system drift. We show that
independent populations should undergo system drift at a rate proportional to their segregating regulatory
variation divided by the eective population size (N
e
). We nd that under reasonable biological parameter
schemes, system drift can lead to speciation in less thanN
e
generations { and possibly even much faster. This
suggests that system drift { that is, the adaptively neutral changes to molecular pathways underpinning a
trait { can occur at a rate fast enough to contribute to the formation of new species. This may be surprising
because hybrid inviability appears as a consequence of recombining dierent, yet functionally equivalent,
mechanisms, and since species are often dened by their unique adaptations or morphologies.
In Chapter 4: Chance and necessity in the evolution of gene network complexity, I further
apply system theory to study the evolution of gene network composition and architecture. In the previous
chapter I showed how dierent network architectures can produce equivalent phenotypes and alluded to the
possibility of these architectures being of dierent sizes. Here, Peter Ralph and I look more closely at the
selective and mutational pressures that contribute to gene network size, and ask under which circumstances
will networks grow to be unnecessarily complex or shrink to be highly ecient, streamlined structures.
To do this I simulate the evolution of a population under stabilizing selection for phenotype, and examine
changes in gene regulatory network architecture through time, under a variety of parameter schemes. I see
that, within a specic parameter range, due to the de novo addition of genes, networks can grow dramatically
in size (and complexity), even under stabilizing selection. This dramatic network growth can have serious
consequences, leading to a mutational meltdown and extinction in asexual populations. This process can
be strongly curtailed when there is strong selection against mutational target size, or if the rate of gene
additions is signicantly smaller than the rate of deletions.
I also show that initially independent (or orthogonal) networks are likely to become intertwined over
evolutionary time due to the action of system drift { this network tangling is distinct from the de novo
7
addition of genes, and likely operates across a broader range of molecular and population parameter regimes.
Similarly, I demonstrate the expansion of networks, not by the addition of new genes, but by recruiting from a
xed set of mutationally available genes. Lastly, I point out that the apparent complexity of gene regulatory
networks, while possibly the consequence of a non-adaptive ratchet, in some cases could be the product of
reductive selection, whereby systems can share components/kryptotypes and reduce their composite size,
in the face of strong selection against mutational target size. Reductive selection, like constructive neutral
evolution [Stoltzfus, 1999], may entail some similar consequences, such as irreversible contingency [Szathm ary
and Smith, 1995], the inability to break a composite system into independently functional modules.
8
Part I
The Gap Gene Regulatory Network
9
Chapter 2
In silico evolution of the Drosophila
gap gene regulatory sequence under
elevated mutational pressure
Abstract
Background Cis-regulatory sequences are often composed of many low-anity transcription factor bind-
ing sites. Determining the evolutionary and functional importance of regulatory sequence composition is
impeded without a detailed knowledge of the genotype-phenotype map.
Results We simulate the evolution of regulatory sequences involved in Drosophila melanogaster embryo
segmentation during early development. Natural selection evaluates gene expression dynamics produced
by a computational model of the developmental network. We observe a dramatic decrease in the total
number of transcription factor binding sites through the course of evolution. Despite a decrease in average
sequence binding energies through time, the regulatory sequences tend towards organisations containing
increased high anity transcription factor binding sites. Additionally, the binding energies of separate
sequence segments demonstrate ubiquitous mutual correlations through time. Fewer than 10% of initial
TFBSs are maintained throughout the entire simulation, deemed `core' sites. These sites have increased
functional importance as assessed under wild-type conditions and their binding energy distributions are
highly conserved. Furthermore, TFBSs within close proximity of core sites exhibit increased longevity,
re
ecting functional regulatory interactions with core sites.
Conclusion In response to elevated mutational pressure, evolution tends to sample regulatory sequence
organisations with fewer, albeit on average, stronger functional transcription factor binding sites. These
organisations are also shaped by the regulatory interactions among core binding sites with sites in their local
vicinity.
This chapter has been published in BMC Evolutionary Biology:
AA Chertkova, JS Schiman, SV Nuzhdin, KN Kozlov, MG Samsonova, and VV Gursky. In silico evolution of the Drosophila
gap gene regulatory sequence under elevated mutational pressure. BMC Evolutionary Biology, 17(1):4, 2017. ISSN 1471-2148.
doi: 10.1186/s12862-016-0866-y. URL http://dx.doi.org/10.1186/s12862-016-0866-y.
Back to the Table of Contents
10
2.1 Introduction
Historically, the study of mathematical evolution was practiced as the study of the changes in gene frequen-
cies, as a consequence of neutral and selective evolutionary forces. For the sake of simplicity, many of these
early population genetic models analyzed these processes without considering the complications imposed by
including genotype-phenotype mappings, such as descriptions of the genetic regulatory networks (GRNs) in-
volved in development. Over the past several decades models of gene regulatory networks (of varying degrees
of realism) have been developed and applied to ll this gap in evolutionary research [Wagner, 1994, Guti errez
and Maere, 2014, Salazar-Ciudad and Mar n-Riera, 2013, Crombach et al., 2016a]. Studies lacking highly
detailed computational descriptions of relevant GRNs typically draw inferences based solely on sequence, and
the direct properties thereof, potentially missing subtleties that can only be deduced from a systems-level
approach. For instance, previous work suggests that the binding anity of a transcription factor binding
site (TFBS) only weakly predicts its phenotypic importance [Kozlov et al., 2014, 2015a, Parker et al., 2011,
Ramos and Barolo, 2013], contradicting the naive notion that binding site strength strongly predicts selec-
tive importance. Consistent with this view, we study the evolutionary dynamics of biological regulatory
sequences employing a systems level approach. We simulate sequence evolution using an experimentally
derived, computational model of a developmental network and genotype-phenotype mapping [Kozlov et al.,
2015a].
We simulate expression of four segmentation genes during early fruit
y development using a hybrid
reaction-diusion and thermodynamic computational model, t to empirical expression patterns and to
wild-type regulatory sequences. Given a sequence, TFBSs and their respective anities are assigned, and
gene product concentrations (mRNA and proteins) are calculated [Kozlov et al., 2015a]. Both reaction-
diusion and 'gene-circuit' modeling approaches have been successfully applied to study gene network and
enhancer evolution in Drosophila development [Samee and Sinha, 2013, He et al., 2012, Duque et al., 2014,
Kim et al., 2013, Martinez et al., 2014, Crombach et al., 2014, 2016a].
The cumulative binding energy, and thus the net impact, a regulatory sequence imposes on its target
gene is a complex function of, minimally: DNA accessibility, TFBS presence and quantity, and transcription
factor/TFBS anity. As such, many dierent combinations of these variables, in principle, can precipitate
similar, if not identical, regulatory eects. However, as the regulatory sequences are a consequence of gradual,
neutral and selective evolutionary changes, subsets of these functionally-equivalent schemes may be more and
less likely to emerge and more and less stable through evolutionary time, given specic historical, population,
and mutational conditions. Therefore, the sets of acceptable regulatory schemes connected by either neutral
11
or compensatory evolutionary changes are of particular interest to evolutionary biologists.
It is well documented and experimentally demonstrated that there is tremendous regulatory sequence
variation and divergence among even closely related species, despite conservation of expression patterns and
regulatory dynamics [Paris et al., 2013, Hare et al., 2008a, Crocker et al., 2015, Bradley et al., 2010, He
et al., 2011]. Presumably TFBS turnover and sequence divergence occurs mostly as populations traverse the
set of equivalent regulatory schemes on a path connected by neutral changes. However, it is not immediately
apparent what such evolutionary paths would look like, and how neutral and selective forces bias this search.
Here we elaborate some of the consequences of high mutational pressure on sequence complexity.
Prior simulation studies have suggested that evolutionary paths tend to sample \complex" regulatory
schemes (i.e. many, weaker TFBSs in contrast with fewer, stronger TFBSs), simply as a result of the
relative frequencies of possible regulatory schemes [He et al., 2012]. It is further emphasized that patterns
of seemingly nonrandom regulatory sequence complexity (measured in part by TFBS quantity) may simply
be the result of non-adaptive processes [Lynch, 2007a, Lusk and Eisen, 2010]. Furthermore, in order to
predict the likelihood of evolutionary changes it may be helpful to catalogue both the functional importance
of specic regulatory regions as well as the (in)visibility of regulatory regions (and sub-regions) to natural
selection.
To add to this discussion, we study the evolutionary dynamics of a population of Drosophila melanogaster,
characterized and assessed by the expression of its early developmental regulatory network under elevated
mutational pressure, focusing our analysis on binding energy prole evolution. The developmental model
employed suggests that the regulatory sequences under study are complex; composed of many weak (rather
than a few strong) binding sites. As such, there is only a weak correlation between a binding site's anity
and its impact on gene expression. We emphasize a specic question: given the small correlation between
sequence binding anity and functional importance, how can the evolutionary signicance (if any) of regu-
latory sequence reorganisation be assessed? We believe that a quantitative answer to this question will be
generalizable to systems lacking such detailed computational genotype-phenotype models.
In this simulation, we observe a quick and dramatic drop in the total quantity of TFBSs in conjunction
with a positive shift in the distribution of remaining regulatory binding energies through evolutionary time.
This reorganisation is partially in
uenced by a sequence's binding anity annotation specicity and redun-
dancy. TFBSs determined to be functionally important (some with weak binding anities) are shown to be
conserved during evolution. Finally, TFBSs in close proximity to important binding sites are more likely to
be maintained through evolutionary time.
12
2.2 Methods
2.2.1 Regulatory sequences and binding sites for gap genes
We analyzed regulatory regions of the gap genes hunchback (hb), Kr uppel (Kr), giant (gt), and knirps (kni)
and extracted TFBSs in these regions using the same procedure as described in Kozlov et al. [2015a]. The
putative regulatory regions spanned 12 Kbp upstream and 6Kbp downstream of the transcription start sites
for each gene in the reference D. melanogaster genome (dm3 / BDGP5; the newer version dm6 contains these
regions unchanged except the shifted coordinates). We predicted TFBSs in these regions for the transcription
factors Bicoid (Bcd), Caudal (Cad), Hunchback (Hb), Giant (Gt), Kr uppel (Kr), Knirps (Kni), Tailless (Tll),
and Huckebein (Hkb) by using position weight matrices (PWMs) [Kulakovskiy and Makeev, 2009]. The
PWMs for these TFs were presented in Kulakovskiy et al. [2010] (http://www.autosome.ru/iDMMPMM/),
and the thresholds were selected as in Kulakovskiy et al. [2009]. We included a predicted TFBS in the model
if it satised at least one of the following conditions: (1) the binding site had a high PWM-score and was
located in the open chromatin domain according to the DNase I accessibility data [Li et al., 2011], or (2) the
site overlapped with one of the cis-regulatory modules according to RedFly database [Gallo et al., 2010], or
(3) the site overlapped with one of the footprint sites [Gallo et al., 2010].
2.2.2 Gene expression model
We predicted gene expression in the Drosophila gap gene network using a hybrid thermodynamic model
reported in Kozlov et al. [2015a] (denoted as `Model 4' in the cited paper). The model combines the
thermodynamic approach for calculating the probability of transcriptional activation of target genes and the
reaction-diusion equations for the spatiotemporal dynamics of gene products, as we describe brie
y in what
follows and in full details in Appendix 2.A. The transcriptional activation of a gene is assessed based on the
information about the TFBSs in its regulatory sequence and concentrations of the TFs Bcd, Cad, Hb, Gt,
Kr, Kni, Tll, and Hkb in nuclei at the A-P axis of the blastoderm-stage embryos during cleavage cycles 13
through 14A. The TFs Bcd, Cad, Tll, and Hkb provide external inputs to the model equations, and the
mRNA and protein concentrations for the gap genes hb, Kr, gt, and kni are calculated as the model output.
Therefore, the model accounts for the regulatory interactions between the four gap genes and the external
in
uence from the other four proteins.
We used values of free parameters in the model that were previously obtained in Kozlov et al. [2015a]
for D. melanogaster reference genome by tting the model output to the wild type gap gene expression data
13
at cellular resolution (Appendix 2.A) [Pisarev et al., 2009]. Using these parameter values, we calculated the
`wild-type' solution U
p
(a;i;t), which is the concentration of gene product p (p = `mRNA' or p = `protein')
for gene a, nucleus i, and time t.
2.2.3 Evolutionary algorithm
We simulated the evolution of 100 haploid, sexually reproducing individuals, holding all parameters, includ-
ing population size, mutation rate (0.001 per base pair per generation), and selection constant for 3,350
generations. Genotypes are represented as the DNAase accessible regions of four 18kb regulatory sequences
(described above)
anking the gap genes hb, Kr, gt, and kni. Phenotypes are the spatial and temporal mRNA
and protein expression dynamics of the respective Drosophila gap gene regulatory network, and are derived
by simulating development with an individuals specic regulatory genotype.
After mutation, phenotypes are computed and then sorted in order of increasingrms-scores a measure of
capability to mimic wild-type computational expression patterns. The rms-score F is calculated as follows:
F =
s
1
N
X
p;a;i;t
(u
p
(a;i;t)U
p
(a;i;t))
2
; (2.1)
where u
p
(a;i;t) is the model output for the mutated regulatory sequence, and U
p
(a;i;t) is the wild-type
model output. The summation takes place over all values of the indices for which the experimental data
on the wild type gene expression is available. The overall number of the terms in this sum equals to N =
6800 [Kozlov et al., 2015a]. From the subpopulation of genomes that produce expression patterns suciently
greater than or equal to the 20th best individual, two parents are randomly selected with replacement
and recombine regulatory sequences with equal probability, until the following generation is populated.
Recombination only shues disparate regulatory sequences, with no recombination break points within the
18kb sequences. The simulation is seeded with a population of genetically identical wild-type regulatory
sequences. The gap gene expression patterns for the 20 best t individual sequences from the last generation
are shown in comparison with the initial patterns in Appendix 2.5.
2.2.4 Transcription factor binding site tracking
The evolutionary trajectories of TFBSs were determined by tracking binding site starting position coordi-
nates. First, we identied TFBSs present in all generations with identical coordinates. For other TFBSs
we discriminated between site movement and site loss by comparing neighboring generations. We assumed
14
that a site S moved if in the next generation there existed a site for the same TF within the vicinity of
S. This vicinity of S is dened as the sequence, three times the site length,
anking and containing S. If
there were no TFBSs for the same TF in this vicinity, S is considered to have disappeared. In addition
to site deaths, site births occur when a TFBS appears in a generation where there were no sites within its
vicinity previously. A birth event is considered to be a rebirth if the new site is located in the above-stated
vicinity of the wild type coordinates ofS. The algorithm stores the trajectories of all TFBSs for a randomly
chosen individual in each generation. We used TFBS tracking for individuals randomly chosen from the
20 most t individuals from each generation. For TFBSs other than the tracked ones, the averaging of the
binding energyE and other quantities is taken across the 20 most t individuals from each generation. This
averaging is justied since the sequences do not show signicant multimodality inE distributions (Appendix
2.5).
2.3 Results
We estimate the phenotypic importance of an individual TFBS in the initial (wild-type) regulatory sequence
by calculating the rms-dierence between the model wild-type expression patterns, with and without a
particular TFBS. We call this dierence the regulatory rms-score of the binding site, and larger values
of this score correspond to a stronger in
uence of the TFBS on expression, and vice versa [Kozlov et al.,
2015a]. Our model reveals only a small correlation between the binding anity of a TFBS and its functional
importance assessed via its rms-score (Figure 2.1; the Pearson correlation coecient CC = 0:30). In what
follows, we investigate how this small correlation leads to the specic variability of the binding energy prole
during the simulated evolution of the regulatory sequences.
15
Figure 2.1: Correlation between the binding energy E and the log-transformed regulatory rms-
score of TFBSs from the gap gene regulatory regions. The scatterplot is computed for the wild-type
conditions, i.e. for the initial regulatory sequence.
2.3.1 Sequence simplication
Evolution was simulated under an elevated mutational pressure (mutation rate = 0:001) leading to a
decrease in regulatory sequence complexity, as there was a signicant drop in the total number of TFBSs
for almost all TFs (Figure 2.2A). The dynamics of the total number of sites is balanced by a steep decrease
in the the number of initial sites and a saturating increase of the number of sites born throughout evolution
(Figure 2.2B). The sequence reacts to the elevated mutation rate by changing its TFBS composition, with
new sites surviving twice as long as the initial sites. The nal quasi-steady numbers of TFBS for each TF is
a result of mutation selection balance.
The TFBS disappearance rate also depends on TFBS modify redundancy, determined by the number
of sequence motifs of similar anity and nucleotide sequence available to a TF. Such a TF should have
a large number of self-overlapping sites and, as a consequence, a higher probability of site loss because a
single substitution in the overlapping region is able to destroy multiple sites simultaneously. We can use the
fraction of self-overlapping events in the initial set of TFBSs as a measure of such redundancy. When this
fraction is large, the double site loss due to mutation in the overlapping region is expected to be signicant,
leading to a decrease in this fraction as the total number of sites decreases. On the other hand, when the
fraction of self-overlaps is initially small, single sites will disappear predominantly with time, hence, the
fraction of self-overlaps will increase or stay constant as the total number of sites decreases.
16
We observe a relationship between Hb TFBS motif redundancy and Hb TFBS loss: there is a high positive
correlation between the fraction of self-overlaps and the total number of TFBSs (Figure 2.3A,C). This is
consistent with the fact that Hb TFBSs exhibit the steepest decline in Figure 2.2A. Initially Hb had the largest
number of binding sites because it binds to self-repeating polyA and polyT segments (Figure 2.3B) and, thus,
has many self-overlapping sites. This eect is present in other TFs but to a lesser degree (Figure 2.3C), i.e.
the larger initial fraction of self-overlaps, the more sites are lost due to the double-site mutations.
The fraction of overlapping events among all TFBSs (not only for the same TF) also decrease with time
(Figure 2.3D), showing that the same logic holds for heterotypic overlapping. The essential part in this
fraction is due to Hb sites, however excluding these sites does not change the reduction trend.
Figure 2.2: Dynamics of the number of TFBSs. A: The generation average number of all binding
sites for each TF is shown normalized by the initial value of this number. These are the generation average
initial/nal numbers of TFBSs for each TF: 507/135 (Hb), 122/57 (Kr), 125/60 (Gt), 116/43 (Kni), 150/69
(Bcd), 184/66 (Cad), 121/58 (Tll), 55/58 (Hkb). B: The dynamics for all TFBSs (red), the initial set of sites
(blue), and sites born during the simulation (green). The red and green curves were obtained by averaging
the corresponding numbers of sites across the 20 best individuals in the population, while the blue curve
stems from the site trajectories tracked across a randomly chosen sequence from the 20 best t individuals
from each generation.
17
Figure 2.3: Analysis of TFBS overlapping events. A: The generation average number of self-overlapping
events for Hb TFBSs vs. the generation average total number of TFBSs for the TF. The number of self-
overlaps in a generation is shown as a fraction of the average total number of Hb sites in this generation. B:
The sequence logo of the consensus TFBS motif for Hb. The analogs of (A) and (B) for other TFs are shown
in Appendix 2.5. C: The number of TFBS self-overlaps for each TF in the initial regulatory sequence, as a
fraction of the initial number of sites for each TF, vs. the correlation coecients for the TF-specic analogs
of the scatterplot shown in (A). D: The dynamics of the population average number of overlapping events
between all TFBSs (magenta) and between all but Hb sites (blue). The number of overlaps is computed as
a fraction of the population average total number of sites under consideration.
2.3.2 Variability of binding energy for dierent sets of sites and sequence an-
notations
We follow individual TFBSs (referred to as \tracked sites"), present in high tness individuals, through
the course of the simulation. Tracked sites' binding energy distributions dier signicantly from newly
appearing sites' distributions, and are also distinct from the total distribution (Figure 2.4A{C). The tracked
18
sites exhibit an excess of higher energy sites compared to the new sites (Figure 2.4A). As the new sites
eventually form the majority (Figure 2.2B), the total E distribution follows that for the new sites. Mean
energy for the tracked sites increases with time, whereas the new sites' mean energies oscillate stochastically
around a quasi-steady level (Figure 2.4B). The tracked sites also have lower energy variability compared
with the majority of sites (Figure 2.4C). These results suggest that, despite being in an environment of vast
sequence variability, the initial TFBSs do not dissolve with time in newly born sites and keep robust function
as a whole.
The within-generation distributions of E for the tracked TFBSs are indistinguishable between the rst
and mid generations (p = 0:32 based on the bootstrapped Kolmogorov{Smirnov (KS) test), while for the last
generation the distribution shows a small bias towards largerE values (p = 0:02; Figure 2.5A). The statistical
dierence with the rst generation becomes persistent only after approximately the 2500th generation and
demonstrates rather damping oscillations before that (Appendix 2.5). These oscillations correspond to
stochastic events during evolution associated with the appearance of many low energy sites over a short time
period (Figure 2.4D), in opposition to the main trend of such site loss within the tracked sites.
To further examine sequence simplication, we consider a coarser grained sequence annotation and com-
pare its energy prole properties with TFBS energies. We arbitrarily break each gap gene regulatory sequence
into 180 nucleotide bins (or segments), resulting in 104 bins for the four genes. For each segment, we dene
a binding energy (per bp) E
bin
as the sum of the TFBS binding energies within each segment, normalize
by segment length, and average across individuals in a generation. Therefore, E
bin
represents the `binding
capacity' of a segment: the more high anity sites it contains, the larger E
bin
it has.
This annotation leads to a qualitatively dierent picture of binding anity variation. While the tracked
TFBSs evolve towards a larger average energy (Figure 2.4B, Figure 2.5A), the dynamics of the E
bin
distri-
bution exhibits a shift towards smaller mean values and smaller variances (Figure 2.5B). This shift is related
to the loss of TFBSs, as fewer sites in a bin lead to smaller energy weights of nucleotides within each bin.
The E
bin
distribution is rst distinguished from the distribution at the initial generation at approximately
generation 220. Calculating the dierence dE =EE
wt
between the binding energy during evolution with
its wild-type (initial) value E
wt
for the sites and bins, we see that, in contrast to the individual TFBSs, the
dierencedE
bin
=E
bin
E
wt
bin
for the bins has a distribution shifted towards negative values (Figure 2.5C,D).
Therefore, the sequence simplication due to the elevated mutational pressure leads to opposite trends at
the level of the binding energy for dierent sequence annotations.
19
Figure 2.4: Binding energy variability for tracked and new TFBSs. A: Probability density functions
(PDF) for the last generation distribution of the binding energy E for all TFBSs (red), new TFBSs (green),
and sites tracked from the initial set of sites (blue). B: Dynamics of the mean energy for the same three
sets of sites as in (A), with the same colour code. For all and new sites, the energy was averaged across the
20 best individuals, and for the tracked sites, across the sites of a randomly chosen sequence from the 20
best t individuals. C: Temporal distributions of the coecient of variation (CV=SD/mean) of E for the
same three sets of sites as in (A) and (B), with the same colour code. The CV values correspond to the E
variation within the population in a given generation, and the plotted distributions comprise CV values for
all generations. The rst 100 generations were not taken into account in the distributions due to a small
number of new sites in that period (see Figure 2.2B). D: Dynamics of the low energy sites number to the
high energy sites number ratio for the tracked TFBSs. The sites were classied to the low and high energy
sets using E = 4 as the threshold. The arrows mark two stochastic bursts of the amount of low energy sites
during the evolution.
Figure 2.5: Binding energy variability for the tracked TFBSs and sequence segments. Results for
TFBSs are shown in (A, C), and for the sequence segments (bins) in (B, D). A, B: The probability density
functions (PDFs) for the within-generation distribution of the binding energyE are shown for three dierent
generations: the rst, mid, and last ones. C, D: Distributions of dE = EE
wt
values for all TFBSs (or
bins) and all generations.
20
2.3.3 Correlations with rms-scores of individuals
The per generation mean and normalized SD of individuals'rms-scores saturates rapidly with a slight decline
of the CV through time (Figure 2.6A,B). We may consider these dynamics as a stochastic
uctuation of the
system around a quasi-steady state of constant tness. We also see this fast saturation in the gure of
correlation between mean rms-score and mean energy for various sets of TFBSs (Figure 2.6C,D), where an
rms-score increase is accompanied with a decrease (increase) of the mean E for all (tracked) sites. These
gures also show that, in the quasi-steady tness regime, there are no signicant correlations between tness
of the individual and the mean energy. The same is true for correlation between the CVs forE andrms-score
(Appendix 2.5).
Figure 2.6: The rms-score dynamics and correlation with binding energy. A, B: Dynamics of
the generation mean (A) and coecient of variation (B) for individual rms-scores. C: Correlation (across
generations) between the generation meanrms-score of individuals and the generation meanE for all TFBSs.
The red points correspond to the rst 1100 generations, the green ones to the next 1100 generations, and
the blue points to the last 1150 generations. D: The same as in (C) but for the tracked TFBSs, with the
same colour code. The mean E in this case is computed across all tracked sites in the best tted individual
in a given generation.
2.3.4 Mutual correlations between sets of sites and sequence segments
The rapid change of binding energies throughout evolution inevitably leads to some general patterns of
correlations between temporal proles ofE for distinct TFBSs and their groups. One such correlation pattern
21
corresponds to the fact that low anity sites are maintained through time only if their binding energies
increase, whereas high anity sites are more likely to experience an energy decrease. As a consequence, we
may expect higher positive correlations within the groups of low and high anity binding sites.
In order to look at such correlations, we arranged the tracked TFBSs in order of their, increasing, initial
(wild-type) binding energies and partitioned this list of sites into non-overlapping 35 sets. Then for each
generation, we averaged energy inside each site set. Finally, we correlated the samples of averaged energies
for all generations for any pair of the site sets (Figure 2.7A). The clusters of positive correlations in the lower
left and the upper right corners of this gure correspond to the described correlation pattern for the low and
high anity sites, respectively. This general correlation pattern may hide functional correlations between
TFBSs at the level of binding energy, but it can be used as a background model for nding such functional
correlations by statistical methods.
We can also use the correlation matrices to answer the question about how fast the binding energy
distribution in the sequence changes throughout evolutionary time. In this case, we calculated correlations
between the distribution of E
bin
in the sequence for a given generation with the same distributions for
all other generations (Figure 2.7B). The narrow regions of small correlations in this gure illustrates the
two properties of the evolutionary simulation: the rapid early change of the energy prole and the later
saturation, making more distant generations hardly distinguishable.
Figure 2.7: Correlation matrices for E proles for TFBS clusters and sequence bins. Left:
Correlations of the temporal E proles for the sets of TFBSs of varying anity. Set #1 (#35) corresponds
to the tracked sites initially having the smallest (largest)E. Right: Correlations of the sequenceE
bin
proles
for all pairs of generations. See the text for full details of these gures.
22
2.3.5 Core TFBSs and their vicinity
The extensive loss of tracked TFBSs through time leads to a broad longevity distribution, with a mode
corresponding to longer-living TFBSs (Figure 2.8A). We checked whether TFBS lifespans change during
evolutionary time, i.e. whether later born sites display increased or decreased longevity compared with
earlier born ones. We compared the lifetime distribution of the tracked binding sites for sites that were born
and then died within the rst 500 generations with those from the last 500 generations, and did not nd a
statistical dierence between these distributions (Appendix 2.5). However, if we take into account the initial
sites that die in the rst 500 generations, the average lifespan for the group of early sites becomes larger
than that for the later sites.
We extracted 85 TFBSs that remained in the population for the entire simulation (`core binding sites')
and investigated their behavior (the placement of these sites in the sequence is illustrated in the genome-
browser like gures in Appendix 2.5). The analog of Figure 2.5A implies that the binding energy distribution
across core sites is highly conserved during evolution (Figure 2.8B: The appearance of the high-anity mode
in the last generation is not statistically signicant). The core sites are clearly segregated from other tracked
TFBSs with respect to their wild-type regulatory rms-scores (Figure 2.8C). Functionally important TFBS
are typically core sites as well.
We also studied how a TFBS's local sequence environment in
uences its evolutionary dynamics. We
considered two types of the tracked non-core TFBSs in the vicinity of the core sites. The rst type are sites
located no farther than 50 bps from at least one core site of the same TF, being thus under conditions of the
cooperative interactions with such core sites (Appendix 2.A). There are 61 such cooperative sites in total.
The second type are TFBSs overlapping with at least one core site (and not present in the set of cooperative
sites); there are 127 such sites. We call the joint set of TFBSs of both types neighboring sites.
Just like the core sites, the neighboring sites tend to be maintained for longer periods of time (Fig-
ure 2.8D), suggesting that sites within the vicinity of core TFBSs are more highly conserved during evo-
lution. Interestingly, the two types of the neighboring sites possess dierent distributions of the wild-type
rms-scores (Figure 2.8C). The cooperative sites are clearly distinguishable from the majority of sites and
hold a rather intermediate position between the full distribution and the distribution for the core sites. The
sites overlapping with the core ones are indistinguishable from the majority of sites, showing no essential
in
uence on gene expression for the wild-type conditions. This dierence can be explained by dierent types
of in
uences on the core sites. While the cooperative sites actively in
uence the core TFBSs, the sites
overlapping with the core TFBSs may live longer due to negative selection of mutations in the overlapping
23
region with the core site. Therefore the higher retention rate for the sites overlapping with the core TFBSs
can be interpreted as an example of a selective sweep.
The bimodality of therms-score distribution of the core sites is also related with the overlapping between
sites. The left peak of the red curve in Figure 2.8C is formed by the core sites that overlap with other core
sites (Addition le 2). As these sites have a small impact on gene expression (for the initial sequence), we
may suggest that their ultimate longevity is explained solely because they overlap with core sites that have
a stronger in
uence on expression.
Figure 2.8: Statistics for the core TFBSs and their vicinity. A: Distribution of lifetimes for the
tracked TFBSs. B: The within-generation distributions of the binding energy E of 85 core TFBSs shown
for three dierent generations (analog of Fig. 2.5A for the core sites). C: Distribution of the regulatory
rms-scores of TFBSs in the model (the strength of in
uence that a site exerts on gene expression) for the
initial wild-type conditions, for ve sets of TFBSs: core sites, non-core sites, all sites, sites cooperatively
interacting with the core sites, and non-core sites overlapping with the core sites. D: Distributions of lifetime
for the neighbouring TFBSs (red) and for all except core binding sites (blue). The dierence between the
red and blue distributions are statistically signicant under 5 percent level based on the bootstrapped KS
test.
2.4 Discussion
In this study, we have provided an example of regulatory sequence change through evolutionary time. We
pondered that given the large set of phenotypically and functionally equivalent regulatory schemes (when
varying TFBS quantity, anities, availability, etc.) that only a subset of these will be explored during
24
evolution, and are contingent upon specic acting population, molecular, and historical forces. Previously,
it was shown that high quantity TFBS sequences make up a relatively larger portion of acceptable sequence
space and are consequently more likely to be frequented during evolution [He et al., 2012]. Mutational
pressure seems to bias the evolutionary search of suitable regulatory sequences towards fewer TFBSs.
Under signicant mutational pressure, our simulations demonstrate that the cumulative number of phe-
notypically important TFBSs diminishes in concert with an increase in average binding site anities. This is
likely a direct consequence of the mutation rate, as having fewer TFBS, and thus fewer functionally relevant
nucleotides, minimizes the mutational target of the genome. Despite this constraint, the Drosophila gap gene
network ably compensates by tweaking the binding energies of the remaining TFBSs. The observed change
in low- and high-anity binding sites is likely a result of compensatory actions. The trade-o between TFBS
frequency and anity seems to be a general feature of regulatory networks [Stewart and Plotkin, 2013]. This
may also be consistent with claims that much of regulatory sequence organisation is predominantly shaped
by neutral forces [Lynch, 2007a].
In this simulation, we do not only observe sequence drift, but we also notice qualitative changes in higher
level descriptors of the regulatory network (such as binding energy). One might reasonably expect that
wild-type regulatory sequences and higher-level network descriptors would already be optimized and at equi-
librium, and that any of the observed qualitative changes here are a consequence of specic experimental
conditions. This is seemingly a parsimonious explanation of our results and further reveals the specic in
u-
ence of the mutation rate on the equilibrium values of higher-level network descriptors. If fewer TFBSs than
are present in the wild-type sequence can achieve nearly identical expression patterns, is there a selective
advantage to the wild-type regulatory strategy, or is it merely a consequence of neutral processes [Lynch,
2007a, Lusk and Eisen, 2010, He et al., 2012]? Our results showing slightly larger deviations of expression in
derived populations could hint at the involvement of selective forces (overwhelmed in the present study). It
is also important to note that our methodology focuses solely on several minutes of early development and
our articial version of selection is restricted to assessing brief spatiotemporal expression patterning, neglect-
ing any extraneous potential functionality. Consequentially, a population at or near selective equilibrium,
experimentally subjected to only a subset of selective criteria, may incorrectly appear to be unnecessarily
complex.
One non-neutral possibility is that the more TFBSs involved in regulation, the more ne-tuned expres-
sion patterns can be. However, the increased control provided by additional TFBSs may diminish as a
function of TFBS quantity. As such, the benets provided by each additional ne-tuning site will eventually
25
be outweighed by the cost of maintaining a larger mutational target. Although beyond the scope of the
present study, these trade-os, if necessary, may be better understood and predicted within the context of
population genetic modeling. Future research can add mathematical clarity to the above speculations and
further elaborate the connections among specic population parameters and network descriptor equilibriums.
Originally developed within the context of RNA evolution and mutation, mathematical models that examine
the trade-os among replicative delity of a sequence, sequence length, and mutation rates, may provide
further insight [Eigen et al., 1988].
Core TFBSs, TFBSs that remain throughout the entire evolutionary history of the simulation, are iden-
tied and shown to be extremely important towards maintaining wild-type expression patterns. In contrast,
many sites with smaller phenotypic eects turnover often, and are suggested candidates responsible for ob-
served sequence drift in actual biological populations. Another result of this simulation, consistent with
previous work [Kozlov et al., 2014, 2015a], is the fairly weak correlation between TFBS energy and the
impact of its removal on the entire systems dynamics. This further illustrates the necessity of realistic com-
putational genotype-phenotype maps to assigning evolutionary functionality. The selective importance of
many TFBSs can only be deeply understood within the context of the system it operates in, and not solely
from local measurements.
We also note that TFBSs within close proximity of core sites are retained at a higher rate than other non-
core TFBSs, despite sharing a similar perturbation distribution with other non-core, non-proximal binding
sites. In accordance with our expectations of the in
uence of mutational pressure, this pattern may be a
consequence of these non-core TFBSs sharing nucleotides with neighboring core TFBSs, and thus containing
fewer mutable nucleotides. Others also observed increased longevity of overlapping binding sites [Lusk and
Eisen, 2010]. Additionally, close core proximity TFBSs may be favorable due to cooperative interactions of
these sites with the core sites.
The presented results have some limitations. We analyzed the results of a single evolutionary simulation
with a xed mutation rate. New simulations with varying parameter schemes are likely to clarify and rene
our understanding of sequence evolution. In particular, the core TFBSs and other aspects may be dierent
in other simulations, as evolution is a stochastic process. Despite the complexity of our model and its
detailed focus on molecular mechanisms, it has mechanistic limitations as well. Only a subset of the known
mechanistic details in regulation are simulated and, in particular, the 3D genome organisation, is presently
neglected. Models that include these details might disagree on the position of DNA and on the regulatory
role of certain TFBSs, and potentially lead to dierent observations. Overcoming these modeling limitations
26
is a desirable condition for future evolutionary studies.
2.5 Conclusions
Our simulations of genetic regulatory network evolution suggest that in response to elevated mutational
pressure, the size of the functional regulatory sequence decreases to minimize risk. To compensate for this
organisational constraint, TFBSs with, on average, stronger binding anities are selectively maintained. The
small correlation between TFBS anity and functional importance means that gene network evolution tends
towards sequence organisations having many weak TFBSs working in concert. This can make conclusions
solely based on the analysis of the binding anity landscape vague and incomplete. However, we show that
core TFBSs, which form the regulatory backbone of the network, are highly conserved and can be identied
at the level of binding energy dynamics. TFBSs that interact with these core binding sites also exhibit
increased longevity. Despite the present study's focus on a specic system and parameter regime, its results
will likely be relevant to more general studies of the TFBS evolutionary landscape, and its conclusions useful
to systems lacking genotype-phenotype maps.
Acknowledgements
We thank Ivan Kulakovskiy and Peter Ralph for helpful discussions. The work was supported by the Russian
Science Foundation, grant number 14{14{00302 (analysis of regulatory sequences), by the National Institute
of Health, grant numbers U01 GM103804, RO1 GM098741 (development of the simulation program), and
by the Russian Fund for Basic Research, grant number 14{04{01522a (development of the analysis tools).
Chapter 2 Appendix
2.A Model equation and parameter values.
27
Additional file 1: Gene expression model
In this text, we present details of Model 4 from (Kozlov et al., 2015), which was used
for the evolutionary simulations. This model describes the expression of gap genes hunchback
(hb), Kruppel (Kr), giant (gt), and knirps (kni)duringtheblastodermstageof Drosophila
melanogaster development. These genes participate in the formation of the segmentation
prepattern underlying the body plan of the fruit fly (Jaeger, 2011). The embryo consists
of nuclei undergoing a series of synchronous divisions during this stage. We consider a time
lapse from the 13th nuclear cleavage cycle to the end of the cycle 14A, and nuclei from the
anterior-posterior (A-P) axis of the embryo ranging from 35% to 92% of the embryo length.
The expression dynamics of the gap gene network inside each nucleus is according to the
following reaction-di↵ usion equations (Kozlov et al., 2015, 2014):
du
a
i
dt
=R
a
u
E
a
i
(t) a
u
u
a
i
+D
a
u
(n)[(u
a
i 1
u
a
i
)+(u
a
i+1
u
a
i
)], (1)
dv
a
i
dt
=R
a
v
u
a
i
(t ⌧ a
) a
v
v
a
i
+D
a
v
(n)[(v
a
i 1
v
a
i
)+(v
a
i+1
v
a
i
)], (2)
where u
a
i
and v
a
i
are mRNA and protein concentrations, respectively, of the products of gene a
in nucleus i, R
a
v
and R
a
u
are maximum synthesis rates, D
a
v
and D
a
u
are di↵ usion coe cients, a
v
and a
u
are decay rates, n is the cleavage cycle number, and the ⌧ a
are time delay parameters
accounting for the delay between transcription initiation and protein appearance. These equa-
tions implement three processes for the gene products (from left to right on the right hand side
of the equations): synthesis, decay, and di↵ usion (nucleus-to-nucleus transport). The mRNA
synthesis term depends on the probability of transcriptional activation E
a
i
(t)ofgene a in nu-
cleus i at time t. The protein synthesis is proportional to the mRNA concentration. The list
of transcription factor (TFs) regulating the activity of the four gap genes include their own
proteins (Hb, Kr, Gt, and Kni) and the proteins of four external regulators: Bicoid (Bcd),
Caudal (Cad), Tailless (Tll), and Huckebein (Hkb).
The probability of transcriptional activation E
a
i
(t)isinterpretedasbeingequivalentto
the fractional occupancy of the promoter by the basal transcriptional machinery (BTM) and
calculatedusingathermodynamicapproachintheformproposedbyHeetal.(2010)andfurther
modified in our papers (Kozlov et al., 2014, 2015). In what follows, we describe derivation of
this probability omitting the indices a and i and time variable t for brevity.
The statistical thermodynamics approach counts all possible configurations of the complex
that consists of the regulatory region of a gene together with its basal promoter. We assign the
‘ON’ state to those configurations of the complex in which the BTM is bound to the promoter,
andthe‘OFF’statetotheconfigurationsinwhichthepromoterisunoccupied. Aconfiguration
of the regulatory region is indexed by a vector = { (s)} containing the information about
the occupation status of the transcription factor binding sites (TFBSs) in the region: (s)=0
if TFBS s is free and (s)=1ifthissiteisoccupied. Theoccupancyofsite s is characterized
by the statistical weight q
s
:
q
s
=Kvexp( P
s
), (3)
1
Chapter 2 Appendix
2.A Model equation and parameter values.
28
where K is the association constant for a strongest TFBS (one constant per TF), v is the
protein concentration for the TF binding to site s, and P
s
is the di↵ erence between a PWM
(positional weight matrix) score of site s and the score of the strongest site.
Given the statistical weights of all TFBSs in a configuration of the regulatory region, we
calculate the statistical weight W
of this configuration as follows:
W
=
Y
s
(C
s
q
s
)
(s)
, (4)
where the exponentiation to the power of (s)guaranteesthatonlytheweights q
s
of occupied
TFBSsarepresentinW
. TheparameterC
s
accountsfortwopossibletypesoflocalinteractions
between TFs bound to site s and to other sites in a given sequence range: cooperative binding
and short-range repression. If site s binds TF cooperatively with neighboring sites j of the
same TF, we have
C
s
=
Y
j
!
(j)
, (5)
where! > 1isthecooperativityparameter(oneparameterperTF).Theshort-rangerepression
mechanism provides the inhibition of the target gene by inactivating TFBSs for activators
(forbidding the binding to these sites) in the vicinity of TFBSs occupied by repressor TFs.
According to this mechanism, a TFBS occupied by a repressor can be in two states: e↵ ective
and ine↵ ective. It is assumed that the DNA in the vicinity of the site in the e↵ ective occupied
state changes its local conformation making this vicinity inaccessible for binding, while the
ine↵ ective state does not influence the local DNA. If site s binds TF-repressor and is in the
ine↵ ective state in configuration , we write C
s
=1. Forthee↵ ective state, we have C
s
= ,
where the parameter quantifies the repression e ciency (one parameter per TF-repressor
and target gene). The cooperativity and repression parameters are multiplied in C
s
if the
corresponding mechanisms coexist for a given site and configuration.
A configuration of the regulatory region with the weight W
may lead to either free basal
promoterorthepromoteroccupiedbytheBTM,i.e. toeitherOFForONstateofthecomplex,
respectively. The statistical weight of the ON state equals to W
Q
, whereQ
accounts for the
interaction with the BTM of activator TFs bound in the regulatory region. Under the limited
contacthypothesis,weassumethatnotmorethanN activatorTFscaninteractsimultaneously,
either directly or via adaptor factors, with the BTM (He et al., 2010). Therefore, we have for
Q
:
Q
=
N
X
k=1
X
i
1
,...,i
k
↵ (s
i
1
)···↵ (s
i
k
), (6)
where the summation is taken over all di↵ erent k-length (k N)combinationsofactivator
TFBSs (s
i
1
,...,s
i
k
) occupied in the configuration ,and ↵ (s)= ↵ TF(s)
is the activation e -
ciency parameter for each activator TF (one constant for each activator TF and target gene).
The statistical weight of the OFF state of the complex equals to W
, i.e. the relative weight of
the empty promoter is set to unity.
Finally, the probability of transcriptional activation is calculated as a fractional occupancy
of the basal promoter in the complex with the regulatory region:
E =
Z
ON
Z
ON
+Z
OFF
,Z
OFF
=
X
W
,Z
ON
=
X
W
Q
, (7)
where Z
ON
and Z
OFF
sum up statistical weights of all possible states of the complex with the
BTM bound and not bound to the basal promoter, respectively (He et al., 2010; Kozlov et al.,
2014, 2015). As the TF protein concentrations vary for genes a, nuclei i, and times t, i.e.
v =v
a
i
(t) in eq. (3), we have the same dependence for E: E =E
a
i
(t).
2
29
The model also implements a possible dual regulatory action of TF a on target gene b,
when the type of action (either activation or repression) depends on the TF concentration v
a
i
:
it is activation for small concentrations and repression for large ones. In this case, we replace
a single parameter quantifying the regulatory action strength (either activation e ciency ↵ or repression e ciency ) by three parameters: a threshold concentration V
a
, an activation
strength ↵ for the case when v
a
i
V
a
, and a repression strength for the case when v
a
i
>V
a
.
This dual regulatory interactions are assumed for the mutual regulation of gap genes hb and
Kr (Kozlov et al., 2015).
The parameter values (Table S1) were obtained by fitting the model output to the wild
type gap gene expression data at cellular resolution (Pisarev et al., 2009), as described by Ko-
zlov et al. (2015). We used methods of cross-validation analysis, fitting to nonsense data, and
local identifiability analysis to show previously that the model is relatively stable to overfit-
ting (Kozlov et al., 2014). The model predictions for parameter values from Table S1 were
successfully tested on the expression data for a number of experimentally characterized genetic
constructs (Kozlov et al., 2015).
References
Xin He, Md Abul Hassan Samee, C Blatti, and Saurabh Sinha. Thermodynamics-based models
of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative
binding and short-range repression. PLoS computational biology, 6(9):e1000935, 2010.
Johannes Jaeger. The gap gene network. Cellular and molecular life sciences : CMLS,68(2):
243–274, January 2011.
K N Kozlov, Vitaly V Gursky, Ivan Kulakovskiy, and Maria G Samsonova. Sequence-based
model of gap gene regulatory network. BMC genomics, 15(Suppl 12):S6, 2014.
K N Kozlov, Vitaly V Gursky, Ivan V Kulakovskiy, Arina Dymova, and Maria G Samsonova.
Analysis of functional importance of binding sites in the Drosophila gap gene network model.
BMC genomics, 16(Suppl 13):S7, December 2015.
Andrei Pisarev, Ekaterina Poustelnikova, Maria G Samsonova, and John Reinitz. FlyEx, the
quantitative atlas on segmentation gene expression at cellular resolution. Nucleic Acids
Research, 37(Database issue):D560–6, January 2009.
3
30
Table S1. Parameter values in the model.
Hb Kr Gt Kni Bcd Cad Tll Hkb
↵ / for hb 426 6525/126 518 8609 635 290 1068 266
↵ / for Kr 2148/200 281 697 353 638 465 1374 305
↵ / for gt 1978 3926 4840 469 15 97 565 2997
↵ / for kni 2109 572 460 205 12 1020 1516 1740
K
a
0.0001 0.0001 0.0475 0.0471 0.0125 0.0121 0.0498 0.0120
!
a
1 1.1 1.5 5.2 1 1 4.7 1
⌧ a
3.5 1.3 7.3 1.0 – – – –
R
a
u
0.0760 0.0954 0.0951 0.1050 – – – –
R
a
v
11.10 20.55 20.19 9.70 – – – –
a
u
1.14 1.38 3.07 1 – – – –
a
v
1.55 16.61 8.93 8.28 – – – –
The first four rows contain the activation e ciency parameters (↵ , shown in red) and the
repression e ciency parameters ( , shown in blue) for each TF (in columns) and target gene
(in rows). Each TF except Hb and Kr can be either activator or repressor for a given target
gene. The regulatory interactions between genes hb and Kr are of the dual type, so the action
of Hb on Kr and Kr on hb is characterized by two action strength parameters ( /↵ ). The
protein concentration thresholds delimiting the activating and repressing actions (see the
text) are V
Kr
=13.40 and V
Hb
=52.27 (a.u.) for TFs Kr and Hb, respectively. K
a
is the
a nity constant for the strongest binding site of each TF. !
a
is the cooperativity parameter.
⌧ a
is the delay time (in minutes). R
a
u
and R
a
v
are the maximal synthesis rates for mRNA and
protein, respectively. a
u
and a
v
are log2/halftime (in minutes
1
) for mRNA and protein,
respectively. The parameters from the last five rows are only for the four gap genes. The
maximal number of activator molecules allowed to interact with the BTM: N=3. The
repression range (in basepairs) for the short-range repression mechanism: d = 198. An
additional parameter q
BTM
=4.5⇥ 10
6
is a factor by which Z
ON
from Eq. (7) is multiplied.
This parameter represents the basal level of the BTM-promoter interaction, so the statistical
weights corresponding to all other ON-states are counted from that level (He et al., 2010). All
parameter values in the table and listed above were found by the di↵ erential evolution
optimization method minimizing the di↵ erence between the model output and the wild type
gap gene expression data. The following di↵ usion parameters from the model equations were
fixed during optimization: D
a
u
=0.05 and D
a
v
=0.005 for each gene a; these are values for the
cleavage cycle 14A, and they are divided by 4 for the previous cycle since the internuclear
distance in cycle 13 is twice the distance in cycle 14A.
4
31
1
Supplementary figures
Figure S1. Analogs of panel (A) (left column) and panel (B) (right column) of Figure 3 from the
main text, for all TFs except Hb.
2.B Supplementary figures.
32
2
Figure S2. Figure with the dynamics of the p-value for the E distribution difference from the
first generation (for the tracked TFBSs).
Figure S3. The same as in Figure 6C,D from the main text, but with the coefficient of variation
(CV=SD/mean) instead of the mean.
0 50 100 150 200 250 300
0.0
0.2
0.4
0.6
0.8
1.0
Generation (10 x gen.)
p-value
�� �� �� �� �� �� �� �� �� �� �� �� �� ��
�� ��
�� ��
�� ��
�� ��
�� ��
�� ��
�� � �� �
�� � �� ���- �����
�� �� �� �� �� �� �� �� �� �� �� �� �� ��
�� ��
�� ��
�� ��
�� ��
�� ��
�� ��
�� � �� �
�� � �� ���- �����
33
3
Figure S4. The expression patterns (protein concentrations) for four gap genes from the model
(red and black curves) and from data (blue dots), at the end of the cleavage cycle 14A of
Drosophila development. The red curves represent the model solution for the reference genotype
(the initial genotype in the evolutionary simulation). The black curves are the model solutions
for the 20 best fitted sequences from the last generation in the evolutionary simulation (they are
hardly visible because of too small deviations from the red curve). The horizontal axis shows the
position at the anterior-posterior (A-P) axis of the embryo in terms of percent of the embryo
length.
Figure S5. Distribution of the binding energy E for all binding sites in a sequence, for 20 best
fitted sequences in generation 10 (A), 1000 (B), and 3350 (C).
34
4
Figure S6. Distributions of lifetime of the tracked binding sites that both are born and die in the
first 500 generations (red) or in the last 500 generations (green). In A, the red sample is
additionally augmented with the initial binding sites that die in the first 500 generations. This
makes the two distribution statistically different. In B, such initial binding sites are not taken into
account, and the distributions become indistinguishable.
Figure S7. Distribution of the regulatory rms-scores of TFBSs in the model for the initial
sequence, for two sets of TFBSs: all core sites (A) and the core sites that do not overlap with
other core sites (B). All but one sites from the left mode of the distribution in A are involved in
overlapping with core sites.
35
5
Figure S8. The organization of the regulatory region for the gene Kruppel. DNA sequence is
shown below with the coordinates according to the dm3 assembly of the Drosophila
menalogaster reference genome. The coding sequence is in red, and the DNAse I accessible
domains are in green. The short vertical black lines on the sequence mark starting positions of
TFBSs. The pink triangles show the positions of the core binding sites. The blue boxes above the
sequence correspond to the placement of known enhancers, according to RedFly database.
Figure S9. The same as in Figure S8, but for the gene hunchback.
36
6
Figure S10. The same as in Figure S8, but for the gene knirps.
Figure S11. The same as in Figure S8, but for the gene giant.
37
Part II
Evolutionary System Theory
38
Chapter 3
System drift and speciation
Abstract
Even if a species' phenotype does not change over evolutionary time, the underlying mechanism may change,
as distinct molecular pathways can realize identical phenotypes. Here we use linear system theory to explore
the consequences of this idea, describing how a gene network underlying a conserved phenotype evolves,
as the genetic drift of small changes to these molecular pathways cause a population to explore the set of
mechanisms with identical phenotypes. To do this, we model an organism's internal state as a linear system
of dierential equations for which the environment provides input and the phenotype is the output, in which
context there exists an exact characterization of the set of all mechanisms that give the same input{output
relationship. This characterization implies that selectively neutral directions in genotype space should be
common and that the evolutionary exploration of these distinct but equivalent mechanisms can lead to
the reproductive incompatibility of independently evolving populations. This evolutionary exploration, or
system drift, proceeds at a rate proportional to the amount of intrapopulation genetic variation divided
by the eective population size (N
e
). At biologically reasonable parameter values this process can lead to
substantial interpopulation incompatibility, and thus speciation, in fewer than N
e
generations. This model
also naturally predicts Haldane's rule, thus providing another possible explanation of why heterogametic
hybrids tend to be disrupted more often than homogametes during the early stages of speciation.
This chapter is available on the bioRxiv:
JS Schiman and PL Ralph. System drift and speciation. 2018. bioRxiv.
Back to the Table of Contents
39
3.1 Introduction
It is an overarching goal of many biological subdisciplines to attain a general understanding of the function
and evolution of the complex molecular machinery that translates an organism's genome into the character-
istics on which natural selection acts, the phenotype. For example, there is a growing body of data on the
evolutionary histories and molecular characterizations of particular gene regulatory networks [Jaeger, 2011,
Davidson and Erwin, 2006, Israel et al., 2016], as well as thoughtful verbal and conceptual models [True and
Haag, 2001, Weiss and Fullerton, 2000, Edelman and Gally, 2001, Pavlicev and Wagner, 2012]. Mathemati-
cal models of both particular regulatory networks and the evolution of such systems in general can provide
guidance where intuition fails, and thus has the potential to discover general principles in the organization
of biological systems as well as provide concrete numerical predictions [Servedio et al., 2014]. There is a
substantial amount of work studying the evolution of gene regulatory networks, in frameworks both abstract
[Wagner, 1994, 1996, Siegal and Bergman, 2002, Bergman and Siegal, 2003, Draghi and Whitlock, 2015] and
empirically inspired [Mjolsness et al., 1991, Jaeger et al., 2004, Kozlov et al., 2015b, Crombach et al., 2016b,
Wotton et al., 2015, Chertkova et al., 2017].
It is well known that in many contexts mathematical models can fundamentally be nonidentiable and/or
indistinguishable { meaning that there can be uncertainty about an inferred model's parameters or even its
claims about causal structure, despite access to complete and perfect data [Bellman and
Astr om, 1970,
Grewal and Glover, 1976, Walter et al., 1984]. Models with dierent parameter schemes, or even dierent
mechanics can make equally accurate predictions, but still not actually re
ect the internal dynamics of the
system being modelled. In control theory, where electrical circuits and mechanical systems are often the
focus, it is understood that there can be an innite number of \realizations", or ways to reverse engineer
the dynamics of a \black box", even if all possible input and output experiments are performed [Kalman,
1963, Anderson et al., 1966, Zadeh and Deoser, 1976]. The inherent nonidentiability of chemical reaction
networks is sometimes referred to as \the fundamental dogma of chemical kinetics" [Craciun and Pantea,
2008]. In computer science, this is framed as the relationship among processes that simulate one another
[Van der Schaft, 2004]. Finally, the eld of inverse problems studies those cases in which, despite the existence
of a theoretical one-to-one mapping between a model and behavior, tiny amounts of noise make inference
problems nonidentiable in practice [Petrov and Sizikov, 2005].
Nonidentiability is a major barrier to mechanistic understanding of real systems, but viewed from
another angle, this concept can provide a starting point for thinking about externally equivalent systems {
systems that evolution can explore, so long as the parameters and structures can be realized biologically.
40
These functional symmetries manifest in convergent and parallel evolution, as well as developmental system
drift: the observation that macroscopically identical phenotypes in even very closely related species can in
fact be divergent at the molecular and sequence level [Kimura, 1981, True and Haag, 2001, Tanay et al.,
2005, Tsong et al., 2006, Hare et al., 2008b, Lavoie et al., 2010, Vierstra et al., 2014, Matsui et al., 2015,
Dalal et al., 2016, Dalal and Johnson, 2017].
The main purpose of this paper is to explore this general idea in the concrete framework of linear
systems theory, thought of as modeling gene regulatory networks. First, we apply results from system
theory which give an analytical description of the set of all linear gene network architectures that yield
identical phenotypes. This suggests that, quite generally, there are many directions in which a gene network
can drift without changing the phenotype. Since these phenotypically equivalent gene networks are not
necessarily compatible with one another, system drift may result in reproductive incompatibility between
isolated populations, even in the absence of any sort of adaptive, or environmental change. Does this neutral
system drift matter, i.e., lead to measurable consequences? To address this, we use quantitative genetic
theory to estimate how quickly reproductive incompatibility due to system drift manifests.
It is not a new observation that there is often more than one way to do the same thing, or that speciation
can be the result of (nearly) neutral processes. The potential for speciation has been analyzed in models
of traits under stabilizing selection determined additively by alleles at many loci [Wright, 1935, Barton,
1986, 1989, 2001], in related tness landscape models [Fra sse et al., 2016], and for pairs of traits that must
match but whose value is unconstrained [Sved, 1981]. It has also been shown that population structure
can allow long-term stable coexistence of incompatible genotypes encoding identical phenotypes [Phillips,
1996]. However, previous simulations of system drift in regulatory sequences [Tulchinsky et al., 2014] and
a regulatory cascade [Porter and Johnson, 2002] found rapid speciation under directional selection but only
equivocal support for speciation under models of purely neutral drift. The rate at which hybrid incompat-
ibility accumulates due to genetic drift creating segregation variance between isolated populations is fairly
well understood [Slatkin and Lande, 1994, Rosas et al., 2010, Chevin et al., 2014], but model assumptions
can strongly aect predictions, including whether variation is due to rare or common alleles [Slatkin and
Lande, 1994], and the shape of the tness landscape [Fra sse et al., 2016]. Our main aim is to provide a
concrete framework that can provide natural predictions of these model parameters across a general class of
models. Furthermore, tools from system theory allow analytical predictions to be made for large populations
with complex phenotypes that would be inaccessible to population simulations.
41
3.2 Results
We use a model of gene regulatory networks that describes the temporal dynamics of a collection of n co-
regulating molecules { such as transcription factors { as well as external or environmental inputs. We write
(t) for the vector ofn molecular concentrations at timet. The vector ofm \inputs" determined exogenously
to the system is denoted u(t), and the vector of ` \outputs" is denoted (t). The output is merely a linear
function of the internal state:
i
(t) =
P
j
C
ij
j
(t) for some matrixC. Since is what natural selection acts
on, we refer to it as the phenotype (meaning the \visible" aspects of the organism), and in contrast refer to
as the kryptotype, as it is \hidden" from direct selection. Although may depend on all entries of , it is
usually of lower dimension than, and we tend to think of it as the subset of molecules relevant for survival.
The dynamics are determined by the matrix of regulatory coecients, A, a time-varying vector of inputs
u(t), and a matrix B that encodes the eect of each entry of u on the elements of the kryptotype. The rate
at which the i
th
concentration changes is a weighted sum of the concentrations as well as the input:
_ (t) =A(t) +Bu(t)
(t) =C(t):
(3.1)
Furthermore, we always assume that (0) = 0, so that the kryptotype measures deviations from initial
concentrations. Here A can be any nn matrix, B any nm, and C any `n dimensional matrix, with
usually` andm less thann. We think of the system as the triple (A;B;C), which translates (time-varying)
m-dimensional inputu(t) into the`-dimensional output(t). Under quite general assumptions, we can write
the phenotype as
(t) =Ce
At
(0) +
Z
t
0
Ce
A(ts)
Bu(s)ds; (3.2)
which is a convolution of the input u(t) with the system's impulse response, which we denote as h(t) :=
Ce
At
B.
In terms of gene regulatory networks, A
ij
determines how the j
th
transcription factor regulates the i
th
transcription factor. IfA
ij
> 0, then
j
up-regulates
i
, while ifA
ij
< 0, then
j
down-regulates
i
. Thei
th
row of A is therefore determined by genetic features such as the strength of j-binding sites in the promoter
of genei, factors aecting chromatin accessibility near genei, or basal transcription machinery activity. The
form of B determines how the environment in
uences transcription factor expression levels, and C might
42
determine the rate of production of downstream enzymes. To demonstrate this approach, we apply it to
construct a simple gene network in Example 1 below.
Example 1 (An oscillator). For illustration, we consider an extremely simplied model of oscillating gene
transcription, as for instance is found in cell cycle control or the circadian rhythm. There are two genes,
whose transcript concentrations are given by
1
(t) and
2
(t), and gene-2 up-regulates gene-1, while gene-1
down-regulates gene-2 with equal strength. Only the dynamics of gene-1 are consequential to the oscillator
(perhaps the amount of gene-1 activates another downstream gene network). Lastly, both genes are equally
up-regulated by an exogenous signal. The dynamics of the system are described by
_
1
(t) =
2
(t) +u(t)
_
2
(t) =
1
(t) +u(t)
(t) =
1
(t):
In matrix form the system regulatory coecients are given as, A =
0 1
1 0
, B = [
1
1
], and C = [ 1 0 ]. Suppose
the input is an impulse at time zero (a delta function), and so its phenotype is equal to its impulse response,
(t) =h(t) = sint + cost:
The system and its dynamics are referred to in Figure 3.1. We return to the evolution of such a system
below.
1
2
input (u)
output ()
1
1
1 1
1
Figure 3.1: (Left) Diagram of the gene network in Example 1 , and (right) plot of the phenotype(t) against
time t .
43
3.2.1 Equivalent gene networks
As reviewed above, some systems with identical phenotypes are known to dier, sometimes substantially, at
the molecular level; systems with identical phenotypes do not necessarily have identical kryptotypes. How
many dierent mechanisms perform the same function?
Two systems are equivalent if they produce the same phenotype given the same input, i.e., have the same
input{output relationship. We say that the systems dened by (A;B;C) and (
A;
B;
C) are phenotypically
equivalent if their impulse response functions are the same: h(t) =
h(t) for all t 0. This implies that for
any acceptable input u(t), if (
u
(t);
u
(t)) and (
u
(t);
u
(t)) are the solutions to equation (4.1) of these two
systems, respectively, then
u
(t) =
u
(t) for allt 0:
In other words, phenotypically equivalent systems respond identically for any input.
One way to nd other systems phenotypically equivalent to a given one is by change of coordinates: if V
is an invertible matrix, then the systems (A;B;C) and (VAV
1
;VB;CV
1
) are phenotypically equivalent
because their impulse response functions are equal:
h(t) =Ce
At
B =CV
1
Ve
At
V
1
VB
=CV
1
e
VAV
1
t
VB =
Ce
At
B =
h(t):
(3.3)
However, not all phenotypically equivalent systems are of this form: systems can have identical impulse
responses without being coordinate changes of each other. In fact, systems with identical impulse responses
can involve interactions between dierent numbers of molecules, and thus have kryptotypes in dierent
dimensions altogether.
This implies that most systems have at least n
2
degrees of freedom, where recall n is the number of
components of the kryptotype vector. This is because for an arbitrary nn matrix Z, taking V to be the
identity matrix plus a small perturbation in the direction of Z above implies that movingA in the direction
of ZAAZ while also moving B in the direction of ZB and C in the direction ofCZ will leave the
phenotype unchanged to second order in the size of the perturbation. If the columns of B and the rows of
C are not all eigenvectors of A, then any such Z will result in a dierent system.
It turns out that in general, there are more degrees of freedom, except if the system is minimal { meaning,
informally, that it uses the smallest possible number of components to achieve the desired dynamics. Results
44
in system theory show that any system can be realized in a particular minimal dimension (the dimension of
the kryptotype,n
min
), and that any two phenotypically equivalent systems of dimension n
min
are related by
a change of coordinates. Since gene networks can grow or shrink following gene duplications and deletions,
these additional degrees of freedom can apply, in principle, to any system.
Kalman decomposition
Even if the system is not minimal, results from systems theory explicitly describe the set of all phenotypically
equivalent systems. We refer toN (A
0
;B
0
;C
0
) as the set of all systems phenotypically equivalent to the
system dened by (A
0
;B
0
;C
0
):
N (A
0
;B
0
;C
0
) =
(A;B;C) :Ce
At
B =C
0
e
A0t
B
0
fort 0
:
(3.4)
These systems need not have the same kryptotypic dimension n, but must have the same input and output
dimensions (` and m, respectively).
The Kalman decomposition, which we now describe informally, elegantly characterizes this set [Kalman,
1963, Kalman et al., 1969, Anderson et al., 1966]. To motivate this, rst note that the inputu(t) only directly
pushes the system in certain directions (those lying in the span of the columns of B). As a result, dierent
combinations of input can move the system in any direction that lies in what is known as the reachable
subspace. Analogously, we can only observe motion of the system in certain directions (those lying in the
span of the rows of C), and so can only infer motion in what is known as the observable subspace. The
Kalman decomposition then classies each direction in kryptotype space as either reachable or unreachable,
and as either observable or unobservable. Only the components that are both reachable and observable
determine the system's phenotype { that is, components that both respond to an input and produce an
observable output.
Concretely, the Kalman decomposition of a system (A;B;C) gives a change of basis P such that the
transformed system (PAP
1
;PB;CP
1
) has the following form:
PAP
1
=
2
6
6
6
6
6
6
6
4
A
r o
A
r o;ro
A
r o; r o
A
r o; ro
0 A
ro
0 A
ro; ro
0 0 A
r o
A
r o; ro
0 0 0 A
ro
3
7
7
7
7
7
7
7
5
;
45
and
PB =
2
6
6
6
6
6
6
6
4
B
r o
B
ro
0
0
3
7
7
7
7
7
7
7
5
(CP
1
)
T
=
2
6
6
6
6
6
6
6
4
0
C
T
ro
0
C
T
ro
3
7
7
7
7
7
7
7
5
:
The impulse response of the system is given by
h(t) =C
ro
e
Arot
B
ro
;
and therefore, the system is phenotypically equivalent to the minimal system (A
ro
;B
ro
;C
ro
).
This decomposition is unique up to a change of basis that preserves the block structure. In particular,
the minimal subsystem obtained by the Kalman decomposition is unique up to a change of coordinates.
This implies that there is no equivalent system with a smaller number of kryptotypic dimensions than the
dimension of the minimal system. It is remarkable that the gene regulatory network architecture to achieve
a given input{output map is never unique { both the change of basis used to obtain the decomposition
and, once in this form, all submatrices other than A
ro
, B
ro
, and C
ro
can be changed without aecting the
phenotype, and so represent degrees of freedom. (However, some of these subspaces may aect how the
system deals with noise.)
Note on implementation: The reachable subspace, which we denote byR, is dened to be the closure of
span(B) under applying A, and the unobservable subspace, denoted
O, is the largest A-invariant subspace
contained in the null space ofC. The four subspaces,r o,ro, r o, and ro are dened from these by intersections
and orthogonal complements { ro refers to the both reachable and observable subspace, while r o refers to
the unreachable and unobservable subspace, and similarly for ro and r o.
For the remainder of the paper, we interpretN as the neutral set in the tness landscape, along which a
large population will drift under environmental and selective stasis. Even if the phenotype is constrained and
remains constant through evolutionary time, the molecular mechanism underpinning it is not constrained
and likely will not be conserved.
Finally, note that ifB andC are held constant { i.e., if the relationships between environment, kryptotype,
and phenotype do not change { there are still usually degrees of freedom. The following Example 2 gives the
set of minimal systems equivalent to the oscillator of Example 1, that all share common B and C matrices.
46
The oscillator can also be equivalently realized by a three-gene (or larger) network, and will have even more
evolutionary degrees of freedom available, as in Figure 3.3.
Example 2 (All equivalent rewirings of the oscillator). The oscillator of Example 1 is minimal, and so any
equivalent system is a change of coordinates by an invertible matrix V . If we further require B and C to be
invariant then we need VB = B and CV = C. Therefore the following one-parameter family (A();B;C)
describes the set of all two-gene systems phenotypically equivalent to the oscillator:
A() =
1
+ 1
2
6
4
1
2( + 1) + 1
3
7
5 for 6=1:
The resulting set of systems are depicted in Figure 3.2.
1
2
input (u)
output ()
2
1
+1
1
+1
1 1
1
+1
+1
Figure 3.2: (Left) A(), the set of all phenotype-equivalent cell cycle control networks. (Right) Gene-1
dynamics (blue) for both systems A(0) and A(2) are identical, however, A(0) gene-2 dynamics (orange)
dier from A(2) (green).
47
1
2
3 input (u)
output ()
2:2
2
1
2:2
1
1
1
1
1
1
Figure 3.3: A possible non-minimal three-gene oscillator, phenotypically equivalent to A(), the systems in
Examples 1 and 2.
Sexual reproduction and recombination
Parents with phenotypically equivalent yet dierently wired gene networks may produce ospring with
dramatically dierent phenotypes. If the phenotypes are signicantly divergent then the ospring may be
inviable or otherwise dysfunctional, despite both parents being well adapted. If this is consistent for the
entire population, we would consider them to be separate species, in accord with the biological species
concept [Mayr, 2000].
First, we must specify how sexual reproduction acts on these systems. Suppose that each of a diploid
organisms' two genomes encodes a set of system coecients. We assume that a diploid which has inherited
systems (A
0
;B
0
;C
0
) and (A
00
;B
00
;C
00
) from its two parents has phenotype determined by the system that
averages these two, ((A
0
+A
00
)=2; (B
0
+B
00
)=2; (C
0
+C
00
)=2).
Each genome an organism inherits is generated by meiosis, in which both of its diploid parents recombine
their two genomes, and so anF
1
ospring carries one system copy from each parent, and anF
2
is an ospring
of two independently formed F
1
s. If the parents are from distinct populations, these are simply rst{ and
second{generation hybrids, respectively.
Exactly how the coecients (i.e., entries ofA,B andC) of a haploid system inherited by an ospring from
her diploid parent are determined by the parent's two systems depends on the genetic basis of any variation
in the coecients. Thanks to the randomness of meiotic segregation, the result is random to the extent
that each parent is heterozygous for alleles that aect the coecients. Since the i
th
row of A summarizes
48
how each gene regulates gene i, and hence is determined by the promoter region of gene i, the elements of a
row of A tend to be inherited together, which will create covariance between entries of the same row. It is,
however, a quite general observation that the variation seen among recombinant systems is proportional to
the dierence between the two parental systems.
Ospring formed from two phenotypically identical systems do not necessarily exhibit the same phenotype
as both of its parents { in other wordsN , the set of all systems phenotypically equivalent to a given one,
is not, in general, closed under averaging or recombination. If sexual recombination among systems drawn
fromN yields systems with divergent phenotypes, populations containing signicant diversity inN can carry
genetic load, and isolated populations may fail to produce hybrids with viable phenotypes.
0.05
0.05
0.1
0.1
0.15
0.15
0.2
0.2
0.25
0.25
0.3
0.3
0.35
0.35
0.4
0.45
0.5
● ● ● ● ● ● ● 0.05
0.05
0.1
0.1
0.15
0.15
0.2
0.2
0.25
0.25
0.3
0.3
0.35
0.35
0.4
0.45
0.5
● ● ●● ● ● ● parental
F1
F2
Figure 3.4: A conceptual gure of the tness consequences of hybridization: axes represent system coe-
cients (i.e., entries ofA); the line of optimal system coecients is down in black; dotted lines give phenotypic
distances to the optimum. A pair of parental populations are shown in black, along the optimum; a hy-
pothetical population of F
1
s are shown in red, and the distribution of one type of F
2
is shown in purple
(other types of F
2
are not shown; some would be a similar distance to the other side of the optimal set).
The distribution of F
2
hybrids is appropriate for mixed homozygotes if both traits have a simple, one-locus
genetic basis, but there is variation within each population at that locus.
3.2.2 Hybrid incompatibility
Two parents with the optimal phenotype can produce ospring whose phenotype is suboptimal if the parents
have dierent underlying systems. How quickly do hybrid phenotypes break down as genetic distance between
parents increases? We will quantify how far a system's phenotype is from optimal using a weighted dierence
between impulse response functions. Suppose that(t) is a nonnegative, smooth, square-integrable weighting
49
function, h
0
(t) is the optimal impulse response function and dene the \distance to optimum" of another
impulse response function to be
D(h) =
Z
1
0
(t)kh(t)h
0
(t)k
2
dt
1=2
: (3.5)
Consider reproduction between a parent with system (A;B;C) and another displaced by distance in the
direction (X;Y;Z), i.e., having system (A +X;B +Y;C +Z). We assume both are \perfectly adapted"
systems, i.e., having impulse response functionh
0
(t), and their ospring has impulse response functionh
(t).
A Taylor expansion ofD(h
) in shows that the phenotype of anF
1
hybrid between these two is at distance
proportional to
2
from optimal, while F
2
hybrids are at distance proportional to . This is because an F
1
hybrid has one copy of each parental system, and therefore lies directly between the parental systems (see
Figure 3.4) { the parents both lie inN , which is the valley dened by D, and so their midpoint only diers
from optimal due to curvature ofN . In contrast, anF
2
hybrid may be homozygous for one parental type in
some coecients and homozygous for the other parental type in others; this means that each coecient of an
F
2
may be equal to either one of the parents, or intermediate between the two; this means that possible F
2
systems may be as far from the optimal set,N , as the distance between the parents. The precise rate at which
the phenotype of a hybrid diverges depends on the geometry of the optimal setN relative to segregating
genetic variation. For an illustration of hybrid misregulation due to system drift, consult Example 3 below.
Example 3 (Hybrid incompatibility: misregulation due to system drift). Ospring of two equivalent systems
from Example 2 can easily fail to oscillate. For instance, the F
1
ospring between homozygous parents at
= 0 and =2 has phenotype
F1
(t) = e
t
, rather than (t) = sint + cost. However, the coecients
of these two parental systems dier substantially, probably more than would be observed between diverging
populations. In Figure 3.5 we compare the phenotypes for F
1
and F
2
hybrids between more similar parents,
and see increasingly divergent phenotypes as the dierence between the parental systems increases. (In this
example, the coecients of A() dier from those of A(0) by an average factor of 1 +=2; such small
dierences could plausibly be caused by changes to promoter sequences.) This divergence is quantied in
Figure 3.6, which shows that mean distance to optimum phenotype of the F
1
andF
2
hybrid ospring between
A(0) and A() increases with
2
and , respectively.
50
0 5 10 15 20 25 30
−6 −4 −2 0 2 4 6
Phenotype
t
h(t)
0 5 10 15 20 25 30
−6 −4 −2 0 2 4 6
Phenotype
t
h(t)
0 5 10 15 20 25 30
−6 −4 −2 0 2 4 6
Phenotype
t
h(t)
0 5 10 15 20 25 30
−6 −4 −2 0 2 4 6
Phenotype
t
h(t)
0 5 10 15 20 25 30
−6 −4 −2 0 2 4 6
Phenotype
t
h(t)
0 5 10 15 20 25 30
−6 −4 −2 0 2 4 6
Phenotype
t
h(t)
Figure 3.5: (left) Phenotypes of F
1
hybrids between an A(0) parent and, top-to-bottom, an A(
1
=100), an
A(
1
=10), and A(
1
=2) parent; parental coecients dier by around 0.5%, 5%, and 25% respectively. Parental
phenotypes (sint+cost) are shown in solid black, and hybrid phenotypes in dashed blue. (right) Phenotypes
of all possibleF
2
hybrids between the same set of parents, with parental phenotype again in black. Dierent
colored lines correspond to dierent F
2
hybrids; note that some completely fail to oscillate.
51
0.0 0.2 0.4 0.6 0.8
0.0 0.5 1.0 1.5 2.0
Genetic distance
Phenotypic distance, D
F1 hybrids
F2 hybrids
Figure 3.6: Mean hybrid phenotypic distance from optimum computed with equation (3.5), using (t) =
exp(t=4) forF
1
(black) andF
2
(blue) hybrids betweenA(0) andA() parent oscillators. Genetic distance
is computed as
P
ij
(A
ij
(0)A
ij
())
2
1=2
.
3.2.3 Haldane's rule
This model naturally predicts Haldane's rule, the observation that if only one hybrid sex is sterile or inviable
it is likely the heterogametic sex (e.g., the male in XY sex determination systems) [Haldane, 1922, Orr,
1997]. For example, consider an XY species with a two-gene network where the rst gene resides on an
autosome and the second gene on the X chromosome. A male whose pair of haplotypes is
[
A1 A2
];
A1 A2
X1 X2
has phenotype determined by A =
A1 A2
X1 X2
, if dosage compensation up-regulates heterogametes by a factor
of two relative to homogametes (as with Drosophila), while a female homozygous for the haplotype
h
A1
A2
X1
X2
i
,
has phenotype determined by A =
h
A1
A2
X1
X2
i
. An F
1
male ospring of these two will have its phenotype
determined by
h
(A1+
A1)=2 (A2+
A2)=2
X1
X2
i
. If both genes resided on the autosomes, this system would only be
possible in anF
2
cross. More generally, if the regulatory coecients for a system are shared between the sex
and one or more autosomal chromosomes,F
1
males are eectively equivalent to purely autosomal-systemF
2
hybrids, and recall that F
2
s are signicantly less t on average than F
1
s (see Figure 3.6). Although many
alleles will be dominant if the phenotype{tness relationship is convex, the underlying mechanism does not
depend on the dominance theory Turelli and Orr [1995] to explain Haldane's rule: instead, it derives from
the nature of segregation variance.
52
3.3 System drift and the accumulation of incompatibilities
Thus far we have seen that many distinct molecular mechanisms can realize identical phenotypes and that
these mechanisms may fail to produce viable hybrids. Does evolution shift molecular mechanisms fast
enough to be a signicant driver of speciation? To approach this question, we explore a general quantitative
genetic model in which a population drifts stochastically near a set of equivalent and optimal systems due
to the action of recombination, mutation, and demographic noise. Although this is motivated by the results
on linear systems above, the quantitative genetics calculations are more general, and only depend on the
presence of genetic variation and a continuous set of phenotypically equivalent systems.
We will suppose that each organism's phenotype is determined by its vector of coecients, denoted by
x = (x
1
;x
2
;:::;x
p
), and that the corresponding tness is determined by the distance of its phenotype to
optimum. The optimum phenotype is unique, but is realized by many distinct x { those falling in the
\optimal set"N . The phenotypic distance to optimum of an organism with coecients x is denoted D(x).
In the results above, x = (A;B;C) and D(x) is given by equation (3.5). The tness of an organism with
coecientsx will be exp(D(x)
2
). We assume that in the region of interest, the map D is smooth and that
we can locally approximate the optimal setN as a quadratic surface. As above, an individual's coecients
are given by averaging its parentally inherited coecients and adding random noise due to segregation and
possibly new mutation. Concretely, we use the innitesimal model for reproduction [Barton et al., 2017]
{ the ospring of parents at x and x
0
will have coecients (x +x
0
)=2 +", where " is a random Gaussian
displacement due to random assortment of parental alleles.
This tness landscape is locally Gaussian with a rank-decient covariance matrix. Since we allow for
substantial genetic variation within populations, this model falls in the same class as Lande [1981] and Lande
and Arnold [1983], which did not consider reproductive incompatibility. Substantial work on speciation has
been done under the assumption of a monomorphic population whose trait is shifted by sequential xation
of alleles, i.e., Fisher's \geometric model" [Fisher, 1930, Poon and Otto, 2000]. This work has included both
stationary optima (like we study) and moving optima (i.e., adaptation) [e.g., Barton, 2001, Chevin et al.,
2014]. Martin [2014] derived this model from a few general assumptions, and used random matrix theory
to calculate the distribution of tness eects. Chevin et al. [2014] studied a general version with neutral
directions, and found that the rate of accumulation of reproductive isolation decreases with N
e
with a form
that depends on trait dimension. Fra sse et al. [2016] showed that most recognized empirical patterns in
the speciation literature could be explained by this model (although best if tness took the form exp(d
k
)
for some k > 2, where d is distance to the optimum), and Simon et al. [2017] further compared predictions
53
from the model to empirical data. In the context of a stationary optimum, the primary contribution to
hybrid untness is segregation variance, i.e., greater phenotypic variance in F
2
hybrids than in parentals
due to drift having changed the genetic basis of the trait separately in the parental populations (Chevin
et al. [2014] refers to these as \transgressive incompatibilities"). This turns out to be the main source of
incompatibilities in the model we study as well, even though in principle curvature of the optimal set also
contributes (as also seen in Rosas et al. [2010]). Our analysis relies on local approximations; on longer time
scales the appropriate model may share properties with the \holey" landscapes of Gavrilets [2004].
As our goal here is to sketch out the rough implications of our main results, keeping the assumptions
clearly visible, we provide relatively rough arguments, rather than presenting calculations in full multivariate
generality.
3.3.1 System drift
We work with a randomly mating population of eective size N
e
. If the population variation has standard
deviation in a particular direction, since subsequent generations resample from this diversity, the population
mean coecient will move a random distance of size =
p
N
e
per generation, simply because this is the
standard deviation of the mean of a random sample [Lande, 1981]. Selection will tend to restrain this
motion, but movement along the optimal setN is unconstrained, and so we expect the population mean
to drift along the optimal set like a particle diusing. The amount of variance in particular directions in
coecient space depends on constraints imposed by selection and correlations between the genetic variation
underlying dierent coecients (theG matrix [Arnold et al., 2008]). It therefore seems reasonable to coarsely
model the time evolution of population variation in regulatory coecients as a \cloud" of width about the
population mean, which moves as an unbiased Brownian motion through the set of network coecients that
give the optimal phenotype.
Next, we calculate with some simplifying assumptions to give the general idea; multivariate derivations
appear in Appendix 3.A. There will in general be dierent amounts of variation in dierent directions;
to keep the discussion intuitive, we only discuss
N
, the amount of variation in \neutral" directions (i.e.,
directions alongN ), and
S
, the amount of variation in \selected" directions (perpendicular toN ). The
other relevant scale we denote by
, which is the scale on which distance to phenotypic optimum changes
as x moves away from the optimal set,N . Concretely,
is 1=(
d
du
D(x +uz)) where x is optimal and z is
a \selected" direction perpendicular toN . With these parameters, a typical individual will have a tness
of around exp((
S
=
)
2
). Of course, there are in general many possible neutral and selected directions; we
54
take
to be an appropriate average over possible directions.
Hybridization
The means of two allopatric populations each of eective size N
e
separated for T generations will be a
distance roughly of order 2
N
p
T=N
e
apart alongX . (Consult Figure 3.4 for a conceptual diagram.) A
population of F
1
hybrids has one haploid genome from each, whose coecients are averaged, and so will
have mean system coecients at the midpoint between their means. The distribution of F
2
hybrids will
have mean at the average of the two populations, but will have higher variance [Wright, 1968, Barton,
2001, Rosas et al., 2010]. This \segregation variance" of F
2
hybrids can be shown to increase linearly with
the square of the distance between parental population means under models of both simple and polygenic
traits. This is suggested by Figure 3.4 and shown by Slatkin and Lande [1994] (also see Appendix 3.B for
a dierent derivation). Concretely, we expect the population of F
1
s to have variance
2
S
in the selected
direction (the same as within each parental population), but the population of F
2
hybrids will have variance
2
S
+ 4!
2
N
T=N
e
, where! is a factor that depends on the genetic basis of the coecients. If the optimal set
N has dimension q, using the polygenic model of appendix 3.B, ! is proportional to the number of degrees
of freedom: ! = (pq)=8. If each trait is controlled by a single locus, as in Figure 3.4, the value is similar.
What are the tness consequences? A population of F
2
hybrids will begin to be substantially less t
than the parentals once they dier from the optimum by a distance of order
, i.e., once
p
4!T=N
e
=
N
.
This implies that hybrid incompatibility among F
2
hybrids should appear much slower { on a time scale of
N
e
(
=
N
)
2
=(4!) generations. The F
1
s will not suer tness consequences until the hybrid mean is further
than
from the optimum; as suggested by Figure 3.4, Taylor expanding D
2
along the optimal set implies
that this deviation of the mean from optimum grows with the square of the distance between the parental
populations, and so we expect tness costs in F
1
s to appear on a time scale of N
2
e
generations.
For a more concrete prediction, suppose that the distribution among hybrids is Gaussian. A population
whose trait distribution is Gaussian with mean and variance , has mean tness
Z
1
1
1
p
2
e
(x)
2
2
2
e
x
2
2
2
dx =
s
1
1 +
2
=
2
exp
2
2
1
1 +
2
=
2
: (3.6)
This assumes a single trait, for simplicity. A population of F
2
hybrids will have, as above, variance
2
=
2
S
+ 4!
2
N
T=N
e
. The mean diverges with the square of the distance between the parentals, so we set
=c
T=N
e
, where c
is a constant depending on the local geometry of the optimal set. The mean tness
55
in parental populations is as in Equation 3.6 with = 0 and =
S
. This implies that if we deneF
2
(T )
to be the mean relative tness amongF
2
hybrids between two populations separated byT generations, (i.e.,
the mean tness divided by the mean tness of the parents) then
F
2
(T ) =
1 +
4!(
N
=
)
2
(1 + (
S
=
)
2
)
T
N
e
1=2
exp
(
c
T
N
e
2
1
1 + (
S
=
)
2
+ 4!(
N
=
)
2
T=N
e
)
: (3.7)
If each of the q selected directions acts independently, the drop in tness will beF
2
(T )
q
; the expression for
the correlated, multivariate case is given in Appendix 3.A.1. We discuss the implications of this expression
in the next section.
Speciation rates under neutrality
Equation (3.7) describes how fast hybrids become inviable as the time that the parental populations are
isolated increases; what does this tell us about speciation rates under neutrality? From equation (3.7) we
observe that time is always scaled in units of N
e
generations, the population standard deviations are always
scaled by
, and the most important term is the rate of accumulation of segregation variance, 4!(
N
=
)
2
. All
else being equal, this process will lead to speciation more quickly in smaller populations and in populations
with more neutral genetic variation (larger
N
). These parameters are related { larger populations generally
have more genetic variation { but since these details depend on the situation, we leave these separate.
How does this prediction depend on the system size and constraint? If there are p trait dimensions,
constrained inq dimensions, and if! is proportional topq, then the rate thatF
2
tness drops is, roughly,
(1 + 4(p q)KT=N
e
)
q=2
/ q(p q), where K is a constant. Both degree of constraint and number
of available neutral directions aect the speed of accumulation of incompatibilities { more unconstrained
directions allows faster system drift, but more constrained directions implies greater tness consequences of
hybridization. However, note that in real systems, it is likely that
also depends on p and q.
Now we will interpret equation (3.7) in three situations plausible for dierent species, depicting how
hybrid tness drops as a function ofT=N
e
in Figure 3.7. In all cases, the tness drop forF
1
hybrids is much
smaller than that of F
2
hybrids, so we work only with the rst (square-root) term in equation (3.7).
Suppose in a large, genetically diverse population, the amount of heritable variation in the neutral and
selected directions are roughly equal (
N
S
) but the overall amount of variation is (weakly) constrained
by selection (
N
). If so, then the rst term of equation (3.7) is 1=
p
1 + 2!T=N
e
1!T=N
e
. If also
! = 1, then, for instance, after 0:1N
e
generations the average F
2
tness has dropped by 10% relative to the
56
parentals.
Consider instead a much smaller, isolated population whose genetic variation is primarily constrained
by genetic drift, so that
N
S
. Setting a = (
N
=
)
2
to be small, the tness of F
2
hybrids is
F
2
1=
p
1 + 4!aT=N
e
1 2!aT=N
e
. Hybrid tness seems to drop more slowly in this case in Figure
3.7, but since time is scaled by N
e
, so speciation may occur faster than in a large population. However, at
least in some models [Lynch and Hill, 1986], in small populations at mutation-drift equilibrium the amount
of genetic variance (
2
N
) is proportional to N
e
, which would compensate for this dierence, perhaps even
predicting the rate of decrease of hybrid tness to be independent of population size for small populations.
In the other direction, consider large metapopulations (or a \species complex") among which heritable
variation is strongly constrained by selection (i.e., there is substantial recombination load), so that
S
but
N
=
is large. Then the tness of F
2
hybrids isF
2
1=
p
1 + 2!aT=N
e
1!aT=N
e
, and could be
extremely rapid if a is large.
For instance, between two populations of one million organisms that has 10 generations per year (a
drosophilid species, perhaps) under the \large population" scenario of Figure 3.7A, system drift would lead
to a substantial tness drop of around 10% in F
2
hybrids in only 10,000 years. This drop may be enough to
induce evolutionary reinforcement of reproductive isolation. If one thousand of these organisms is isolated
(perhaps on an island, as in Figure 3.7B), then a similar drop could occur in around 120 years. On the other
hand, if the population is one of several of similar size that have recently come into secondary contact after
population re-expansion, the situation may be similar to that of Figure 3.7C withN
e
= 10
6
, and so the same
drop could occur after 1,100 years. (However, hyperdiverse populations of this last type may not be stable
on these time scales.)
57
0 1 2 3 4 5
0.0 0.2 0.4 0.6 0.8 1.0
large population
T N
e
F2 fitness
0 1 2 3 4 5
0.0 0.2 0.4 0.6 0.8 1.0
small population
T N
e
F2 fitness
0 1 2 3 4 5
0.0 0.2 0.4 0.6 0.8 1.0
metapopulation
T N
e
F2 fitness
F2 fitness
F1 fitness
Figure 3.7: Mean drop of F
1
and F
2
tness relative to parental species, with ! = 1 and (A)
2
N
=
2
S
=
2
(B)
2
N
=
2
S
= 0:1
2
(C) 0:1
2
N
=
2
S
=
2
. The F
2
tness is from equation (3.7), and the F
1
tness
is determined only by the exponential term of that equation, which is small relative to that of increased
variance.
3.3.2 Genetic variation in empirical regulatory systems
What is known about the key quantity above, the amount of heritable variation in real regulatory networks?
The coecient A
ij
from the system (4.1) measures how much the rate of net production of i changes per
change in concentration ofj. It is generally thought that regulatory sequence change contributes much more
to inter- and intraspecic variation than does coding sequence change aecting molecular structure [Schmidt
et al., 2010]. In the context of transcription factor networks this may be aected not only by the binding
strength of molecule j to the promoter region of gene i but also the eects of other transcription factors
(e.g., cooperativity) and local chromatin accessibility [Steova et al., 2013]. For this reason, the mutational
target size for variation in A
ij
may be much larger than the dozens of base pairs typically implicated in the
handful of binding sites for transcription factorj of a typical promoter region, and single variants may aect
many entries ofN simultaneously.
Variation in binding site occupancy may overestimate variation in A, since it does not capture buering
eects (if for instance only one site of many needs to be occupied for transcription to begin), and variation
in expression level measures changes in steady-state concentration (our
i
) rather than the rate of change.
Nonetheless, these measures likely give us an idea of the scale of variability. It has been shown that between
58
human individuals, there is dierential occupancy in 7.5% of binding sites of a transcription factor (p65)
[Kasowski et al., 2010]. It has also been inferred that cis-regulatory variation accounts for around 2{6% of
expression variation in human blood-derived primary cells [Verlaan et al., 2009], and that human population
variation explained about 3% of expression variation [Lappalainen et al., 2013]. Allele-specic expression is
indicative of standing genetic cis-regulatory variation; allele-specic expression in 7.2{8.5% of transcripts of
a
ycatcher species has been observed [Wang et al., 2017], as well as allele-specic expression in 23.4% of
genes studied in a baboon species [Tung et al., 2015]. Taken together, this suggests that variation in the
entries of A may be on the order of at least a few percent between individuals of a population { doubtless
varying substantially between species and between genes.
3.4 Discussion
In this paper, we use tools from linear system theory and quantitative genetics to study the evolution
of a mechanistic model of the genotype-phenotype map, in which the phenotype is subject to stabilizing
selection. In so doing, we provide an explicit model of phenogenetic drift [Weiss and Fullerton, 2000] and
developmental system drift [True and Haag, 2001]. In this context, the Kalman decomposition [Kalman,
1963] gives an analytical description of all phenotypically equivalent gene networks. It also implies that nearly
all systems are nonidentiable, and that, in general, there exist axes of genetic variation unconstrained by
natural selection. The independent movement of separated populations along these axes by genetic drift can
lead to a signicant reduction in hybrid viability, and thus precipitate speciation, at a speed dependent on
the eective population size and the amount of genetic variation. In this model, at biologically reasonable
parameter values, system drift is a signicant { and possibly rapid { driver of speciation. This may be
surprising because hybrid inviability appears as a consequence of recombining dierent, yet functionally
equivalent, mechanisms, and since species are often dened by their unique adaptations or morphologies.
Consistent with empirical observation of hybrid breakdown (e.g., Pl otner et al. [2017]), we see that the
tnesses of F
2
hybrids drop at a much faster rate than those of F
1
s. Another natural consequence of the
model is Haldane's rule, that if only oneF
1
hybrid sex is inviable or sterile it is likely to be the heterogametic
sex. This occurs because if the genes underlying a regulatory network are distributed among both autosomes
and the sex chromosome, then heterogametic F
1
s show variation (and tnesses) similar to that seen in F
2
hybrids.
Is there evidence that this is actually occurring? System drift and network rewiring has been inferred
59
across the tree of life [Wotton et al., 2015, Crombach et al., 2016b, Dalal and Johnson, 2017, Johnson, 2017,
Ali et al., 2017], and there is often signicant regulatory variation segregating within populations. Transcrip-
tion in hybrids between closely related species with conserved transcriptional patterns can also be divergent
[Haerty and Singh, 2006, Maheshwari and Barbash, 2012, Coolon et al., 2014, Michalak and Noor, 2004,
Mack and Nachman, 2016], and hybrid incompatibilities have been attributed to cryptic molecular diver-
gence underlying conserved body plans [Gavin-Smyth and Matute, 2013]. Furthermore, in cryptic species
complexes (e.g., sun skinks [Barley et al., 2013]), genetically distinct species may be nearly morphologically
indistinguishable.
3.4.1 Fisher's geometric model
Substantial analytical work on speciation has been done using Fisher's geometric model, which can also pre-
dict Haldane's rule and the regular increase of incompatibility with genetic distance [Barton, 2001, Fra sse
et al., 2016, Simon et al., 2017]. The model is similar to the quantitative genetics model we use to predict the
rate of speciation, although work on Fisher's geometric model generally only models substitutions between
populations, neglecting within-species polymorphism. Indeed, our model may provide an additional mecha-
nistic context in which Fisher's geometric model is appropriate, although the typical degree of pleiotropy and
stability of the G-matrix are unknown in practice. Our observations argue for the importance of including
neutral directions in these models, which is not usually done (but see Chevin et al. [2014]).
Although our main focus is on the implications of system theory, rather than on the quantitative genetics
modeling, it is interesting to compare analytic results. Chevin et al. [2014] nds that mean tness decrease
of hybrids after T generations grows proportionally to mN
1m=2
e
T , for large N
e
; while our corresponding
rate is mT=N
e
. This dierence in time scaling is substantial, especially for large populations, and seems to
appear because Chevin et al. [2014] uses the results of Sella and Hirsh [2005] to describe how the steady-state
tness and substitution process depends on trait dimensionality (the tness peak is harder to nd in higher
dimensions). We do not incorporate this eect because genetic variation may be provided by other sources,
but if we did, it would enter through our
S
.
3.4.2 The origin of species not by means of natural selection?
As classically formulated, the Dobzhansky-Muller model of hybrid incompatibility is agnostic to the relative
importance of neutral versus selective genetic substitutions [Coyne and Orr, 1998], and plausible mechanisms
have been proposed whereby Dobzhansky{Muller incompatibilities could originate under neutral genetic drift
60
[Lynch and Force, 2000b] or stabilizing selection [Fierst and Hansen, 2009]. The same holds for the \pathway
model" [Lindtke and Buerkle, 2015], which is closer to the situation here. However, previous authors have
argued that neutral processes are likely too slow to be a signicant driver of speciation [Nei et al., 1983,
Seehausen et al., 2014]. This has led some to conclude that hybrid incompatibility is typically a byproduct
of positive selection [Orr et al., 2004, Schluter, 2009] or a consequence of genetic con
ict [Presgraves, 2010,
Crespi and Nosil, 2013], two processes that typically act much more rapidly than genetic drift. However, our
calculations suggest that even under strictly neutral processes, hybrid tness breaks down as a function of
genetic distance rapidly enough to play a substantial role in species formation across the tree of life. This is
consistent with broad patterns such as the relationship between molecular divergence and genetic isolation
seen by Roux et al. [2016], and the clocklike speciation rates observed by Hedges et al. [2015].
Neutral processes are certainly not the only drivers of speciation. All of these forces { adaptive shifts,
con
ict, and network drift { are plausible drivers of speciation, and may even interact. Many of our obser-
vations carry over to models of directional selection { for instance, rapid drift along the set of equivalent
systems could be driven by adaptation in a dierent, pleiotropically coupled system. Or, reinforcement due
to local adaptation might provide a selective pressure that speeds up system drift. Furthermore, while the
tness consequences of incompatibility in any one given network may be small, the cumulative impact of
system drift across the many dierent networks an organism relies on may be substantial. It remains to be
seen how the relative strengths of these forces compare.
3.4.3 Nonlinearity and model assumptions
Of course, real regulatory networks are not linear dynamical systems. Most notably, physiological limits put
upper bounds on expression levels, implying saturating response curves. It remains to be seen how well these
results carry over into real systems, but the fact that most nonlinear systems can be locally approximated
by a linear one suggests our qualitative results may hold more generally. Furthermore, nonidentiability
(which implies the existence of neutral directions) is often found in practice in moderately complex models
of biological systems [Gutenkunst et al., 2007, Piazza et al., 2008].
This simple quantitative genetics model we use above has been shown to produce good predictions in
many situations, even when the substantial number of simplifying assumptions are violated [B urger and
Lande, 1994, Turelli and Barton, 1994]. The calculations above should be fairly robust even to substantial
deviations from normality. A larger eect on these predictions seems likely due to correlations due to
molecular constraint, genetic linkage, population structure, historical contingency and so forth. Although
61
such considerations would not change the qualitative predictions of this model, their combined eects could
substantially change the predicted rate of accumulation of incompatibilities.
Finally, despite our model's precise separation of phenotype and kryptotype, this relationship in nature
may be far more complicated as aspects of the kryptotype may be less \hidden" than we currently assume.
For instance, attributes excluded from the phenotype as modelled here, ignore the potential energy costs
associated with excessively large (non-minimal) kryptotypes, as well as the relationship between a specic
network architecture and robustness to mutational, transcriptional, or environmental noise. More precise
modeling will require better mechanistic understanding not only of biological systems, but also the nature
of selective pressures and genetic variation in these systems.
Chapter 3 Appendix
3.A Genetic drift with a multivariate trait
For completeness, we provide a brief exposition of how a population evolves due to genetic drift with a
quantitative genetics model, as in Lande [1981] or Hansen and Martins [1996]. These do not directly model
underlying genetic basis, but developing a more accurate model is beyond the scope of this paper.
Suppose that the population is distributed in trait space as a Gaussian with covariance matrix and mean
, whose density we write as f(; ;). Selection has the eect of multiplying this density by the tness
function and renormalizing, so that if expected tness of x is proportional to exp(kLxk
2
=2), then the
distribution post-selection has density at x proportional to f(x; ;) exp(kLxk
2
=2). By the computation
below (\Completing the square"), the result is a Gaussian distribution with covariance matrix (
1
+L
T
L)
1
and mean (
1
+L
T
L)
1
1
.
After selection, we have reproduction: suppose this occurs as in the innitesimal model [Barton et al.,
2017], so that each ospring of parents with traits x and y is drawn independently from a Gaussian distri-
bution with mean (x +y)=2 and covariance matrixR. Here,R is the contribution of \segregation variance",
i.e., random choices of parental alleles. If
e
= (
1
+L
T
L)
1
is the covariance matrix of the parents post-
selection, then the distribution of ospring will again be Gaussian, with mean equal to that of the parents
and covariance matrix
e
=2 +R.
In summary, a generation under this model modies the mean () and covariance matrix () of a
62
population as follows:
7!
0
= (
1
+L
T
L)
1
1
7!
0
=
1
2
(
1
+L
T
L)
1
+R:
What measures are stable under this transformation? The condition =
0
reduces to L
T
L = 0; if we
assume R and therefore are of full rank, then this happens if and only if is in the null space of L,
i.e., if lies in a neutral direction. The condition
0
= can also be solved, at least numerically. After
rearrangement, it reduces to L
T
L + (I=2RL
T
L) =R. Importantly, the mean does not aect either
how the covariance matrix moves, or its stable shape.
Above we have described the expected motion of the mean and covariance. However, random resampling
will introduce noise. Suppose that a population ofN individuals behaves approximately as described above.
By the above, we may expect that the covariance matrix stays close to a constant value , computed from
R andL as above, so that we need only consider motion of the mean, . Since we take a sample of sizeN to
construct the next generation, the next generation's mean is drawn from a Gaussian distribution with mean
0
and covariance matrix =N. Dening = (I (I + L
T
L)
1
), this can be written as
0
= +=
p
N;
where is a multivariate Gaussian with mean zero and covariance matrix . Let(k) denote the mean in the
k
th
generation, and suppose that diers from optimal by something of order 1=
p
N: if(t) =
p
N(t
p
N)
is the rescaled process, then the previous equation implies that as N !1, in the limit solves the It^ o
equation
d(t) = (t)dt +
1=2
dW (t);
where now W (t) is a multivariate white noise. This has an explicit solution as a multivariate Ornstein-
Uhlenbeck process:
(t) =e
t
(0) +
Z
t
0
e
(ts)
1=2
dW (s):
63
The asymptotic variance of this process in the direction z is
lim
t!1
Var[(t)z] =
Z
1
0
z
T
e
s
e
s
zds; (3.8)
which is innite if and only if z = 0, which occurs if and only if Lz = 0. This implies that at equilibrium,
population mean trait values lie away from the optimal set by a Gaussian displacement of order 1=
p
N with
a covariance matrix given by equation (3.8).
Completing the square First note that if A is symmetric,
(xy)
T
A(xy) =x
T
A (x 2y) +y
T
Ay;
and so if B is also symmetric and A +B is invertible,
(xy)
T
A(xy) +x
T
Bx =x
T
(A +B)
x 2(A +B)
1
Ay
+y
T
Ay
=
x (A +B)
1
Ay
T
(A +B)
x (A +B)
1
Ay
+y
T
Ayy
T
A
T
(A +B)
1
Ay:
Therefore, by substituting A =
1
and B =L
T
L,
f(x; ;y) exp(x
T
L
T
Lx=2)
R
f(z; ;y) exp(z
T
L
T
Lz=2)dz
=f(x; (
1
+L
T
L)
1
; (
1
+L
T
L)
1
1
y):
3.A.1 Gaussian load
Suppose that a population has a Gaussian distribution in d-dimensional trait space with mean and co-
variance matrix , and that tness of an individual at x is exp(kLxk
2
=2). Then, completing the square as
64
above with A =
1
, y =, and B =L
T
L, and dening Q = (
1
+L
T
L)
1
,
1
p
2
n
det()
1=2
Z
e
1
2
x
T
1
x
e
1
2
x
T
L
T
Lx
dx
=
1
p
2
n
det()
1=2
Z
e
1
2
(xQ
1
)
T
Q
1
(xQ
1
)
dx
e
T
(I
1
Q)
1
=
s
det(Q)
det()
exp
T
I
1
Q
1
=
s
1
det() det(
1
+L
T
L)
exp
T
I (I +L
T
L)
1
1
This is the product of two terms: the rst gives the drop in tness due to segregation variance, and the
second the drop due to a shift in mean away from the optimum. A similar decomposition appears in Barton
[2001] and Chevin et al. [2014], but for mean log tness.
Now suppose that =
2
I and L =I=
. Then,
s
1
det() det(
1
+L
T
L)
=
s
1
2d
(1=
2
+ 1=
2
)
d
=
1
(1 + (=
)
2
)
d=2
:
Also,
I (I +L
T
L)
1
1
=
1
2
1 (1 + (=
)
2
)
1
I
=
1
2
1
(1 + (=
)
2
)
I
3.B Evolution of segregation covariance
The description above does not describe how two diverging populations interact, since the amount of segre-
gation variance, quantied by R, will not stay constant. This has been quantied in various models before
[e.g., Slatkin and Lande, 1994, Barton, 2001, Chevin et al., 2014]; we provide the following calculations for
65
completeness.
To get an idea of how segregation variance might evolve, suppose that a trait is determined byL unlinked,
biallelic loci, and that the i
th
locus has two alleles with additive eects
i
, so that being homozygous for
the + allele contributes +2
i
to the trait. For simplicity, we will neglect the eects of selection. If the +
allele at locusi is at frequencyp
i
in a population, then the mean and genetic variance of the trait in a diploid
population with random mating is
m = 2
X
i
(2p
i
1)
i
s
2
= 4
X
i
p
i
(1p
i
)
2
i
:
Segregation variance between two parents depends on the loci at which either are heterozygous, and each
locus contributes independently since alleles are additive. If the alleles are at Hardy-Weinberg proportions,
then since segregation acts like a fair coin
ip, a heterozygous locus contributes
2
i
=4 to the variance, and
so the mean segregation variance, averaging across parents, is
R
0
(p) =
X
i
p
i
(1p
i
)
2
i
:
On the other hand, if the second parent came from a distinct population with frequencies q
i
(an F
1
hybrid), this would be
R
1
(p;q) =
1
2
X
i
p
2
i
(1p
i
)
2
2
i
+
1
2
X
i
q
2
i
(1q
i
)
2
2
i
= (R
0
(p) +R
0
(q))=2:
If we assume that the populations are at equilibrium, R
0
(p)R
0
(q), and so R
1
(p;q)R
0
(p).
Now consider anF
2
hybrid, where both parents areF
1
and so each heterozygous at locusi with probability
p
i
(1q
i
) + (1p
i
)q
i
. Then
R
2
(p;q) =
1
2
X
i
(p
i
(1q
i
) + (1p
i
)q
i
)
2
i
:
Suppose that the two populations are slightly drifted from each other, with frequency dierencep
i
q
i
= 2
i
.
66
Then,
p(1q) +p(1q) = (u +)(1u +) + (u)(1u)
= 2u(1u) + 2
2
:
If the frequencies have evolved neutrally in unconnected, Wright-Fisher populations of eective size N for
t generations from a common ancestor with allele frequency u, then has mean zero and variance roughly
2u(1u)t=N. Still assuming the populations are at stationarity, so that R
0
is constant between the two,
and taking the frequencies p
i
as a proxy for the ancestral frequencies u
i
, this implies that we expect
R
2
R
0
+ 2
X
i
p
i
(1p
i
)
2
i
t=N
=
1 +
2t
N
R
0
:
On the other hand, the expected squared dierence in trait means between two such populations is
8
X
i
p
i
(1p
i
)
2
i
t=N = 8R
0
t=N: (3.9)
This implies that under this model, segregation variance in F
2
hybrids between two populations is roughly
increased by a factor of 1=8 of the dierence between their means.
67
Chapter 4
Chance and necessity in the evolution
of gene network complexity
Abstract
Many biological systems, such as gene regulatory networks, appear to be highly complex. The necessity
of such complexity is not clear, and may be the consequence of non-adaptive and chance processes. Here
we apply system theory to study the evolution of complexity in gene regulatory networks. We show that
network complexity, as dened by the addition of super
uous molecular components, can increase under
neutral conditions within a specic parameter space, where the addition and deletion rates of de novo genes
is nearly equivalent. Under a broader set of circumstances, networks may increase in complexity by recruiting
existing and available genes, or as a consequence of system drift among multiple, simultaneously active,
biological systems; the total number of genes may not increase, but the number of genes underlying a given
trait does. We further suggest that in addition to a neutral complexity ratchet, selection and consolidation of
biological systems may lead to the appearance of gratuitously complex systems, despite actually simplifying
system architecture on a larger scale.
Back to the Table of Contents
68
4.1 Introduction
It is an outstanding problem in biology to identify the myriad of interactions among genes and their products
and further to ascertain the functional and adaptive signicance of these interactions. At rst glance, through
the incomplete application of evolutionary principles, one might draw the conclusion that the architecture of
genetic interactions within a cell are precisely tuned in order to carry out adaptive functions. It has however,
been argued that this hypothesis fails under scrutiny; population genetic models generally do not support
such a view [Lynch, 2007b]. Elsewhere it has been suggested that the architecture of biological systems
can largely be in
uenced by non-adaptive or neutral processes [Gould and Lewontin, 1979]. This can be
intuitively appealing as many systems seem to be of \baroque" design, or otherwise unnecessarily complex,
such as the so called \Rube Goldberg" structure of eukaryotic circadian rhythm regulatory networks [Sancar,
2008] and biological pathways. Similarly, the apparent \gratuitous complexity" of RNA editing [Covello and
Gray, 1993], the eukaryotic spliceosome [Nilsen, 2003, Gray et al., 2010], and the wiring of neuronal circuits
[Dumont and Robertson, 1986] may be the consequence of non-adaptive evolutionary processes.
One such non-adaptive process may be described as a complexity ratchet { sometimes referred to as
constructive neutral evolution [Stoltzfus, 1999, 2012, Luke s et al., 2011] { in which (nearly) neutral changes
x within a population, become integrated into a system, and are dicult to remove or reverse. In particular,
Stoltzfus [2012] asks us to consider the scrambling of cords between a personal computer, printer and monitor
as a metaphor for the scrambled genes observed in Ciliates. Constructive neutral evolution entails that some
initially neutral variation is introduced via some bias (maybe an entropic one), followed by later evolutionary
changes, rendering the loss of such variation to be deleterious. If such a process occurs, systems may \ratchet"
up in complexity, as neutral changes are evolutionarily integrated and become indispensable. This process,
as suggested [Stoltzfus, 1999, Gray et al., 2010] may be conceptually similar to contingent irreversibility
[Szathm ary and Smith, 1995] { the \accidental" process by which for instance mitochondria may have lost the
ability to independently replicate, after symbiotically residing within a larger cell. This neutral complexity
ratchet is also thought to include processes such as the duplication-degeneration-complementation process
by which redundant duplicate genes evolve away from being expendable [Lynch and Force, 2000a].
In this chapter, I aim to study the balance between neutral and selective evolutionary forces on the
structure and complexity of gene regulatory networks. Specically, I ask how much of a system's apparent
complexity is functionally necessary versus a consequence of chance and non-adaptive processes. Do poten-
tially independent systems evolve interconnections and become irreversibly tangled by what has been referred
to as genetic moonlighting (\reusing" genes in other networks) [Speijer, 2011]? Does evolution discriminate
69
between \simple" regulatory networks { networks that, informally, perform a function with as few genes
as possible { and \complex" networks { systems that perform the same function but involve and rely on
the activity of larger sets of genes and components? Or is network organization only supercially complex,
maybe due to the selective consolidation of parts, and in fact wired eciently?
Within a system theoretic framework, given some adaptive function { e.g. the oscillating transcription of
a gene regulatory network involved in circadian rhythm (as introduced above in Ch. 3, and revisited here)
{ we can explicitly evaluate a particular network's tness for such a function as well as to ask how minimal
(or simple) such a network is. The simplicity of a system is a measure of the number of interacting genes
within the network. If a specic function, such as an oscillator, can be achieved, in principle, by as few
as two genes, then any oscillating network involving more than two genes can be viewed as (unnecessarily)
complex.
I expect the complexity and organization of genetic networks to be the consequence of overlapping forces:
selection for adaptive function, population parameters, and genetic drift. In this chapter I apply tools from
system theory and simulate evolution to decompose the eects of these forces on network structure. Here I
aim to clarify the interplay of these processes by using explicit mathematical and computational models.
4.2 Methods
4.2.1 System theory
As in Chapter 3 Section 3.2.1, we model gene regulatory networks as linear dynamical systems. Here we
brie
y review this method. An organism's phenotype (t) is the result of the interconnections of a gene
regulatory network (A;B;C) and an environmental input u(t). The dynamics of a system (A;B;C), is
composed of two equations,
_ (t) =A(t) +Bu(t)
(t) =C(t):
(4.1)
Here,(t) is a list of a system's molecular concentrations at timet. As some system dynamics may not be
visible nor relevant to selection, we distinguish the kryptotype,(t), (as it is hidden) from the phenotype,(t)
{ the system's dynamics visible to selection. The change in internal molecular concentrations as a function
of time, _ (t), is a linear function of the system's current state, as determined by, the system architecture
A (an nn matrix), a time-varying environmental input u(t), and B, an nm matrix which determines
70
how a system processes its environment. Finally, the `n matrix C and outputs a linear combination
of the internal dynamics. Thus the phenotype is simply a convolution of the system organization and the
environment,
(t) =Ce
At
(0) +
Z
t
0
Ce
A(ts)
Bu(s)ds; (4.2)
where we sometimes refer to h(t) := Ce
At
B as the impulse response of the system. Since A species gene
regulatory network architecture, thei
th
row ofA can be roughly interpreted as a summary of the promoter's
eects on the regulation of gene i.
System drift and the Kalman decomposition
As previously demonstrated in Chapter 3, system drift can be modeled within this framework as a simple
coordinate transformation of a minimal system realization, or more generally, by applying the Kalman
decomposition. Biologically, a coordinate transformation changes the strengths of the interactions between
genes, yet preserves the system output. The Kalman decomposition is more general and characterizes the
set of all phenotypically equivalent linear gene regulatory networks, of any nite network size as it allows for
the addition (and when appropriate, the deletion) of genes.
To review, the Kalman decomposition [Kalman, 1963] of a system (A;B;C) gives a change of basis
P such that the transformed system (PAP
1
;PB;CP
1
) has the following form:
PAP
1
=
2
6
6
6
6
6
6
6
4
A
r o
A
r o;ro
A
r o; r o
A
r o; ro
0 A
ro
0 A
ro; ro
0 0 A
r o
A
r o; ro
0 0 0 A
ro
3
7
7
7
7
7
7
7
5
;
and
PB =
2
6
6
6
6
6
6
6
4
B
r o
B
ro
0
0
3
7
7
7
7
7
7
7
5
(CP
1
)
T
=
2
6
6
6
6
6
6
6
4
0
C
T
ro
0
C
T
ro
3
7
7
7
7
7
7
7
5
:
71
The impulse response of the system is given by
h(t) =C
ro
e
Arot
B
ro
;
and therefore, the system is phenotypically equivalent to the minimal system (A
ro
;B
ro
;C
ro
). For a more in
depth description refer to Section 3.2.1 in Chapter 3.
4.2.2 Minimal system realizations
We are interested in studying system complexity and thus need to dene more exactly \complex" versus
\simple" systems. We say that any regulatory network whose function depends on a number of genes
greater than what is, in principle, the smallest number necessary, to be complex. Likewise, any system that
is functionally equivalent to another, yet relies on a smaller number of genes, is said to be simpler than
the other. Here we dene the simplest possible system to be, what in system theory is called a minimal
realization. In this narrow view, any larger system, with respect to function, is unnecessarily complex.
Within this framework we say that a linear dynamical system (A;B;C) (where recall: A2R
nn
;B2R
nm
,
and C2R
`n
) is minimal (or irreducible) if both its controllability matrix,
C :=
B AB ::: A
n1
B
;
and its observability matrix,
O :=
2
6
6
6
6
6
6
6
4
C
CA
.
.
.
CA
n1
3
7
7
7
7
7
7
7
5
:
have ranks equal to n. That is, if and only if rank(C) = rank(O) = n, is the system controllable, and a
system is controllable if and only if it is minimal [Kalman et al., 1969].
4.2.3 Random system generation
Since we are interested in the likelihood of a neutral complexity ratchet, we employ a method to generate
randomly wired gene regulatory networks. Below we examine these random networks' robustness to gene
72
deletion. To generate a random gene regulatory network of a particular size and for a specic function,
we apply the Kalman decomposition. First, a minimal system realization for a chosen phenotype and
a number of non-minimal dimensions is chosen and input into the Kalman decomposition. Next, the free
parameters are drawn from a normal distribution centered at 0, and with standard deviation. A coordinate
transformationP is then applied to the decomposed system, as described above. However, in order to avoid
numerical instability when computing system phenotypes, we need to normalize the eigenvalues of P . A
matrix, P
0
, is generated by randomly choosing elements in the same manner as the free parameters, and its
eigenvalues are normalized so that they fall on the complex unit circle. Concretely, P
0
is decomposed such
that
0
is a diagonal eigenvalue matrix,
P
0
=V
0
V
1
;
the eigenvalues are divided by their complex modulus,
=
0
p
<(
0
)
2
+=(
0
)
2
;
and then reinserted to construct P ,
P =V V
1
:
4.2.4 Evolutionary simulations
We simulated evolution under stabilizing selection for phenotype by rst generating a homogeneous pop-
ulation of systems for a chosen function and network size using the Kalman decomposition as described
above.
Each element of (A;B;C) per individual is then independently mutated with probability . Mutations
change the strength of the regulatory interaction between two genes at a particular site by adding a value
drawn from a normal distribution with mean 0 and standard deviation
to the coecient. New genes
can also be added or removed from a system with probabilities
add
and
del
per gene, respectively. Gene
addition adds an empty (all elements set to 0) row and column to the regulatory network matrix A, and
augments matrices B and C with zeroes appropriately. Similarly, gene deletion, removes a row and column
of matrix A, and also removes the corresponding entries in the B and C matrices. In some simulations we
can x network size by setting both gene addition and deletion rates to zero.
73
Following gene mutation, deletion, and addition, each individual's phenotype is assigned a tness value
by applying the tness function. The tness function is a weighted distance of the phenotype compared to
optimal expression patterns, described below. Next, each individual reproduces asexually with a probability
proportional to its tness, and population size is held constant. This process can be repeated for a chosen
number of generations.
Fitness function
We score a system's tness by computing how far its phenotype is from the optimal one. Suppose that (t)
is a nonnegative, smooth, square-integrable weighting function and
0
(t) is the optimal phenotype, then the
tness of a particular phenotype is given as,
F() = exp
Z
10
0
(t)k(t)
0
(t)k
2
dt
: (4.3)
For all simulations presented we set (t) = exp(
t
=4), and set the input u(t) to be a unit impulse, i.e. a
Dirac function, and so here a system's phenotype is equivalent to its impulse response, (t) =h(t).
4.3 Results
4.3.1 Moonlighting among systems
If independent gene regulatory networks are simultaneously active within a cell it might be possible that
over evolutionary time they become intertwined and functionally co-dependent. Here, I demonstrate this
process theoretically.
Suppose that an organism's tness is a function of two phenotypic traits, given by the systems (A;B;C)
and (
b
A;
b
B;
b
C), whose dynamics are given by,
_ (t) =A(t) +Bu(t)
(t) =C(t)
and
_
^ (t) =
b
A^ (t) +
b
B^ u(t)
^
(t) =
b
C^ (t):
(4.4)
74
We can represent this as one composite system,
2
6
4
_ (t)
^ (t)
3
7
5 =
2
6
4
A 0
0
b
A
3
7
5
2
6
4
(t)
^ (t)
3
7
5 +
2
6
4
B 0
0
b
B
3
7
5
2
6
4
u(t)
^ u(t)
3
7
5
2
6
4
(t)
^
(t)
3
7
5 =
2
6
4
C 0
0
b
C
3
7
5
2
6
4
(t)
^ (t)
3
7
5
(4.5)
If we then apply a coordinate transformation to this composite system, we will preserve its overall
function and phenotype, however, interconnections can form between the systems, and the systems may
become intertwined. As such, the phenotypically equivalent space for the composite system includes network
organizations with interconnections between the systems. In fact, the vast majority of network realizations
following a coordinate transformation will include non-zero terms in the o-diagonal entries of the composite
matrices. This is just a special case of system drift. In this case, complexity, if dened as super
uous genes
(with respect to both systems), does not increase, however the complexity of each individual system does
appear to increase.
This suggests that most systems will be highly connected (i.e. the matrices (A;B;C) will tend to be
dense), implying that most genes will aect most phenotypes { and this phenotypic eect will lie on a
spectrum between innitesimally and signicantly contributing to phenotypic functioning. As this model
predicts the interconnections among systems within an organism to generally increase { this may, in part,
explain the evolution of \omnigenic" traits [Boyle et al., 2017].
It is also possible, however, that as two systems become intertwined, even if both are, by themselves,
minimal, that the minimal realization for the composite system is smaller than the sum of the two individual
systems. In this case, if there is strong enough selection, and a mutational path, network interconnections
may allow for a reduction in composite network size, through what we refer to as reductive selection. Within
the present linear framework, this exact scenario may be quite rare as it would require each of the subsystems
to share eigenvalues. However, this process could still occur approximately (where the eigenvalues are of a
similar value), or perhaps when nonlinear dynamics are considered.
For instance, if we were interested in an organism that relied upon two distinct oscillating genes with the
same frequency, each of which was given by an independent and minimal two-gene system and responded to
the same unit impulse input, the composite minimal realization would also be two dimensional { two fewer
dimensions than the sum of their minimal systems. E.g. the system given byA =
0 1
1 0
;B = [
1
1
];C = [
1 0
0 1
].
In cases similar to this the appearance of complexity may just be a result of the selective consolidation, now
75
highly interconnected networks. Below, as another example of selective consolidation, we simulated the case
in which three independent oscillators evolve, become intertwined and shrink in overall size (Figure 4.1).
Figure 4.1: Evolution of a six-gene system initially composed of three two-gene oscillators. Despite each of
the three subsystems being minimal, the composite form can shrink to as small as two genes. Evolutionary
parameters were set to =
= 10
1
;
del
= 10
2
;
add
= 10
3
.
4.3.2 Ratchet Propensity
To assess the possibility of a gene regulatory network size ratchet we asked whether or not non-minimal gene
regulatory networks of varying sizes were robust to gene deletions. We reasoned that, although the removal
of an initially super
uous gene may be inconsequential, that following a period of system drift the initially
super
uous gene may be dicult, or even impossible to remove without compromising network function, and
thus organismal tness.
If such a process were to occur, networks may ratchet up in complexity over evolutionary time, as more
and more genes become essential for proper network functioning. However, if this process were to go on
76
indenitely, as more genes contribute to network functioning, the phenotypic impact of removing a gene
might diminish with network size, allowing networks to be more easily be pruned. If so the ratchet may
reach some equilibrium network size, which it
uctuates around.
To examine this process, we randomly generated phenotypically equivalent systems of varying network
sizes, sequentially deleting genes (with replacement), and then measuring tnesses (Figure 4.2). If a system
could withstand the removal of at least one gene (i.e. maintain a tness 0:8), we say it is reducible. We
reasoned that a randomly generated system might approximate the state of a network after undergoing a
signicant period of neutral system drift under stabilizing selection.
77
10 20 30 40
0.0 0.2 0.4 0.6 0.8 1.0
network size
reducibility
fitness
0.0 0.2 0.4 0.6 0.8 1.0
reducibility
mean fitness
0 2 4 6 8 10
−6 −4 −2 0 2 4 6
deletions : oscillator
time
phenotype
D = 0.0
D = 2.0
D = 4.0
D = 6.0
D = 8.0
D > 10.0
system sigma=0.0010, extra dims=2
0 2 4 6 8 10
−6 −4 −2 0 2 4 6
deletions : oscillator
time
phenotype
D = 0.0
D = 2.0
D = 4.0
D = 6.0
D = 8.0
D > 10.0
system sigma=0.0010, extra dims=20
Figure 4.2: (Top) Reducibility plot (blue), the likelihood a randomly generated oscillator can function
after removal of at least one gene. (Black) Mean tness of a randomly generated system following a gene
deletion. (Lower) the single gene deletion phenotypes of a randomly generated four-gene (left) twenty-two-
gene (right) oscillator. Phenotypic dynamics are colored by the phenotypic distance relative to optimum,
D =
p
log(F((t)) (see Ch. 3 Eq. 3.5). In this particular four-gene instance all single-gene deletions
signicantly alter the phenotype, however, in the twenty-two-gene case many deletions have a negligible
phenotypic impact (see the darker lines).
4.3.3 Evolutionary dynamics
Network growth
In the previous section, we plotted network reducibility versus network size. This plot suggests that non-
minimal networks, on average, are less reducible than larger networks. Further, it appears that as networks
78
get larger, reducibility increases. This suggests that networks may ratchet upwards in size, while reducibil-
ity is low, and reach an equilibrium network size, until the eective rates of nearly neutral deletion and
recruitment are equal. Despite this demonstration, it is not clear that a population evolving under stabiliz-
ing selection will increase in complexity. To assess this potential we simulated population evolution under
stabilizing selection.
First, we simulated the evolution of a population of 100 individuals for 10
4
generations, varying gene
mutation,, addition,
add
, and deletion,
del
, rates, as well as the average mutational magnitude,
{ that
is, the magnitude of the change to a system regulatory coecient. Recall that the mutation, deletion and
addition rates are per gene. Simulations were initiated with a population of minimal two-gene oscillators (as
in Ch. 3, Example 1), and each run used a combination of parameters, diering by an order of magnitude,
and ranging from 10
1
to 10
4
, and with the condition that
add
del
(a total of eighty combinations).
Depending on the parameter scheme and the initial system conditions chosen, we observe very dierent
evolutionary outcomes. For the simulations in which
add
was strictly less than
del
we observed network
growth for only one of the eighty parameter schemes ( where = 10
2
;
= 10
1
;
del
= 10
3
;
add
= 10
4
).
We did however observe network growth in most of the simulations where addition and deletions were
equivalent.
Genetic necessity
Although some of these gene regulatory networks appear to grow substantially in size, we did not yet know
whether the additional (beyond minimal) genes were actually phenotypically important, or necessary for
survival. To address whether genes were functionally important, we chose individuals from a population and
sequentially (with replacement) deleted each gene and scored the post-deletion tnesses. If a gene deletion
drops relative tness by 50% or more, we deemed such a gene to be essential. If a gene knockout drops
relative tness by more than 5%, such a gene is important, and if a gene deletion actually raises the relative
tness by more than 5%, it is determined to be deleterious.
To understand whether the phenotypic importance of the network growth we observed, we chose the
two highest tness individuals per generation per simulation and assessed each of their gene's phenotypic
importances. We then averaged and plotted the number of essential, important, neutral and deleterious genes
in these two individuals per generation. This analysis revealed that much of the apparent network growth
was phenotypically inconsequential. Only in a subset of simulations in which
del
=
add
, and both and
were each either 10
1
to 10
2
, did gene networks grow in size by adding and integrating phenotypically
79
consequential genes (Figure 4.3).
To further understand this process, we simulated the evolution of a population of 100, initially minimal
oscillators, for 2 10
4
generations, xing gene deletion and addition rates at 10
3
, and varying both the
mutation rate and mutational magnitude between 10
1
and 10
2
among simulations. These results displayed
an intriguing relationship between the mutation rate and mutational magnitude. If both the mutation rates
and magnitudes are relatively high or low (i.e. both either 10
1
or 10
2
), we observe almost no phenotypically
important network growth (Figure 4.4). However we do see substantial, phenotypically important, network
growth when both of these parameters are of moderate value, or if each parameter is drawn from the opposite
extreme of the range presently tested.
Figure 4.3: Top to bottom: mean population network size, population gene phenotypic importance, and
tnesses. (Left) Population of oscillators evolving with the parameters: =
=
del
=
add
= 10
2
. In
this case, the network grows to include a large number of genes, however, only the minimal and initial number
of genes have any phenotypic importance. (Right) Population of oscillators evolving with the parameters:
= 10
1
;
= 10
2
;
del
=
add
= 10
3
In this case, we see that by the end of the simulation, nearly all
possible genes are phenotypically important. Note: network size was capped at 30 only for the simulation
in the right panel.
80
0.02 0.04 0.06 0.08 0.10
5 10 15 20 25 30
mutation rate
number of important genes
μ
σ
= 0.01
μ
σ
= 0.05
μ
σ
= 0.1
Figure 4.4: Number of important genes in the most t system after 2 10
4
generations of evolution versus
mutation rate , with gene addition
add
and deletion
del
rates xed at 10
3
. Dierent lines represent
dierent mutational magnitudes
. All simulations were started with a population of 100 minimal, two-
gene oscillators. Note: network size was capped at 30 to improve computational performance and because
systems with greater than 30 important genes tended to suer from a mutational meltdown and go extinct.
Simulations without gene addition and deletion dynamics
The de novo addition of genes as depicted above may only play a minor role in system evolution. In some
cases, genes may not necessarily join functional networks by the de novo addition of genes, but may instead
be recruited from other contemporaneously active regulatory networks or otherwise available regulatory
sequences. If so, a more relevant model might not allow a mechanism to remove or add gene \slots" as
above, but more simply, use a mechanism that recruits already available and active genes. Above, where
we examined genetic moonlighting among systems, we explicitly modeled the available genes in neighboring
orthogonal networks. Here, we implicitly model this possibility, not by tangling independent networks, but
81
more abstractly, by initially including a number of mutationally available genes. In this case, we may start
with a minimum system with a xed size n
0
, of which only a subset of genes, say the minimum number
n
min
< n
0
, are initially involved in the network (i.e. non-zero), and n
0
n
min
silent but available (i.e. set
to zero) genes. In such a scenario, our simulations show that these networks may slowly recruit available
genes, such that after sucient evolutionary time, many of the n
0
genes become functionally important and
essential to the network's function (Figure 4.5). In this setting natural selection cannot lter by mutational
target size.
Figure 4.5: Example of the evolutionary dynamics of a system with a xed network size
4.4 Discussion
Previously, with Peter Ralph, I developed a theoretical framework to discuss biological system evolution
by incorporating tools from system theory [Schiman and Ralph, 2018, Ch. 3]. Here I further apply
this framework and expand upon it using simulations to study the evolution of gene regulatory network
architecture and complexity. It is often acknowledged that biological systems are bewilderingly complicated,
however the cause and necessity of such complexity is not well understood [Sancar, 2008]. One possible view
is that most biological features are the consequence of natural selection { that is perhaps system complexity is
an adaptation, maybe to improve robustness or to decrease noise, or for some presently unknown reason. An
alternative view might suggest that such complexity is primarily a non-adaptive byproduct of the evolutionary
82
process { due to the particulars of mutation, recombination, or population size, for instance. From this point
of view, network complexity may largely be interpreted as the result of chance and historical accident.
Most evolutionary biologists would probably agree that biological system complexity (broadly dened) is
a consequence of both chance (genetic and system drift) and necessity (adaptive function) [Lynch, 2007b,
Speijer, 2011]. Despite this view, with regards to system architecture, there are very few quantitative
descriptions or predictions of this balance. In this chapter I further develop the system theoretic framework
introduced in Ch. 3, and apply it to study the evolution of gene regulatory network architecture complexity
under two general scenarios: (1) by the addition and deletion of genes, and (2) the tangling of independent
and concurrent networks.
De novo complexity Above I simulate the evolution of a population, whose individuals' phenotypes are
determined by the structure and organization of their gene regulatory networks, under varying population
and molecular parameters. I show that system complexity, at least as dened by the excess number of genes
within a network, can increase while a population is under stabilizing selection. This suggests that network
complexity may increase due to non-adaptive forces. However, the parameter space in which this occurs is
quite small and typically requires gene addition rates to be greater than or equal to deletion rates, and thus
may be rarely realized in actual populations.
Under scenarios where gene deletion rates are greater than gene addition rates, networks rarely grow in
size. However, even if gene deletion and addition rates are equivalent, networks still do not always increase
in size. Network growth tends to occur when the mutation rate and mutational magnitude are both of
moderate value (between 0:01 and 0:1), or if they take values from opposite extremes. This may be because
selection can rapidly act against large and frequent mutations, and smaller, infrequent mutations may drive
network growth too slowly. Small and frequent mutations, however, may be less visible to selection yet drive
system drift fast enough to lead to a complexity ratchet. It appears that selection is acting to reduce the
mutational target size of regulatory networks, however when mutational magnitudes or rates are relatively
small, this pressure is diminished, and if either the mutational magnitude or rate is large enough, system
drift can still occur. This interpretation is tenuous and requires further analysis.
In cases where I did observe phenotypically signicant network growth, I also noticed a concordant drop
in the population mean tnesses, often followed by a mutational meltdown and population extinction. This
is likely an example of Muller's ratchet [Muller, 1932]; all simulations were of asexual populations.
83
Tangling complexity Although network complexity may increase by the de novo addition of genes under
a narrow set of conditions, on a broader scale, independent networks may evolve to become more inter-
connected, and appear to be highly complex. I show that multiple, simultaneously active and independent
networks can increase their interconnectivity over evolutionary time, and that this process is simply a con-
sequence of system drift.
Interestingly, a composite system, even if composed of only minimal systems, itself is not necessarily
minimal. I show that such a system, given a particular range of parameters (e.g. high mutational pressure
and/or strong selection), may actually reduce in size over evolutionary time, and actually consolidate the
previously independent gene regulatory networks.
Furthermore, we show that if network size is xed { perhaps because other networks within the organism
are concurrently active, or possibly because some regulatory sequences are accessible to the basal tran-
scriptional machinery { over evolutionary time, systems will tend to incorporate mutationally-available and
(initially) functionally unnecessary, genes into their regulatory structures. Although similar to the evolution
of de novo complexity, this process is distinct as it relies on the presence of already available, but initially
unincorporated, network components.
This process may be partially responsible for the suggested \omnigenic model" of complex traits; the
observed associations between genetic variation and human disease typically are not concentrated within a
single pathway, but instead are seen distributed across the genome [Boyle et al., 2017]. Omnigenic traits
could arise from the evolutionary intertwining of gene regulatory networks.
Constructive neutral evolution or reductive selection? As mentioned above, I show that similar
to the \scrambling cords" analogy in Stoltzfus [2012], gene regulatory networks may tend to incorporate
available genes into their regulatory structure over evolutionary time. This process is essentially an entropic
one, as the function of the network slowly diuses across the available genes. However, it is much less
likely that new genes will be added to a network if they are not already, somehow functionally, or at least
presently, active. Only under suciently high (although perhaps biologically realistic) gene addition and
mutation rates, will networks ratchet up in complexity, by adding de novo components. Continuing with
the analogy in Stoltzfus [2012], this is the dierence between simply scrambling already present wires versus
adding and then scrambling wires. In the latter case, if scrambling does not occur rapidly enough, selection
may remove the super
uous wires, as they increase the risk of malfunction; in biological terms, these extra
genes increase, however slightly, the mutational target size of the network. There is another important
84
distinction between these two processes: there is no need to invoke a ratchet like mechanism to explain
the scrambling-only process { whether or not a scrambled network can function with or without one of its
components is not the force that adds complexity to networks { it is simply due to the size of the neutral,
phenotypically-equivalent, network space. If the neutral network space is suciently large, the probability
that a system nds itself organized in the simplest (or nearly simplest) conguration becomes exceedingly
improbable. In this view, this process is just a special case of system drift [True and Haag, 2001, Schiman
and Ralph, 2018].
Finally, I suggest the possibility of reductive selection { that under certain conditions, if previously
independent networks become interconnected, previously indispensable genes may be pruned by selection,
as the networks reorganize and consolidate genes by sharing kryptotypes, possibly increasing eciency
and reducing mutational target size. Despite this process behaving in a manner opposite of a process
like constructive neutral evolution { the supercial results might look very similar in some cases. As two
systems become intertwined and then pruned, it may be challenging to determine the function of a \part"
disconnected from the \whole". This reductive process might also lead to irreversible contingency as things
become interconnected and co-dependent.
Despite this chapter's focus on models such as constructive neutral evolution { a process inferred across
a wider range of systems than just gene regulatory networks { it is important to note that our analysis and
results are specic to our modeling framework of linear regulatory networks, and may not translate to other
discussions of complexity, such as the evolution of the spliceosome.
85
Bibliography
Sammi Ali, Sarah Signor, Konstantin Kozlov, and Sergey Nuzhdin. Quantitative variation and evolution of
spatially explicit morphogen expression in Drosophila. bioRxiv, page 175711, 2017. (page 60)
BDO Anderson, RW Newcomb, RE Kalman, and DC Youla. Equivalence of linear time-invariant dynamical
systems. Journal of the Franklin Institute, 281(5):371{378, 1966. (page 40), (page 45)
Stevan J Arnold, Reinhard B urger, Paul A Hohenlohe, Beverley C Ajie, and Adam G Jones. Understanding
the evolution and stability of the G-matrix. Evolution, 62(10):2451{2461, 2008. ISSN 1558-5646. doi: 10.
1111/j.1558-5646.2008.00472.x. URL http://dx.doi.org/10.1111/j.1558-5646.2008.00472.x. (page
54)
Anthony J Barley, Jordan White, Arvin C Diesmos, and Rafe M Brown. The challenge of species delimitation
at the extremes: diversication without morphological change in philippine sun skinks. Evolution, 67(12):
3556{3572, 2013. (page 60)
N H Barton. The maintenance of polygenic variation through a balance between mutation and stabilizing
selection. Genet Res, 47(3):209{216, June 1986. doi: 10.1017/S0016672300023156. URL https://www.
ncbi.nlm.nih.gov/pubmed/3744046. (page 41)
Nicholas H. Barton. The divergence of a polygenic system subject to stabilizing selection, mutation and
drift. Genetics Research, 54(1):59{78, 1989. (page 41)
Nicholas H. Barton. The role of hybridization in evolution. Molecular Ecology, 10(3):551{568, 2001. (page
41), (page 53), (page 55), (page 60), (page 65)
Nicholas H Barton, Alison M Etheridge, and Amandine V eber. The innitesimal model: Denition, deriva-
tion, and implications. Theor Popul Biol, 118:50{73, December 2017. doi: 10.1016/j.tpb.2017.06.001. URL
https://www.ncbi.nlm.nih.gov/pubmed/28709925. (page 53), (page 62)
Richard Ernest Bellman and Karl Johan
Astr om. On structural identiability. Mathematical biosciences, 7
(3-4):329{339, 1970. (page 40)
Aviv Bergman and Mark L Siegal. Evolutionary capacitance as a general feature of complex gene networks.
Nature, 424(6948):549{552, 2003. (page 40)
Evan A. Boyle, Yang I. Li, and Jonathan K. Pritchard. An expanded view of complex traits: From polygenic
to omnigenic. Cell, 169(7):1177{1186, 2017. ISSN 00928674. doi: 10.1016/j.cell.2017.05.038. URL http:
//dx.doi.org/10.1016/j.cell.2017.05.038. (page 75), (page 84)
86
Robert K Bradley, Xiao-Yong Li, Cole Trapnell, Stuart Davidson, Lior Pachter, Hou Cheng Chu, Leath A
Tonkin, Mark D Biggin, and Michael B Eisen. Binding Site Turnover Produces Pervasive Quantitative
Changes in Transcription Factor Binding between Closely Related Drosophila Species. PLoS biology, 8(3):
e1000343, March 2010. (page 12)
Reinhard B urger and Russell Lande. On the distribution of the mean and variance of a quantitative trait
under mutation-selection-drift balance. Genetics, 138(3):901{912, 1994. (page 61)
Aleksandra A. Chertkova, Joshua S. Schiman, Sergey V. Nuzhdin, Konstantin N. Kozlov, Maria G. Sam-
sonova, and Vitaly V. Gursky. In silico evolution of the Drosophila gap gene regulatory sequence un-
der elevated mutational pressure. BMC Evolutionary Biology, 17(1):4, 2017. ISSN 1471-2148. doi:
10.1186/s12862-016-0866-y. URL http://dx.doi.org/10.1186/s12862-016-0866-y. (page 40)
Luis-Miguel Chevin, Guillaume Decorzent, and Thomas Lenormand. Niche dimensionality and the genetics
of ecological speciation. Evolution, 68(5):1244{1256, 2014. (page 41), (page 53), (page 54), (page 60),
(page 65)
Joseph D Coolon, C Joel McManus, Kraig R Stevenson, Brenton R Graveley, and Patricia J Wittkopp.
Tempo and mode of regulatory evolution in Drosophila. Genome research, 24(5):797{808, 2014. (page 60)
PatrickS Covello and MichaelW Gray. On the evolution of rna editing. Trends in Genetics, 9(8):265{268,
1993. (page 69)
Jerry A Coyne and H Allen Orr. The evolutionary genetics of speciation. Philosophical Transactions of the
Royal Society of London B: Biological Sciences, 353(1366):287{305, 1998. (page 60)
Gheorghe Craciun and Casian Pantea. Identiability of chemical reaction networks. Journal of Mathematical
Chemistry, 44(1):244{259, 2008. (page 40)
Bernard Crespi and Patrik Nosil. Con
ictual speciation: species formation via genomic con
ict. Trends in
Ecology & Evolution, 28(1):48{57, 2013. (page 61)
Justin Crocker, Namiko Abe, Lucrezia Rinaldi, Alistair P McGregor, Nicol as Frankel, Shu Wang, Ahmad
Alsawadi, Philippe Valenti, Serge Plaza, Fran cois Payre, Richard S Mann, and David L Stern. Low anity
binding site clusters confer hox specicity and regulatory robustness. Cell, 160(1-2):191{203, January 2015.
(page 12)
Anton Crombach, M onica A Garc a-Solache, and Johannes Jaeger. Evolution of early development in dipter-
ans: Reverse-engineering the gap gene network in the moth midge Clogmia albipunctata (Psychodidae).
Bio Systems, 123:74{85, June 2014. (page 11)
Anton Crombach, Karl R Wotton, Eva Jim enez-Guri, and Johannes Jaeger. Gap Gene Regulatory Dynamics
Evolve along a Genotype Network. Molecular biology and evolution, 33(5):1293{1307, May 2016a. (page
11)
Anton Crombach, Karl R Wotton, Eva Jim enez-Guri, and Johannes Jaeger. Gap gene regulatory dynamics
evolve along a genotype network. Molecular biology and evolution, 33(5):1293{1307, 2016b. (page 40),
(page 60)
Chiraj K Dalal and Alexander D Johnson. How transcription circuits explore alternative architectures while
maintaining overall circuit output. Genes & Development, 31(14):1397{1405, 2017. (page 41), (page 60)
Chiraj K Dalal, Ignacio A Zuleta, Kaitlin F Mitchell, David R Andes, Hana El-Samad, and Alexander D
Johnson. Transcriptional rewiring over evolutionary timescales changes quantitative and qualitative prop-
erties of gene expression. eLife, 5:e18981, 2016. (page 41)
87
Charles Robert Darwin. On the Origin of Species by Means of Natural Selection, or the Preservation of
Favoured Races in the Struggle for Life. John Murray, Albermarle Street, London, 1859. (page 4)
Eric H Davidson and Douglas H Erwin. Gene regulatory networks and the evolution of animal body plans.
Science, 311(5762):796{800, 2006. (page 40)
Jeremy Draghi and Michael Whitlock. Robustness to noise in gene expression evolves despite epistatic
constraints in a model of gene networks. Evolution, 69(9):2345{2358, 2015. (page 40)
James PC Dumont and R Meldrum Robertson. Neuronal circuits: an evolutionary perspective. Science, 233
(4766):849{853, 1986. (page 69)
Thyago Duque, Md Abul Hassan Samee, M Kazemian, Hannah N Pham, Michael H Brodsky, and Saurabh
Sinha. Simulations of enhancer evolution provide mechanistic insights into gene regulation. Molecular
biology and evolution, 31(1):184{200, January 2014. (page 11)
Gerald M Edelman and Joseph A Gally. Degeneracy and complexity in biological systems. Proceedings of
the National Academy of Sciences, 98(24):13763{13768, 2001. (page 40)
Manfred Eigen, John McCaskill, and Peter Schuster. Molecular quasi-species. The Journal of Physical
Chemistry, 92(24):6881{6891, 1988. (page 26)
Janna L Fierst and Thomas F Hansen. Genetic architecture and postzygotic reproductive isolation: evolution
of Bateson-Dobzhansky-Muller incompatibilities in a polygenic model. Evolution, 2009. (page 61)
R. A. Fisher. The genetical theory of natural selection. Oxford University Press, Oxford, 1930. ISBN
0-19-850440-3. URL http://www.archive.org/details/geneticaltheoryo031631mbp. (page 53)
C Fra sse, P A Gunnarsson, D Roze, N Bierne, and J J Welch. The genetics of speciation: Insights from
Fisher's geometric model. Evolution, 70(7):1450{1464, 07 2016. doi: 10.1111/evo.12968. URL https:
//www.ncbi.nlm.nih.gov/pubmed/27252049. (page 41), (page 53), (page 60)
S.M. Gallo, D.T. Gerrard, D. Miner, M. Simich, B. Des Soye, C.M. Bergman, and M.S. Halfon. Red
y v3.0:
Toward a comprehensive database of transcriptional regulatory elements in Drosophila. Nucleic Acids
Research, 10:1{6, 2010. (page 13)
Jackie Gavin-Smyth and Daniel R Matute. Embryonic lethality leads to hybrid male inviability in hybrids
between Drosophila melanogaster and D. santomea. Ecology and Evolution, 3(6):1580{1589, 2013. (page
60)
Sergey Gavrilets. Fitness landscapes and the origin of species (MPB-41), volume 41. Princeton University
Press, 2004. (page 54)
Stephen Jay Gould and Richard C Lewontin. The spandrels of san marco and the panglossian paradigm: a
critique of the adaptationist programme. Proc. R. Soc. Lond. B, 205(1161):581{598, 1979. (page 69)
Michael W Gray, Julius Luke s, John M Archibald, Patrick J Keeling, and W Ford Doolittle. Irremediable
complexity? Science, 330(6006):920{921, 2010. (page 69)
M Grewal and K Glover. Identiability of linear and nonlinear dynamical systems. IEEE Transactions on
Automatic Control, 21(6):833{837, Dec 1976. doi: 10.1109/TAC.1976.1101375. (page 40)
Ryan N Gutenkunst, Joshua J Waterfall, Fergal P Casey, Kevin S Brown, Christopher R Myers, and James P
Sethna. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology,
3(10):e189, 2007. (page 61)
Jayson Guti errez and Steven Maere. Modeling the evolution of molecular systems from a mechanistic
perspective. Trends in plant science, 19(5):292{303, 2014. (page 11)
88
Wilfried Haerty and Rama S Singh. Gene regulation divergence is a major contributor to the evolution of
Dobzhansky{Muller incompatibilities between species of Drosophila. Molecular Biology and Evolution, 23
(9):1707{1714, 2006. (page 60)
JBS Haldane. Sex ratio and unisexual sterility in hybrid animals. Journal of Genetics, 12(2):101{109, 1922.
(page 7), (page 52)
Thomas F. Hansen and Emilia P. Martins. Translating between microevolutionary process and macroevolu-
tionary patterns: The correlation structure of interspecic data. Evolution, 50(4):1404{1417, 1996. ISSN
00143820, 15585646. URL http://www.jstor.org/stable/2410878. (page 62)
Emily E Hare, Brant K Peterson, Venky N Iyer, Rudolf Meier, and Michael B Eisen. Sepsid even-skipped
enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS genetics,
4(6):e1000106, June 2008a. (page 12)
Emily E Hare, Brant K Peterson, Venky N Iyer, Rudolf Meier, and Michael B Eisen. Sepsid even-skipped
enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genetics,
4(6):e1000106, 2008b. (page 41)
Bin Z He, Alisha K Holloway, Sebastian J Maerkl, and Martin Kreitman. Does positive selection drive
transcription factor binding site turnover? A test with Drosophila cis-regulatory modules. PLoS genetics,
7(4):e1002053, April 2011. (page 12)
Xin He, Duque, Thyago S. P. C., and Saurabh Sinha. Evolutionary origins of transcription factor binding
site clusters. Molecular biology and evolution, 29(3):1059{1070, March 2012. (page 11), (page 12), (page
25)
S Blair Hedges, Julie Marin, Michael Suleski, Madeline Paymer, and Sudhir Kumar. Tree of life reveals
clock-like speciation and diversication. Molecular Biology and Evolution, 32(4):835{845, 2015. (page 61)
Jennifer W Israel, Megan L Martik, Maria Byrne, Elizabeth C Ra, Rudolf A Ra, David R McClay, and
Gregory A Wray. Comparative developmental transcriptomics reveals rewiring of a highly conserved gene
regulatory network during a major life history switch in the sea urchin genus Heliocidaris. PLoS Biology,
14(3):e1002391, 2016. (page 40)
Johannes Jaeger. The gap gene network. Cellular and Molecular Life Sciences, 68(2):243{274, 2011. (page
40)
Johannes Jaeger, Svetlana Surkova, Maxim Blagov, Hilde Janssens, David Kosman, Konstantin N Kozlov,
Ekaterina Myasnikova, Carlos E Vanario-Alonso, Maria Samsonova, David H Sharp, et al. Dynamic control
of positional information in the early Drosophila embryo. Nature, 430(6997):368{371, 2004. (page 40)
Alexander D Johnson. The rewiring of transcription circuits in evolution. Current Opinion in Genetics &
Development, 47:121{127, 2017. (page 60)
Rudolf E. Kalman. Mathematical description of linear dynamical systems. J. SIAM Control, 1963. (page
6), (page 40), (page 45), (page 59), (page 71)
Rudolf E. Kalman, Peter L. Falb, and Michael A. Arbib. Topics in mathematical system theory. McGraw-Hill,
New York, 1969. ISBN 0754321069. (page 6), (page 45), (page 72)
M. Kasowski, F. Grubert, C. Heelnger, M. Hariharan, A. Asabere, S. M. Waszak, L. Habegger, J. Ro-
zowsky, M. Shi, A. E. Urban, M. Y. Hong, K. J. Karczewski, W. Huber, S. M. Weissman, M. B. Gerstein,
J. O. Korbel, and M. Snyder. Variation in transcription factor binding among humans. Science, 328(5975):
232{235, April 2010. (page 59)
89
Ah-Ram Kim, Carlos Martinez, John Ionides, Alexandre F Ramos, Michael Z Ludwig, Nobuo Ogawa,
David H Sharp, and John Reinitz. Rearrangements of 2.5 kilobases of noncoding DNA from the Drosophila
even-skipped locus dene predictive rules of genomic cis-regulatory logic. PLoS genetics, 9(2):e1003243,
2013. (page 11)
Motoo Kimura. Possibility of extensive neutral evolution under stabilizing selection with special reference
to nonrandom usage of synonymous codons. Proceedings of the National Academy of Sciences, 78(9):
5773{5777, 1981. (page 41)
Motoo Kimura et al. Evolutionary rate at the molecular level. Nature, 217(5129):624{626, 1968. (page 4)
Jack Lester King and Thomas H Jukes. Non-darwinian evolution. Science, 164(3881):788{798, 1969. (page
4)
K N Kozlov, Vitaly V Gursky, Ivan Kulakovskiy, and Maria G Samsonova. Sequence-based model of gap
gene regulatory network. BMC genomics, 15 Suppl 12(Suppl 12):S6, 2014. (page 11), (page 26)
K N Kozlov, Vitaly V Gursky, Ivan V Kulakovskiy, Arina Dymova, and Maria G Samsonova. Analysis of
functional importance of binding sites in the Drosophila gap gene network model. BMC genomics, 16
Suppl 13(Suppl 13):S7, December 2015a. (page 11), (page 13), (page 14), (page 15), (page 26)
Konstantin Kozlov, Vitaly V Gursky, Ivan V Kulakovskiy, Arina Dymova, and Maria Samsonova. Analysis
of functional importance of binding sites in the Drosophila gap gene network model. BMC Genomics,
2015b. (page 40)
I V Kulakovskiy, V A Boeva, A V Favorov, and V J Makeev. Deep and wide digging for binding motifs in
ChIP-Seq data. Bioinformatics, 26(20):2622{3, 2010. doi: 10.1093/bioinformatics/btq488. (page 13)
I.V. Kulakovskiy and V.J. Makeev. Discovery of dna motifs recognized by transcription factors through
integration of dierent experimental sources. Biophysics, 54(6):667{674, 2009. ISSN 0006-3509. doi:
10.1134/S0006350909060013. URL http://dx.doi.org/10.1134/S0006350909060013. (page 13)
Ivan V Kulakovskiy, Alexander V Favorov, and Vsevolod J Makeev. Motif discovery and motif nd-
ing from genome-mapped DNase footprint data. Bioinformatics, 25(18):2318{25, 2009. doi: 10.1093/
bioinformatics/btp434. (page 13)
Russell Lande. Models of speciation by sexual selection on polygenic traits. Proceedings of the Na-
tional Academy of Sciences, 78(6):3721{3725, 1981. URL http://www.pnas.org/content/78/6/3721.
abstract. (page 53), (page 54), (page 62)
Russell Lande and Stevan J. Arnold. The measurement of selection on correlated characters. Evolution, 37
(6):1210{1226, 1983. ISSN 00143820, 15585646. URL http://www.jstor.org/stable/2408842. (page
53)
Tuuli Lappalainen, Michael Sammeth, Marc R Friedl ander, Peter AC 't Hoen, Jean Monlong, Manuel A
Rivas, Mar Gonzalez-Porta, Natalja Kurbatova, Thasso Griebel, Pedro G Ferreira, et al. Transcriptome
and genome sequencing uncovers functional variation in humans. Nature, 501(7468):506{511, 2013. (page
59)
Hugo Lavoie, Herv e Hogues, Jaideep Mallick, Adnane Sellam, Andr e Nantel, and Malcolm Whiteway. Evo-
lutionary tinkering with conserved components of a transcriptional regulatory network. PLoS Biology, 8
(3):e1000329, 2010. (page 41)
Richard C Lewontin. The genetic basis of evolutionary change, volume 560, chapter The Structure of
Evolutionary Genetics, pages 12{16. Columbia University Press New York, 1974. (page 4)
90
X.-Y. Li, S. Thomas, P. J. Sabo, M. B. Eisen, J. A. Stamatoyannopoulos, and M. D. Biggin. The role of
chromatin accessibility in directing the widespread, overlapping patterns of drosophila transcription factor
binding. Genome Biology, 12:R34, 2011. (page 13)
Dorothea Lindtke and C Alex Buerkle. The genetic architecture of hybrid incompatibilities and their eect
on barriers to introgression in secondary contact. Evolution, 69(8):1987{2004, 2015. (page 61)
Julius Luke s, John M Archibald, Patrick J Keeling, W Ford Doolittle, and Michael W Gray. How a neutral
evolutionary ratchet can build cellular complexity. IUBMB life, 63(7):528{537, 2011. (page 69)
Richard W Lusk and Michael B Eisen. Evolutionary mirages: selection on binding site composition creates
the illusion of conserved grammars in Drosophila enhancers. PLoS genetics, 6(1):e1000829, January 2010.
(page 12), (page 25), (page 26)
Michael Lynch. The evolution of genetic networks by non-adaptive processes. Nature Reviews Genetics, 8
(10):803{813, October 2007a. (page 12), (page 25)
Michael Lynch. The evolution of genetic networks by non-adaptive processes. Nature Reviews Genetics, 8
(10):803{813, 2007b. (page 69), (page 83)
Michael Lynch and Allan Force. The probability of duplicate gene preservation by subfunctionalization.
Genetics, 154(1):459{473, 2000a. (page 69)
Michael Lynch and Allan G Force. The origin of interspecic genomic incompatibility via gene duplication.
The American Naturalist, 156(6):590{605, 2000b. (page 61)
Michael Lynch and William G Hill. Phenotypic evolution by neutral mutation. Evolution, 40(5):915{935,
1986. (page 57)
Katya L Mack and Michael W Nachman. Gene regulation and speciation. Trends in Genetics, 2016. (page
60)
Shamoni Maheshwari and Daniel A Barbash. Cis-by-trans regulatory divergence causes the asymmetric
lethal eects of an ancestral hybrid incompatibility gene. PLoS Genetics, 8(3):e1002597, 2012. (page 60)
Guillaume Martin. Fisher's geometrical model emerges as a property of complex integrated phenotypic
networks. Genetics, 197(1):237{255, February 2014. doi: 10.1534/genetics.113.160325. URL https:
//www.genetics.org/content/197/1/237. (page 53)
Carlos Martinez, Joshua S Rest, Ah-Ram Kim, Michael Ludwig, Martin Kreitman, Kevin White, and John
Reinitz. Ancestral resurrection of the Drosophila S2E enhancer reveals accessible evolutionary paths
through compensatory change. Molecular biology and evolution, 31(4):903{916, April 2014. (page 11)
Takeshi Matsui, Robert Linder, Joann Phan, Fabian Seidl, and Ian M Ehrenreich. Regulatory rewiring in a
cross causes extensive genetic heterogeneity. Genetics, 201(2):769{777, 2015. (page 41)
Ernst Mayr. The biological species concept. Species concepts and phylogenetic theory: a debate. Columbia
University Press, New York, pages 17{29, 2000. (page 48)
Pawel Michalak and Mohamed AF Noor. Association of misexpression with sterility in hybrids of Drosophila
simulans and D. mauritiana. Journal of Molecular Evolution, 59(2):277{282, 2004. (page 60)
Eric Mjolsness, David H Sharp, and John Reinitz. A connectionist model of development. Journal of
Theoretical Biology, 152(4):429{453, 1991. (page 40)
Hermann Joseph Muller. Some genetic aspects of sex. The American Naturalist, 66(703):118{138, 1932.
(page 83)
91
Masatoshi Nei, Takeo Maruyama, and Chung-I Wu. Models of evolution of reproductive isolation. Genetics,
103(3):557{579, 1983. (page 61)
Timothy W Nilsen. The spliceosome: the most complex macromolecular machine in the cell? Bioessays, 25
(12):1147{1149, 2003. (page 69)
H Allen Orr. Haldane's rule. Annual Review of Ecology and Systematics, 28(1):195{218, 1997. (page 52)
H Allen Orr, John P Masly, and Daven C Presgraves. Speciation genes. Current Opinion in Genetics &
Development, 14(6):675{679, 2004. (page 6), (page 61)
Mathilde Paris, Tommy Kaplan, Xiao-Yong Li, Jacqueline E Villalta, Susan E Lott, and Michael B Eisen.
Extensive divergence of transcription factor binding in Drosophila embryos with highly conserved gene
expression. PLoS genetics, 9(9):e1003748, 2013. (page 12)
David S Parker, Michael A White, Andrea I Ramos, Barak A Cohen, and Scott Barolo. The cis-regulatory
logic of Hedgehog gradient responses: key roles for gli binding anity, competition, and cooperativity.
Science signaling, 4(176):ra38{ra38, 2011. (page 11)
Mihaela Pavlicev and G unter P Wagner. A model of developmental evolution: selection, pleiotropy and
compensation. Trends in Ecology & Evolution, 27(6):316{322, 2012. (page 40)
Y.P. Petrov and V.S. Sizikov. Well-posed, ill-posed, and intermediate problems with applications, volume 49.
Walter de Gruyter, 2005. (page 40)
Patrick C Phillips. Maintenance of polygenic variation via a migration{selection balance under uniform
selection. Evolution, 50(3):1334{1339, 1996. (page 41)
Matthew Piazza, Xiao-Jiang Feng, Joshua D Rabinowitz, and Herschel Rabitz. Diverse metabolic model
parameters generate similar methionine cycle dynamics. Journal of Theoretical Biology, 251(4):628{639,
2008. (page 61)
Andrei Pisarev, Ekaterina Poustelnikova, Maria Samsonova, and John Reinitz. Flyex, the quantitative atlas
on segmentation gene expression at cellular resolution. Nucleic Acids Research, 37:D560{566, 2009. (page
14)
Bj orn Pl otner, Markus Nurmi, Axel Fischer, Mutsumi Watanabe, Korbinian Schneeberger, Svante Holm,
Neha Vaid, Mark Aurel Sch ottler, Dirk Walther, Rainer Hoefgen, et al. Chlorosis caused by two recessively
interacting genes reveals a role of RNA helicase in hybrid breakdown in Arabidopsis thaliana. The Plant
Journal, 2017. (page 59)
Art Poon and Sarah P. Otto. Compensating for our load of mutations: freezing the meltdown of small
populations. Evolution, 54(5):1467{1479, 2000. ISSN 1558-5646. doi: 10.1111/j.0014-3820.2000.tb00693.x.
URL http://dx.doi.org/10.1111/j.0014-3820.2000.tb00693.x. (page 53)
Adam H Porter and Norman A Johnson. Speciation despite gene
ow when developmental pathways evolve.
Evolution, 56(11):2103{2111, 2002. (page 41)
Daven C Presgraves. The molecular evolutionary basis of species formation. Nature Reviews Genetics, 11
(3):175{180, 2010. (page 61)
Andrea I Ramos and Scott Barolo. Low-anity transcription factor binding sites shape morphogen responses
and enhancer evolution. Philosophical transactions of the Royal Society of London. Series B, Biological
sciences, 368(1632):20130018{20130018, December 2013. (page 11)
92
Ulises Rosas, Nick H. Barton, Lucy Copsey, Pierre Barbier de Reuille, and Enrico Coen. Cryptic variation
between species and the basis of hybrid performance. PLOS Biology, 8(7):1{12, 07 2010. doi: 10.1371/
journal.pbio.1000429. URL https://doi.org/10.1371/journal.pbio.1000429. (page 41), (page 54),
(page 55)
Camille Roux, Christelle Fraisse, Jonathan Romiguier, Yoann Anciaux, Nicolas Galtier, and Nicolas Bierne.
Shedding light on the grey zone of speciation along a continuum of genomic divergence. PLoS Biology, 14
(12):e2000234, 2016. (page 61)
Isaac Salazar-Ciudad and Miquel Mar n-Riera. Adaptive dynamics under development-based genotype-
phenotype maps. Nature, 497(7449):361{364, May 2013. (page 11)
Md Abul Hassan Samee and Saurabh Sinha. Evaluating thermodynamic models of enhancer activity on
cellular resolution gene expression data. Methods, 62(1):79{90, July 2013. (page 11)
Aziz Sancar. The intelligent clock and the rube goldberg clock. Nature structural & molecular biology, 15
(1):23{24, 2008. (page 69), (page 82)
J.S. Schiman and P.L. Ralph. System drift and speciation. bioRxiv, page 231209, 2018. (page 82), (page
85)
Dolph Schluter. Evidence for ecological speciation and its alternative. Science, 323(5915):737{741, 2009.
(page 61)
D Schmidt, M D Wilson, B Ballester, P C Schwalie, G D Brown, A Marshall, C Kutter, S Watt, C P
Martinez-Jimenez, S Mackay, I Talianidis, P Flicek, and D T Odom. Five-vertebrate ChIP-seq reveals
the evolutionary dynamics of transcription factor binding. Science, 328(5981):1036{1040, May 2010. doi:
10.1126/science.1186176. URL https://www.ncbi.nlm.nih.gov/pubmed/20378774. (page 58)
Ole Seehausen, Roger K Butlin, Irene Keller, Catherine E Wagner, Janette W Boughman, Paul A Hohenlohe,
Catherine L Peichel, Glenn-Peter Saetre, Claudia Bank,
Ake Br annstr om, et al. Genomics and the origin
of species. Nature Reviews Genetics, 15(3):176{192, 2014. (page 61)
G Sella and A E Hirsh. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci
U S A, 102(27):9541{9546, July 2005. doi: 10.1073/pnas.0501865102. URL http://www.ncbi.nlm.nih.
gov/pubmed/15980155. (page 60)
Maria R Servedio, Yaniv Brandvain, Sumit Dhole, Courtney L Fitzpatrick, Emma E Goldberg, Caitlin A
Stern, Jeremy Van Cleve, and D Justin Yeh. Not just a theory { the utility of mathematical models in
evolutionary biology. PLoS Biology, 12(12):e1002017, 2014. (page 40)
Mark L Siegal and Aviv Bergman. Waddington's canalization revisited: developmental stability and evolu-
tion. Proceedings of the National Academy of Sciences, 99(16):10528{10532, 2002. (page 40)
Alexis Simon, Nicolas Bierne, and John J. Welch. Coadapted genomes and selection on hybrids: Fisher's
geometric model explains a variety of empirical patterns. bioRxiv, 2017. doi: 10.1101/237925. URL
https://www.biorxiv.org/content/early/2017/12/27/237925. (page 53), (page 60)
Montgomery Slatkin and Russell Lande. Segregation variance after hybridization of isolated populations.
Genetics Research, 64(1):51{56, 1994. (page 41), (page 55), (page 65)
Dave Speijer. Does constructive neutral evolution play an important role in the origin of cellular complexity?
making sense of the origins and uses of biological complexity. Bioessays, 33(5):344{349, 2011. (page 69),
(page 83)
K. Steova, D. Thybert, M. D. Wilson, I. Streeter, J. Aleksic, P. Karagianni, A. Brazma, D. J. Adams,
I. Talianidis, J. C. Marioni, P. Flicek, and D. T. Odom. Cooperativity and rapid evolution of cobound
transcription factors in closely related mammals. Cell, 154(3):530{540, August 2013. (page 58)
93
A J Stewart and J B Plotkin. The evolution of complex gene regulation by low-specicity binding sites.
Proceedings of the Royal Society B: Biological Sciences, 280(1768):20131313{20131313, August 2013. (page
25)
Arlin Stoltzfus. On the possibility of constructive neutral evolution. Journal of Molecular Evolution, 49(2):
169{181, 1999. (page 8), (page 69)
Arlin Stoltzfus. Constructive neutral evolution: exploring evolutionary theorys curious disconnect. Biology
direct, 7(1):35, 2012. (page 69), (page 84)
J A Sved. A two-sex polygenic model for the evolution of premating isolation. ii. computer simulation of
experimental selection procedures. Genetics, 97(1):217{235, January 1981. URL https://www.ncbi.nlm.
nih.gov/pubmed/17249074. (page 41)
E ors Szathm ary and John Maynard Smith. The major evolutionary transitions. Nature, 374(6519):227{232,
1995. (page 8), (page 69)
Amos Tanay, Aviv Regev, and Ron Shamir. Conservation and evolvability in regulatory networks: the
evolution of ribosomal regulation in yeast. Proceedings of the National Academy of Sciences of the United
States of America, 102(20):7203{7208, 2005. (page 41)
John R True and Eric S Haag. Developmental system drift and
exibility in evolutionary trajectories.
Evolution & Development, 3(2):109{119, 2001. (page 6), (page 40), (page 41), (page 59), (page 85)
Annie E Tsong, Brian B Tuch, Hao Li, and Alexander D Johnson. Evolution of alternative transcriptional
circuits with identical logic. Nature, 443(7110):415{420, 2006. (page 41)
Alexander Y Tulchinsky, Norman A Johnson, Ward B Watt, and Adam H Porter. Hybrid incompatibility
arises in a sequence-based bioenergetic model of transcription factor binding. Genetics, 198(3):1155{1166,
2014. (page 41)
Jenny Tung, Xiang Zhou, Susan C Alberts, Matthew Stephens, and Yoav Gilad. The genetic architecture of
gene expression levels in wild baboons. eLife, 4:e04729, 2015. (page 59)
Michael Turelli and Nicholas H. Barton. Genetic and statistical analyses of strong selection on polygenic
traits: What, me normal? Genetics, 138(3):913{941, November 1994. URL https://www.ncbi.nlm.nih.
gov/pmc/articles/PMC1206238/. (page 61)
Michael Turelli and H Allen Orr. The dominance theory of Haldane's rule. Genetics, 140(1):389{402, 1995.
(page 7), (page 52)
AJ Van der Schaft. Equivalence of dynamical systems by bisimulation. IEEE transactions on automatic
control, 49(12):2160{2172, 2004. (page 40)
Dominique J Verlaan, Bing Ge, Elin Grundberg, Rose Hoberman, Kevin CL Lam, Vonda Koka, Joana Dias,
Scott Gurd, Nicolas W Martin, Hans Mallmin, et al. Targeted screening of cis-regulatory variation in
human haplotypes. Genome research, 19(1):118{127, 2009. (page 59)
Je Vierstra, Eric Rynes, Richard Sandstrom, Miaohua Zhang, Theresa Caneld, R Scott Hansen, Sandra
Stehling-Sun, Peter J Sabo, Rachel Byron, Richard Humbert, et al. Mouse regulatory DNA landscapes
reveal global principles of cis-regulatory evolution. Science, 346(6212):1007{1012, 2014. (page 41)
Andreas Wagner. Evolution of gene networks by gene duplications: a mathematical model and its implica-
tions on genome organization. Proceedings of the National Academy of Sciences, 91(10):4387{4391, 1994.
(page 11), (page 40)
Andreas Wagner. Does evolutionary plasticity evolve? Evolution, pages 1008{1023, 1996. (page 40)
94
Eric Walter, Yves Lecourtier, and John Happel. On the structural output distinguishability of parametric
models, and its relations with structural identiability. IEEE Transactions on Automatic Control, 29(1):
56{57, 1984. (page 40)
Mi Wang, Severin Uebbing, and Hans Ellegren. Bayesian inference of allele-specic gene expression indicates
abundant cis-regulatory variation in natural
ycatcher populations. Genome Biology and Evolution, 9(5):
1266{1279, 2017. (page 59)
James D Watson and Francis HC Crick. Genetical implications of the structure of deoxyribonucleic acid.
Nature, 171(4361):964{967, 1953. (page 4)
Kenneth M Weiss and Stephanie M Fullerton. Phenogenetic drift and the evolution of genotype{phenotype
relationships. Theoretical Population Biology, 57(3):187{195, 2000. (page 6), (page 40), (page 59)
Karl R Wotton, Eva Jim enez-Guri, Anton Crombach, Hilde Janssens, Anna Alcaine-Colet, Steen Lemke,
Urs Schmidt-Ott, and Johannes Jaeger. Quantitative system drift compensates for altered maternal inputs
to the gap gene network of the scuttle
y Megaselia abdita. eLife, 4:e04785, 2015. (page 40), (page 60)
S. Wright. Evolution and the Genetics of Populations, Volume 1: Genetic and Biometric Foundations.
Evolution and the Genetics of Populations. University of Chicago Press, 1968. ISBN 9780226910383.
(page 55)
Sewall Wright. Evolution in populations in approximate equilibrium. Journal of Genetics, 30(2):257, 1935.
URL http://link.springer.com/content/pdf/10.1007/BF02982240.pdf. (page 41)
Lot A Zadeh and Charles A Deoser. Linear system theory. Robert E. Krieger Publishing Company
Huntington, 1976. (page 6), (page 40)
95
Abstract (if available)
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Investigating the evolution of gene networks through simulated populations
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Understanding the genetics, evolutionary history, and biomechanics of the mammalian penis bone
PDF
Robustness and stochasticity in Drosophila development
PDF
Complex mechanisms of cryptic genetic variation
PDF
Genome sequencing and transcriptome analysis of the phenotypically plastic spadefoot toads
PDF
Genetic architectures of phenotypic capacitance
PDF
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Decoding the embryo: on spatial and genomic tools to characterize gene regulatory networks in development
PDF
Exploring the genetic basis of quantitative traits
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
From gamete to genome: evolutionary consequences of sexual conflict in house mice
PDF
Applications of next generation sequencing in sessile marine invertabrates
PDF
Genomic and physiological characterization of Escherichia coli evolving in long-term batch culture
PDF
Studies in bivalve aquaculture: metallotoxicity, microbiome manipulations, and genomics & breeding programs with a focus on mutation rate
PDF
Genetic and molecular insights into the genotype-phenotype relationship
PDF
Biological interactions on the behavioral, genomic, and ecological scale: investigating patterns in Drosophila melanogaster of the southeast United States and Caribbean islands
PDF
Investigating the potential roles of three mammalian traits in female reproductive investment
PDF
The evolution of knowledge creation online: what is driving the dynamics of Wikipedia networks
PDF
Investigating hematopoietic stem cells via multiscale modeling, single-cell genomics, and gene regulatory network inference
Asset Metadata
Creator
Schiffman, Joshua Seth
(author)
Core Title
The evolution of gene regulatory networks
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Molecular Biology
Publication Date
10/30/2018
Defense Date
08/10/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Evolution,gene regulatory networks,OAI-PMH Harvest,speciation,Systems Biology
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Ralph, Peter (
committee chair
), Dean, Matthew (
committee member
), Ehrenreich, Ian (
committee member
), Nuzhdin, Sergey (
committee member
)
Creator Email
joshseth@gmail.com,jsschiff@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-100072
Unique identifier
UC11675175
Identifier
etd-SchiffmanJ-6919.pdf (filename),usctheses-c89-100072 (legacy record id)
Legacy Identifier
etd-SchiffmanJ-6919.pdf
Dmrecord
100072
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Schiffman, Joshua Seth
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
gene regulatory networks
speciation