Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Investigating the evolution of gene networks through simulated populations
(USC Thesis Other)
Investigating the evolution of gene networks through simulated populations
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Investigating the Evolution of Gene Networks Through Simulated Populations by Christopher Conow A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computational Biology and Bioinformatics) May 2020 Copyright 2020 Christopher Conow Acknowledgments First and foremost, I would like to thank my advisor, Sergey Nuzhdin, who has been nothing but supportive and patient in my academic journey at USC. Without his guidance and the environment of academic curiosity he fosters, I would not have been able to accomplish this. Speaking of that environment, I am so grateful for the wonderful array of people I have had the pleasure of working with in the Nuzhdin lab over my career as a graduate student. I would also like to acknowledge my friends and family, who have encouraged and believed in me from the beginning. ii Table of Contents Acknowledgments ii List Of Tables vi List Of Figures xi Abstract xxi Introduction 1 Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 1: Simulating the evolution of gene networks through a population of reproducing digital organisms 7 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Existing Computational methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1 Analytic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3 Grn-evo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.1 General description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4.1 Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4.2 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4.3 Gene regulatory networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.4 Phenotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.4.5 Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.4.5.1 Recombination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.4.5.2 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.4.6 Fitness and selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.4.6.1 Death . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.4.6.2 Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.4.7 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 iii 1.4.8 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.5.1 \Experiments" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.5.1.1 Environmental Signals . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.5.1.2 Targets of Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.5.2 Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 1.5.2.1 Fitness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 1.5.2.2 Chromosome Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 1.5.2.3 Gene Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 1.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 1.6.1 Detectable dierences in evolution . . . . . . . . . . . . . . . . . . . . . . . . 35 1.6.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 1.7 Code Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 2: Characterizing the utility of network statistics as a means of inferring evolutionary forces 37 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2.1 Network Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2.2 Statistics Collected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.2.3 Analysis and Classication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Chapter 3: Examining preliminary results and analysis of comparing stable popu- lations against their past selves 50 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.2 Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.3 Point Mutation Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.4 Recombination Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 4: Perspectives and Future Directions 63 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Future \experiments" and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2.1 Previously mentioned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2.2 Further directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3 Future extensions and modications to grn-evo . . . . . . . . . . . . . . . . . . . . . 68 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Reference List 70 iv Appendix A Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Appendix B Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 v List Of Tables 1.1 Values used to set environmental signals and phenotypic perturbations . . . . . . . . 26 2.1 Error matrices when testing validation sets against random forest classiers where test and validation sets are subset in a specic way. In the \Individual" column, we randomly sample 70% of individuals across the whole dataset for the training set, with the remaining 30% being used as the validation set. In the \Run" column, we subset by runs so that all observations from 70% of runs are used in the training set while all observations in the remaining 30% of runs are used in the validation set. The factor being used to train a given classier is shown on the left, with the specic classes that factor has just to the right. This gives us eight classiers shown, where each row before the dashed line has the number of times observations in the test set were classied as a given type (i.e. values in the rst column represent the number of times an observation labeled by the class in a given row was predicted to be the rst class, and so on). Diagonals represent correct classication. The values to the right of the dashed lines represent the error associated with classifying observations who belong to the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.2 Confusion matrices estimated by out-of-bag (OOB) events observed during training of a random forest ensemble on summary network statistics on each of the 1440 simulations run (i.e. one observation per run of the simulation). The factor being used to train a given classier is shown on the left, with the specic classes that factor has just to the right. This gives us four classiers shown, where each row before the dashed line has the number of times observations in the automatic OOB tests were classied as a given type (i.e. values in the rst column represent the number of times an individual observation by the class in a given row was predicted to be the rst class, and so on). Diagonals represent correct classication. The values to the right of the dashed lines represent the error associated with classifying observations who belong to the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . 44 vi 3.1 Confusion matrices when random forest ensembles trained on the rst 5; 000 itera- tions of a simulation are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the four classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . 52 A.1 (Constant environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with constant environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 A.2 (Cycle environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with cycle environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 A.3 (Rand environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with rand environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 vii A.4 (Rand hom environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with rand hom environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 A.5 (hm target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting hm are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out- of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 77 A.6 (sin target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting sin are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out- of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 77 A.7 (t3 target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting t3 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out- of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 78 A.8 (t4 target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting t4 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out- of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 78 viii A.9 (Mutation rate 10 3 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with point mutation rate 10 3 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.10 (Mutation rate 10 4 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with point mutation rate 10 4 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.11 (Mutation rate 10 5 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with point mutation rate 10 5 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A.12 (Recombination rate 10 3 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with recombination rate 10 3 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 ix A.13 (Recombination rate 10 4 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with recombination rate 10 4 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 A.14 (Recombination rate 10 5 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with recombination rate 10 5 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 x List Of Figures 1.1 Phenotypic output selected for usingt3 for the environmental signal given above each plot. Only phenotype 2 is sensitive to the environment. . . . . . . . . . . . . . . . . 27 1.2 Phenotypic output selected for usingt4 for the environmental signal given above each plot. Only phenotype 1 is sensitive to the environment. . . . . . . . . . . . . . . . . 28 1.3 Phenotypic output selected for using sin for the environmental signal given above each plot. All phenotypes are sensitive to the environment. . . . . . . . . . . . . . . 29 1.4 Phenotypic output selected for using hm for the environmental signal given above each plot. All phenotypes (including the initial values of phenotypes 2 and 3) are sensitive to the environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 1.5 Solid lines are phenotypic output produced by an individual taken from the nal generation of a population adapted to produce the t3 phenotype for the environ- mental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation and recombination rates both at 10 3 . . . . . . . . . . . . . . . . . . 31 1.6 Solid lines are phenotypic output produced by an individual taken from the nal generation of a population adapted to produce the t4 phenotype for the environ- mental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation and recombination rates both at 10 3 . . . . . . . . . . . . . . . . . . 31 1.7 Solid lines are phenotypic output produced by an individual taken from the nal generation of a population adapted to produce the sin phenotype for the environ- mental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation rate at at 10 3 and recombination rate at 10 5 . . . . . . . . . . . . . 32 xi 1.8 Solid lines are phenotypic output produced by an individual taken from the nal generation of a population adapted to produce the hm phenotype for the environ- mental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation and recombination rates both at 10 3 . . . . . . . . . . . . . . . . . . 33 2.1 OOB error estimates as trees are added to a random forest ensemble training to identify targets using data aggregated across whole runs. . . . . . . . . . . . . . . . . 40 2.2 Principle component analysis using network statistics of the 50 ttest individuals per iteration per run for the full dataset. The loadings for each subplot are identical and calculated with the full set. The plot is faceted in order to make comparisons between and among target types and environment types more clear. Coloring is by unique run per subplot, and is only indicative of a relationship between points within a given subplot rather than between subplots. . . . . . . . . . . . . . . . . . . . . . . 46 B.1 Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 83 B.2 Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by muta- tion rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . 84 B.3 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mu- tation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . 85 B.4 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotypet3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 B.5 Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 87 B.6 Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 88 xii B.7 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 89 B.8 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotypet3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 B.9 Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 91 B.10 Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by muta- tion rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . 92 B.11 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mu- tation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . 93 B.12 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotypet4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 B.13 Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 95 B.14 Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 96 B.15 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 97 xiii B.16 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotypet4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 B.17 Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 99 B.18 Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 100 B.19 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 101 B.20 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 B.21 Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 103 B.22 Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 104 B.23 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 105 xiv B.24 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 B.25 Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 107 B.26 Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 108 B.27 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 109 B.28 Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 B.29 Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 111 B.30 Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 112 B.31 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . 113 xv B.32 Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 B.33 Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 115 B.34 Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 116 B.35 Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 117 B.36 Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 B.37 Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 119 B.38 Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 120 B.39 Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 121 xvi B.40 Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 B.41 Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 123 B.42 Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 124 B.43 Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 125 B.44 Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 B.45 Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 127 B.46 Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 128 B.47 Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . 129 xvii B.48 Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 B.49 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 B.50 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotypet3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 132 B.51 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 B.52 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 B.53 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 B.54 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotypet4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . 136 B.55 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 xviii B.56 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 B.57 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 139 B.58 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 B.59 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 B.60 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . 142 B.61 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 143 B.62 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 B.63 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xix B.64 Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. . . . . . . . . . . . . . . . . . . . . . . . . 146 xx Abstract Studying gene networks and their evolution is hard. Inferring real networks is a daunting task that requires substantial data and sophisticated statistical methods. As such, our ability to study the evolution of biological networks is limited. Computational methods, then, are a compelling avenue explore and understand how gene networks might evolve. Most such methods are either entirely analytic or are constructed with a lot of assumptions aimed at making their results easier to characterize. In this work I introduce a new simulation framework called grn-evo that is more sophisticated than existing methods, and argue that the complexity of the resulting simulations is an asset that can be used to more deeply explore open questions in network evolution than existing methods. I additionally discuss possible future directions to take the analysis of the data sets presented, as well as further extensions to grn-evo itself. xxi Introduction & Overview Though extensive theory has been developed over the past century to explain the spread of alleles under varying circumstances[1] and substantial evidence that networks arising from the interaction of genes are ubiquitous, we know little about the evolutionary history of networks and whether their observed structure has played a role in adaptation[2]. Phenotypic robustness and evolvability vary within and between traits in natural populations, and the organization of biological networks that underlie the expression of those phenotypes drives this variation[3, 4, 5, 6, 7]. Signicant work is being done to reconstruct biological networks in order to better understand the behavior of specic networks and their sub-components[8, 9, 10, 11], but such studies have limited power to describe and understand the evolutionary origins of organizational features which underlie the behavior of a given network, and struggle to predict how that organization might aect or be aected by future evolution[12]. The diculties in studying the evolution of gene networks are both technical and conceptual in nature. While a number of metabolic and gene regulatory networks have been reconstructed utilizing genome sequencing and gene expression measurements in a range of organisms from yeast[8, 9] and E. coli[9] to fruit ies[13] and plants[14], these studies require signicant amounts of data and sophisticated statistical methods to construct models which are only able to accurately explain variation in phenotype under specic conditions[12, 15]. Thus, from an empirical standpoint, it 1 is not immediately clear to what degree certain properties common to gene networks, such as robustness and modularity, are a result of selective pressures actively shaping the properties of established networks versus a natural result of how such networks develop and are maintained in the rst place[2]. It is no wonder then that various models and simulations have been developed to computationally create tractable systems which, ideally, behave in a manner that is similar to real biology[4, 11, 16, 17, 18]. In an evolutionary simulation, all genetic, phenotypic, and environmental information can be known at all times and the eects of variation and perturbations in genotype or environment on the phenotype of an individual can be explicitly calculated. Performing repeated experiments to collect data on the evolution of a population under given circumstances then becomes an issue of computational time and space. In this way, many of the diculties present in studying the evolution of gene networks in real systems are not present in studies on simulations. Unfortunately, the computational complexity of evolutionary simulations is not trivial. Physi- cal simulations of molecular dynamics which underlie the simplest cellular functions is extremely costly[19]. Attempting to scale such simulations to the level of populations over generations is not feasible. Thus, simulations that mimic biological evolution must necessarily make abstractions, and abstractions by their nature lose some behavior inherent to the full system. Often in the process of creating such abstractions, processes are designed under simplifying assumptions that necessitate the behavior that simulation or model is meant to explore (e.g. building an evolutionary algorithm that selects for modular network structure and then nding that modular network structure evolves does not mean that evolution inherently promotes modularity; it simply demonstrates that evolu- tionary forces are able to produce modular structures when explicitly selecting for them). This can make it dicult to say whether the results are indicative of the way real biological systems evolve 2 and function or are merely an artifact of assumption and so interpreting models requires careful attention to these limitations. In this work, we introduce a new simulation of gene network evolution across populations of sexually reproducing virtual organisms. We then show that this framework is capable of running a wide array of \experiments" which exhibit emergent patterns of evolution that provide insight into variation which we observe in nature. The rest of this introduction is an outline of the chapters appearing in this document. Chapter 1 First, we describe our model and some high level results. We developed a new simulation using representations of gene-gene interactions and the manner in which mutational forces change, add, and remove genes from a population motivated by well studied cellular processes. We make modest assumptions regarding the maximum number of genes, the number of interactions between genes, and no assumptions about specic network features which may or may not be adaptive. A given \experiment" consists of a population evolving over many generations via individuals that are chosen to sexually reproduce and die with probability proportionate to tness. Fitness is calculated based on the output (phenotype) of each individual's gene network given an environmental context, and so over time a population evolves to produce the selected phenotype and adapts to its environment. High level results from several environmental, mutational, and selective contexts are shown in order to both demonstrate the exibility of the system, as well as motivate the specic parameters we will use for our experiments in subsequent chapters. We characterize the degree to which rates of mutation and recombination as well as environmental and selective pressures aect the evolutionary trajectories of various statistics. 3 Chapter 2 Next, we ask whether network structure in these simulations might in some way be in uenced pre- dictably by environment, selected phenotype, and per base mutation and recombination rates. How to identify and interpret signatures of selection on network structure remains an open question[7, 15, 20, 21]. It is attractive to assume that if a \complex" pathway exists to do something that could easily be accomplished by a simpler one, then the complexity must be evolutionarily advantageous. However, previous computational studies have found that there are often many dierent possible networks structures which can all behave identically under a given set of conditions and that these networks are often reachable from one another in a space of neutral modications[22, 23, 24, 25, 26]. These results indicate that it is likely that many \complex" network features are more indicative of heritage and history of neutral mutations than they are of direct selection on a network. We analyze replicate simulated experiments wherein populations initially are maladapted to a given environment and are allowed to evolve gene networks which \solve" the problem of producing a given adaptive phenotype subject to varying environmental and mutational perturbations, in order to observe whether network properties arise with specic selective pressures, or whether these commonly used statistics are rather indicative of a specic instance of initial adaptation independent of the environmental or mutational context. By analyzing the networks produced in the simulations from chapter one we nd that the latter is the case. Comparing distributions of network statistics, we nd that distinguishing between any two arbitrary runs of the simulation occurs readily in most cases while distinguishing between environment types or targets of selection using data from many independent replicate simulations does not. In aggregate there is evidence that selective pressures do aect the distribution of the statistics measured within a population, though still not to a large enough degree that it would be 4 generally possible to reliably identify the specic environmental or mutational forces a population experienced solely based on those statistics. Chapter 3 Building on the premise of the previous chapter, that network complexity is not necessarily a sign of selection, we explore the degree to which networks which are already adaptive change and evolve neutrally over many generations of stable selection. Since it has been argued that network features may arise or disappear in populations through random chance rather than specically being selected for or against simply due to the duplication of genes and subsequent drift in interactions between genes[2], this is something we can test in silico using these simulations. The populations used in chapter 2 serve as natural starting points to answer this second question, so we extend the number of iterations they are allowed to evolve. Knowing how much freedom a network has to evolve neutrally will allow us to more accurately identify features which arise due to dierences in selection or evolutionary history between populations. We expect the domain of neutral variation will depend signicantly on the structure of the initial network as well as the context of selection (i.e. how frequent/severe environmental and mutational perturbations are). We nd evidence that rates of mutation and recombination signicantly aect the amount of drift in network structure over time. The evidence is less consistent when comparing dierences between populations that share rates of mutation and recombination but dier in environmental uncertainty or phenotypic selection. Through these results and those of chapter 2, we can start to understand how to evaluate the signicance of network organization if we assume selective pressures remain mostly constant once they are introduced. 5 Chapter 4 Finally we discuss a number of future directions to pursue, from extensions to the analyses we discuss here to new experiments to run and even modications and additions to the capabilities of the model itself. 6 Chapter 1 Simulating the evolution of gene networks through a population of reproducing digital organisms 1.1 Introduction Studying the evolution of complex traits poses many challenges. Natural populations tend to be large with substantial genetic variation where individual variants have small eect on phenotype, and the selective benet of phenotypic variation can be highly context sensitive[27, 28, 29]. The raw amount of data required to disentangle all these eects from one another to understand the heritable basis of phenotypes and their eect on tness is astounding and usually not feasible to collect[12, 15]. While the pace at which phenotypic and genetic data is being collected has increased signicantly in the past few decades, it still is not nearly enough to answer many fundamental questions regarding the evolution of complex traits[7, 15, 20, 21]. To combat these issues in experimental settings, biologists frequently work with small sample populations that have little genetic variation outside a specic trait of interest, observed in carefully 7 controlled environmental contexts. This allows a greater degree certainty regarding the source of any phenotypic variation observed within an experiment, but does not allow investigation of more than a handful of factors which in uence phenotype at one time. As we collect more and more genetic data, being able to detect and interpret patterns in that data is and will be important to develop novel technologies that will provide solutions for a wide array of issues. 1.2 Existing Computational methods Given the aforementioned challenges associated with studying long term evolution of complex traits from a genetic standpoint, computational analysis is an attractive alternative. Such methods can broadly be split into two categories: analytic models and simulations. 1.2.1 Analytic Models In order to describe evolutionary phenomena in a way that can be expressed analytically, many assumptions are often made due to how dicult it is to derive a closed form expression that accurately describes a complex process. Factors that are dynamic in the real system may be assumed to be constant, such as the number of genes in question, the number of interactions between genes, population size, reproductive rate, etc.[4, 30] Relationships and eects are generally assumed to be linear. In making such strict assumptions, unfortunately, the range of phenomena that can accurately be modeled is restricted. The advantage of using an analytic model is that the behavior of such a model can be completely described, and when well formed absolutely yield valuable insights and results regarding the natural world. 8 Some early and fundamental examples of population genetic models would be the Wright- Fisher[31, 32] model and Moran model[33] which provide expectations for the rate of genetic change over time given various conditions. 1.2.2 Simulations Generally, simulation studies extend analytic models to allow the exploration of questions that cannot reasonably be answered through a closed form formula. This is accomplished by iteratively evaluating a set of equations that each individually describe one portion of the system being ex- amined until some stopping criteria has been satised. Such models can be used to approximate solutions to problems for which no analytic solution is known. Simulations are used to generate data based on a set of mathematical rules that have been proposed to be acceptably accurate representations of fundamental relationships that underlie a natural phenomena. The simulated data can then be analyzed in a number of ways, from testing against real data to verify the accuracy of the proposed rules that govern the model to predicting future outcomes as a basis for testable hypotheses. Because simulations are concerned with sets of outcomes rather than the whole space of possibilities, and thus do not need to be represented with closed form equations, they require fewer simplifying assumptions to model a system than a completely analytic model. An example of a simulation that relaxes a number of common assumptions in order to provide a compelling view of metabolic network evolution in asexually reproducing populations was presented by Hintze and Adami[34]. This evolutionary simulation takes a rules-based approach to dene a molecular chemistry which results in millions of possible unique genes. They allow genomes to grow and shrink in size up to a limit, and thus so do the number of genes vary. Their model served 9 as inspiration for this work, and their in uence can be seen in how we specify genes. However, their model has no concept of crossing over between lineages, whether through horizontal gene transfer or sexual reproduction, and their model of chemistry, while interesting and exible, is fairly abstract. Additionally, tness in their model is dened by simply adding up the number and amount of molecules individuals are able to metabolize in a given environment. In order to study more complex evolutionary dynamics where crossing over occurs and where more \interesting" phenotypes are selected for, we developed the model described shortly. Perhaps the most well known and widely used evolutionary simulation is called Avida[16]. It is not even strictly a simulation of biology since \organisms" are represented by a population of literal algorithms competing for CPU time and memory in order to reproduce, though its results have been used to gain insight into the evolution of complex biological systems[35, 36, 37]. While our model may not yet match the full scope and complexity of Avida, we believe it demonstrates that there is plenty of room for the development of simulation frameworks that more directly model real biological processes involved in evolution while still being computationally tractable. In an eort to develop a exible computational framework to study a wide range of possible evolutionary situations, we created grn-evo, which is a simulation that evolves populations of simple gene regulatory networks to produce selected patterns of phenotypic expression. 10 1.3 Grn-evo 1.3.1 General description Simulations in grn-evo consist of a population of diploid, sexually reproducing \organisms". Each individual possesses its own genome that codes for genes that are able to sense signals from either the environment or internal state and in turn change internal state or in uence a phenotype. Generations overlap, so organisms may live for many time steps (iterations) and the state of an organism and environment may change over its lifetime. Each iteration, an organism's phenotype is updated based on signals coming from the environment, and then phenotype is used to produce a tness value that determines the likelihood that the organism reproduces or is selected to be killed. A xed number of individuals are killed per iteration, and then are replaced so that the size of the population remains constant. An individual's phenotype is eected by the output of a network of interacting genes. These genes can be in uenced by the environment or by regulatory factors within an organism. The expression of genes creates new regulatory factors which then in uence the expression level of other genes and so on, while some genes ultimately aect phenotypic values upon which selection can act. When an organism reproduces, the chromosome it passes to the ospring may experience point mutations as well as recombination, which results in crossing over as well as a chance for tandem duplications or deletions of sections of \DNA". Through mutations, duplications, and deletions of gene sequences, the network that results from interactions between genes may have edges changed, nodes duplicated, and nodes deleted. Networks evolve in populations over generations through variation that may accrue, mix, match, and be exposed to explicit selection on output as well as implicit selection on structure due to the 11 chance that environmental or mutation perturbations disrupt the ability for signals to properly propagate to produce phenotype. 1.3.2 Goals Our intent is to provide a computational platform on which to run reasonable sized \experiments" that are not feasible to perform with real populations while iterating through an evolutionary process in a system that is too large and complex to describe analytically or completely explore empirically. This platform is exible enough that we can vary a large number of parameters and assumptions against one another without breaking the model. This lets us reasonably compare how these dierent conditions interact with one another, which provides an opportunity to study some real, general features of evolution. 1.4 Methods 1.4.1 Genome Each individual possesses a genome which is comprised of some number of homologous pairs of chromosomes. In this work, organisms only have one homologous pair. Chromosomes are repre- sented by linear string of \bases" in the range [0,3], with a single @ character denoting a centromere. The centromere is used for aligning the chromosomes to allow for recombination, which is discussed shortly. There is no strict upper limit to the size of a genome, though genomes larger than 15,000 bases experience reduced tness and the lower limit of genome size is the length of a single gene. 12 1.4.2 Genes Genes are comprised of six components, plus the sequence 0001 to indicate the start of a gene. Their specication is used to build a matrix and an initial value vector which in turn specify a system of dierential equations that describe the trajectories of a cell's phenotypic output given initial conditions. We will talk more about this in the next section. The rst component is four bases long. It is converted to a positive integer value and is taken modulo 5 to produce a number in the range [0,4] and species one of the ve following gene types. The rst type, specied by 0, is a gene whose expression is in uenced by the presence of an environmental signal or phenotypic signal and then through is able to change the internal state of the organism. That is, it has a binding site \sensitive" to environmental signal and it targets the binding sites of other genes. The second type, specied by 1, is a gene whose expression is in uenced by the internal state of the organism, and whose expression changes the phenotype of the organism. The third type, specied by 2, is a gene whose expression is in uenced in the same way as those of type 0, except their regulatory targets are phenotypes. The fourth type, specied by 3, is a gene whose expression is in uenced in the same ways as type 1, but whose expression in uences the internal state. The fth and nal type, specied by 4, is a gene which aects the initial conditions of the gene regulatory network when it is expressed. It can set the initial values for any of the internal states or phenotypic states. Environmental states cannot be in uenced by cells. The second component of a gene represents the binding sites a gene has dependent on its type. This sequence is represented by eight bases, split into two sets of four. For genes of types 0 and 2, the signal which aects it is determined by taking the value of these four bases modulo the length 13 of the environmental signal plus the length of the phenotypic signal. Note that this means there is a maximum of 256 signals, however currently we set the number of environmental signals to 1 and phenotypic signals to 3. Otherwise the binding sites are in uenced by internal state signals, and those values are taken modulo the number of possible internal signals which is currently set to 32. As an aside, the current implementation of this simulation requires the internal length to be a power of two. The third component is the active site of a gene and is represented by eight bases, held in two bytes, where we take the logical xor of the two bytes to produce a value in the range [0,255]. Similar to binding sites, if this gene in uences phenotype, the value of this site is taken modulo the number of phenotypes to determine which phenotype it in uences. If it in uences other genes, its value is just the integer value of the four bases. Active sites that aect the internal state are not perfectly specic. The next component of a gene controls the specicity of these interactions. The fourth component determines the specicity of genes which in uence the internal state of the cell. It has been shown in nature that non-specic binding is an important component to the evolution of gene regulatory networks[5, 38, 39, 40]. This component does not aect genes which in uence phenotype. Similar to how the active site is specied, we take the logical xor of the two bytes which represent it and then take it modulo the internal length to produce a value in the range of internal signals. We then look at each pair of bits, which correspond to a single \nucleotide", in this value to produce a mask which is used to determine which bases in the active site are non-specic, and which alternate bases may be recognized as a valid binding site. The following pseudocode outlines the algorithm where represents logical xor,^ logical and, and << a bitshift to the value on the left to the left by a number of bits specied by the value on the right. 14 Algorithm 1 Given an active site called target and specicity dened by spec dene the list of valid targets by vector targets = new vector for i = 0;i< 7;i+ = 2 do targets.append(target ((1<<i)^spec) targets.append(target ((2<<i)^spec) targets.append(target ((3<<i)^spec) end for return targets The fth component of a gene indicates whether this gene regulates its targets in a positive or negative manner. It is specied by four bases, or one byte. We count the number of 1 bits in the byte to produce a value in the range [0, 8]. Values in [0,2] mean negative regulation, values in the range [3, 5] mean zero regulation, and values in the range [6, 8] mean positive regulation. The reasoning behind this is to ensure that no single mutation can cause the sign of a gene's downstream regulation to ip and any change in sign must pass through zero rst by experiencing multiple mutations. A gene with a regulation direction of zero is non-functional. The sixth component is the degree to which a gene regulates its targets, which is represented by 16 bases, or four bytes. We use these four bytes to create a positive integer in the range [0; 2 32 1], then multiply the result by 0:7= 2 32 1 to obtain a positive value in the range [0, 0.7]. We use this range rather than [0, 1] to ensure that a single copy of a gene cannot have a regulatory value of 1. During testing we found that genes with \strong" downstream eects (i.e. close to 1) could lead to an extreme degree of heterozygosity because homozygotes were dramatically less t due to over-regulation of downstream targets of many genes. This value is multiplied by the direction component to determine how a gene regulates its targets. 15 1.4.3 Gene regulatory networks When each individual is \born", its genome is translated into a list of genes given the rules described in the previous section. Those genes are then used to dene an nxn matrix A where n is equal to the number of signals in the internal state plus the number of environmental signals and the number of phenotypic signals. For the simulations presented here, this results in n = 32 + 1 + 3 = 36 total signal values. This matrix is constructed by scanning through all genes of types 0-3, and putting their regulatory value in the locations n ij of N for each i corresponding to the binding sites of the gene and each j corresponding to the targets of the gene. Additionally, a vector of initial conditions b of length n is dened by genes of type 4 for values corresponding to internal state and phenotype based on those genes target sites, plus values set by the environment. We use this matrix and vector of initial conditions to dene an initial value problem by y 0 =b _ y =Ay: With the additional caveat that we prevent the vector of signals from ever being negative by setting any _ y value less than the input y value equal to that y value. This simply because gene expression levels must be non-negative in nature. We then use a numerical integrator to solve this problem over a given time period. The specic integrator used for these simulations is the midpoint method (i.e. Runge-Kutta order 2), and we integrate from 0 to 5 with a time step of 0.25. The midpoint method was chosen for it's reasonable accuracy and computational eciency, and the interval of integration and step size were chosen to 16 provide a reasonable number of time steps to use as phenotypic output for a gene regulatory network given initial conditions while not being too burdensome to calculate for thousands of individuals over tens of thousands of iterations in each of hundreds of simulations. Should these simulations eventually be modied to allow for the simulated evolution of net- works more closely resembling specic biological networks, a more accurate numerical integrator or more granular time steps over a larger interval may easily be used. The only concern would be computational time complexity. 1.4.4 Phenotype An organism's phenotype is dened as a set of vectors which each represent the individual pheno- typic values produced during a given iteration of the simulation. An organism may produce many such vectors in its lifetime if it is exposed to varying environmental conditions. These values are found in the appropriate subset of b specied in the previous section. 1.4.5 Reproduction Each iteration, some portion of the population is killed with probability proportional to tness. Individuals killed are replaced by iteratively choosing two parents who each oer a chromosome to their ospring so that population size remains constant. These new chromosomes are subject to point mutations as well as recombination which have the potential to change, duplicate, or delete genes. The proportion of the population replaced per iteration for all data presented here is 40%. 17 1.4.5.1 Recombination First, a parent copies its own chromosomes and determines the number of recombination \break- points" via a Poisson process distributed uniformly at random across each chromosome. It is possible for zero to be the number of breakpoints chosen, at which point one of the copied chro- mosomes is simply passed on to the ospring after experiencing point mutations. Recombination rate is per base per generation so that the likelihood of recombination at any site remains constant regardless of genome size. When a positive number of breakpoints is chosen, the two chromosomes are aligned with respect to a the centromere, and the sequences around each breakpoint on both chromosomes subject to the current alignment are compared. If they are suciently similar, then crossing over happens at that point. If they are not, then the chromosome arms are \bent back" one base at a time until either a match is found, or the maximal shift occurs. \Sucient similarity" is dened as there being a bitwise dierence of no more than 20% of bits between the six bytes surrounding the location of the breakpoint and the homologous six bytes around where crossing over would occur. Shifting cannot cross the centromere. If there is a match, crossing over happens at that point, and the next breakpoint is aligned with respect to the modied alignment the previous breakpoint produced. In this way, recombination both performs crossing over and introduces insertions and deletions (indels). If crossing over fails due to lack of sucient homology between the two chromosomes, the new organism is immediately killed and a new reproductive event is initiated. 1.4.5.2 Mutation After recombination, one of the new chromosomes is chosen randomly to be given to the new ospring. That chromosome then mutates at a number of sites which are chosen via a Poisson 18 process in a manner similar to how breakpoints are chosen. Each mutated site is changed to another base uniformly at random. 1.4.6 Fitness and selection An organism's tness is dened as the Euclidean distance between its phenotype vector and the environment's phenotype vector for the given environmental conditions. The lower this value, the more \t" the organism. Note that this is \tness" as it is typically used when discussing genetic algorithms. It is not tness as a measure of reproductive success, but rather indicates the relative likelihood of reproductive success in the context of the rest of the population. Reproduction and death are governed by a function which takes as input the tness of the organism and gives a value that is used to determine the likelihood of either event. These two values are calculated in dierent ways because the distribution of tness values is almost never symmetric, with the bulk of the population having tness values at or near the ttest, and then a long tail going out to less t individuals. Thus, we dene reproductive tness to be more sensitive to small dierences in phenotype than the tness corresponding to the likelihood of death. 1.4.6.1 Death In order to determine which individuals die each iteration, we create a list which is sorted based on tness plus a random value which is modied by a factor of the median tness of the population. Since a larger tness value is considered less t than one close to zero, we then take the last n individuals in the list where n is the number of deaths required and replace them. We can tune the eect of selection by changing how large a factor of median tness we use as the basis for the random term, which we call ks. A large ks means that the tness of an individual 19 needs to deviate by a large amount from the mean in order to meaningfully aect its likelihood of being chosen to die versus random chance. Conversely, a small factor means that the likelihood of an individual being chosen depends almost entirely on its tness. For the simulations presented here, we use a ks of 1. 1.4.6.2 Reproduction Some of the intuition behind how individuals are chosen for reproduction is similar to that of how they are chosen for death in that a list of individuals is made based on their tness modied by some factor of the median tness of the population. However, since we don't know ahead of time how many individuals are going to reproduce, it is likely for much of the population to have very similar tness, and we want to sample individuals with replacement for reproduction, the underlying data structure and algorithm is dierent. What we instead do is essentially split the interval [0; 1] into a number of bins equal to the number of individuals in the population with size proportionate to tness, and then sample randomly with replacement from those buckets. First we make a list of values by stepping through the population and calculating e (ss(1 fitness medFit )) where ss is a value which indicates the strength of selection, fitness is the tness of the given individual as dened above, and medFit is the median tness of the population. Once each value has been initialized, we nd the sum across the whole list and normalize the values by that sum. Finally, we step through the list in order and add the partial sum of all previous bins to the current bin. This results in a list of strictly increasing positive values that end in 1, where each value corresponds to an individual organism and the \gap" between a number and the next lowest 20 number is equal to that individual's normalized tness. The distance between that value and the next smallest value is larger for more t individuals, and smaller for less t individuals. The degree to which tness matters can be adjusted by ss. If ss is set to 0, then all intervals are of equal size. If ss is negative, a lower value for tness is selected for while conversely if ss is positive, then a higher value for tness is selected for. The magnitude of ss determines just how much tness matters. For these simulations we set ss to -10. This data structure allows us to select individuals randomly with probability proportionate to their relative tness by generating a random value in the range [0; 1] and then nding the individual who corresponds to the next largest number present in the list. 1.4.7 Environment The environment is represented by a vector of signals. This vector may remain constant through the course of a simulation, or may change. The magnitude of environmental signals aect how strongly an environment is able to drive gene expression, and also will aect any phenotypic signals being selected for that are sensitive to the environment. Various patterns of environmental change will aect whether and how organisms evolve to be sensitive to the environment. The range of possible environmental signals are specied by a le which contains at least as many rows of values as there are environmental signals. Each of these rows represents a distribution of values which may be sampled in order to produce the vector of environmental signals at any given point in time. There are four ways in which these distributions may be sampled. First, and simplest, the environment may remain constant. In this case, the vector of environ- mental signals are set to the rst element of the relevant rows in the given le. 21 Second, the environment may cycle, which means it starts at the rst value in each row and then every time the environment changes it moves to the next value in the row, wrapping around once it reaches the end. Rows may be of dierent lengths, which aect the period of the cycles. Third, the environment may change completely randomly. Each time the environment updates, a new value is chosen at random from a specied number of rows. Fourth, the environment may change randomly, but with linked signals. This is similar to the cycling environment except that rather than incrementing through the signals, a random point in the cycle is chosen. Note that this type is not used for any of the simulations presented in this work. For the latter two options, the environmental signal may be randomized either environment wide so that each individual in the population is experiencing the exact same environmental signal, or it may be randomized per individual. In the simulations presented, the environment wide context is used. If the environment is constant, then organisms are not going to be exposed to any information regarding which environmental signals should aect which phenotypes. If environmental signals are linked, then similarly some information will be hidden because when certain sets of signal values are always observed together along with phenotypic targets that are sensitive to the environment, it becomes more dicult to tell which of the signals is aecting the sensitive phenotype. The rate of change and the number of signals to change in the completely random environments, may be specied. Additionally, some of these signals may be used to aect the initial values for phenotypic signals in cells when numerically integrating. When such perturbations are used, we consider part of selection to be for homeostasis; that is, maintenance of phenotype in the face of external in uence. 22 1.4.8 Data Data was collected in two ways. First, every iteration a number of summary statistics were recorded. Second, population snapshots were periodically saved which contain detailed information about every living individual at the time the snapshot was taken. The following summary statistics were calculated across the population: length of the longest chromosome, the tness of the most t individual, the tness of the least t individual, the average population tness, the median population tness, the age of the ttest individual, the age of the oldest individual, the age of the youngest individual, and the average age of individuals in the population. Population snapshots are formatted provide sucient information to completely reconstruct the simulation at the iteration they were produced, as well as some additional information about the past of each individual. At the level of environment, the information saved is the number and possible values of signals present and the population size. At an individual level, we save the following information: age, tness, tness trajectory, pair- wise distance trajectory, phenotypic trajectory, environmental signal trajectory, target phenotype trajectory, generation, genes, genome, and number of ospring. Fitness trajectory is a list of tness values for every instance that it has been evaluated in the organism's lifetime. Phenotypic trajectory is a list of each set of phenotype values an organism produced at each time point in its lifetime. Environmental signal trajectory is a list of all environmental signal vectors an organism was exposed to in its lifetime. 23 Target phenotype trajectory is a list of each target phenotype value an organism's phenotype is compared to in order to produce tness. Pairwise distance trajectory is the Euclidean distance between the organisms phenotype values and the target phenotype values. Each tness value in the tness trajectory is equal to the sum of these for a given iteration. The generation to which an individual belongs is dened as one more than the greatest generation of an individual's parents. For each gene, we save the gene's basal expression rate, the specic value it targets, its binding site, how many other genes it in uences, whether or not the gene is used, whether we are imposing some external regulation which is used to study the eects of perturbing a network, the genomic location of the gene, how much the gene is being naturally regulated at that point in time, the expression level of the gene, all binding sites the gene can interact with since we allow for some degree of non-specic binding, the strength with which the given gene regulates its targets per unit of expression, and the maximal value at which the gene can be expressed. 1.5 Results In this section we outline a few high-level results to verify that the simulation outlined above does indeed model populations of individuals that evolve gene regulatory networks which approximately produce the phenotypes described in the previous section, with more detailed analysis of some features of the evolutionary process being covered in subsequent chapters. 24 1.5.1 \Experiments" We generated 1440 data sets by running ten replicate simulations per 144 dierent combinations of environment type and mutational contexts. Mutation and recombination rates each varied by 1e 3 , 1e 4 , and 1e 5 while four dierent environment types and four dierent targets of selection were used. We will refer to the environment types as: constant, cyclical, random, and random with selection for homeostasis (rand hom for short). All environments presented 1 signal value for organisms to sense, and organisms produce 3 phenotypes. Each experiment was seeded with a population of 5; 000 individuals, all initialized with the same genome. This genome codes for 60 genes, 18 of which aect phenotype in a way that results in tness worse than if no genes were expressed. This provides an initial basis for mutation to aect phenotype and thus selection to promote variants that have immediate advantage. Population size was held constant for all simulations, with 40% of the population being replaced with new individuals via birth and death processes outlined in the methods section. The values for ks and ss were 1 and10 respectively. Simulations were allowed to evolve initially for 5; 000 iterations which will be discussed in this chapter and the next, with another 5; 000 subsequent iterations for chapter 3 resulting in 10; 000 iterations total. We collected summary statistics for every iteration, and population snapshots every 1; 000 iterations. 1.5.1.1 Environmental Signals Table 1.1 shows the environmental signals used as described previously. Recall that the rst line is the one which species environmental signal. Constant environments always send 0.1 as their signal. Cyclical environments step through 0.1 to 1 with a step size of 25 environment 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 phenotype 1 0.2 0.4 0.6 0.8 1 0.9 0.7 0.5 0.3 0.1 phenotype 2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 phenotype 3 0.2 0.4 0.6 0.8 1 0.9 0.7 0.5 0.3 0.1 Table 1.1: Values used to set environmental signals and phenotypic perturbations 0.1 before wrapping back around to 0.1. Random environments change to a random value in the rst row at every iteration. The remaining lines are only used when we are perturbing the initial values used in numerical integration. The second row perturbs the value associated with the rst phenotypic signal, the third row perturbs the second phenotypic signal, and the fourth row perturbs the third phenotypic signal. Note that in all simulations considered in this document, these perturbations are only used when rows are randomly sampled independently from one another. 1.5.1.2 Targets of Selection Three of the targets of selection were dened by matrices, while one was dened by a set of equations. The matrices (to two decimal places) and their output are dened as follows where the right column vectors are the initial conditions passed into the system of dierential equations modeled by the given matrices. The last row/column of each matrix represent interaction with the single environmental signal, and then the next 3 columns back from the right and rows from the bottom represent phenotypes that are selected for. The remaining values (if they exist) are \hidden" from selection. The last row is always zero because organisms cannot in uence their environment. 26 Figure 1.1: Phenotypic output selected for using t3 for the environmental signal given above each plot. Only phenotype 2 is sensitive to the environment. First is the target we refer to as t3. It is dened by t3 = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 0 0:19 0 0:03 0 0 0:16 0 0 0 0:09 0 0 0:63 0:50 1 0 0 0 0 0:34 0 0 0 0:92 0 0 0:27 0 0 0 0 0 0 0 0 0 0 0 0 0 0:03 0 0 0:61 0 0:93 1 0 0 0 0 0 0 0:51 0 0 0 0 0 0 0 0 0 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 1 1 1 1 1 1 1 +e 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 And the resulting phenotypic trajectories over time are shown in Figure 1.1. 27 Figure 1.2: Phenotypic output selected for using t4 for the environmental signal given above each plot. Only phenotype 1 is sensitive to the environment. Next is the target t4, which is dened by t4 = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 0 0 0:65 0:69 0 0 0:5 0 0 0:81 0 0 0:13 0 0 0 0 0 0 0 0 0:76 0 0:20 0 0 0 0 0 0 0 0 0:63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 1 1 1 1 1 1 +e 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 The phenotypic trajectories this produces are shown in Figure 1.2. 28 Figure 1.3: Phenotypic output selected for using sin for the environmental signal given above each plot. All phenotypes are sensitive to the environment. The third target for selection using a matrix, sin, is dened by sin = 2 6 6 6 6 6 6 6 6 6 6 4 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 3 7 7 7 7 7 7 7 7 7 7 5 2 6 6 6 6 6 6 6 6 6 6 4 1 0 0 1 +e 3 7 7 7 7 7 7 7 7 7 7 5 Figure 1.3 shows the phenotypes this matrix produces. The last target, referred to as hm, is dened by the following three equations p 1 (t) =et; p 2 (t) =e; p 3 (t) = e t + 1 ; 29 Figure 1.4: Phenotypic output selected for using hm for the environmental signal given above each plot. All phenotypes (including the initial values of phenotypes 2 and 3) are sensitive to the environment. where p n (t) is the value of the given phenotype given time and e is the environmental signal. Figure 1.4 shows the output for hm. Recall that all these targets these are evaluated for each time point in [0, 5] using step sizes of 0.25. Targets t3 and t4 were chosen as two targets that each have two phenotypes that are expressed independently from the environment, with a third phenotype that is sensitive to the environment. As shown in Figures 1.1 and 1.2, the two environmentally insensitive phenotypes have similar curves between t3 and t4, while their environmentally sensitive phenotypes are markedly dierent; the phenotype int3 that is sensitive to environmental signals (phenotype 2) is very strongly aected, while in t4, phenotype 1 is only slightly sensitive to the environment. The sin target, shown in gure 1.3, was chosen to have a more \dicult" to reach set of phenotypic trajectories, all of which are aected in magnitude by the environment. 30 Figure 1.5: Solid lines are phenotypic output produced by an individual taken from the nal gener- ation of a population adapted to produce the t3 phenotype for the environmental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation and recombination rates both at 10 3 Figure 1.6: Solid lines are phenotypic output produced by an individual taken from the nal gener- ation of a population adapted to produce the t4 phenotype for the environmental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation and recombination rates both at 10 3 31 Figure 1.7: Solid lines are phenotypic output produced by an individual taken from the nal gener- ation of a population adapted to produce the sin phenotype for the environmental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation rate at at 10 3 and recombination rate at 10 5 Finally hm was chosen as another \dicult" target to evolve towards due to the fact that the set of equations which yield the target phenotypes have varying initial conditions depending on the environmental signal for phenotypes 2 and 3, which is not possible for organisms in this simulation to imitate. Figure 1.8 shows one adaptation where the initial values evolved are roughly halfway between the lowest and highest initial values specied by hm. The examples shown in Figures 1.5-1.8 demonstrate that very close approximations of each of these targets can be achieved by this evolutionary model, even for signals that the organisms never directly experienced (e = 0). This is why in Figures 1.5 and 1.6, the only noticeable deviation from the target is at e = 0, and that is the value at which Figure 1.7 deviates the most. 32 Figure 1.8: Solid lines are phenotypic output produced by an individual taken from the nal generation of a population adapted to produce the hm phenotype for the environmental signal indicated above each plot. The dashed lines are the exact curve being selected for each phenotype. This population evolved with a cyclical environment and mutation and recombination rates both at 10 3 1.5.2 Trajectories 1.5.2.1 Fitness Populations generally reach peak tness within1; 000 iterations. For example, Figures B.1-B.4 show this for the simulations evolving toward t3 in each environment type (refer to appendix B for gures referenced with the prex B in the main text). Mutation and recombination rates seem to matter more than environment type in terms of how quickly a population produces individuals with peak tness, though environment type does still have a noticeable eect where more uncertainty seems to promote a more rapid convergence e.g. in Figure B.4 where homeostasis is being selected. This eect is even more marked in Figures B.5-B.8 which indicate the population average tnesses for the same data set. These patterns are consistent within each of the other targets of selection. See Figures B.9-B.16 for t4, Figures B.17-B.24 forsin, and Figures B.25-B.32 for hm. 33 1.5.2.2 Chromosome Size Chromosome sizes are similarly in uenced by the interactions between environment, targets of selection, mutation, and recombination (see Figures B.33-B.48 for all plots of chromosome size). At low levels of recombination and mutation rates, genome size is fairly constant across all environments and targets of selection, with a few rare exceptions. However, when mutation and recombination rates are 10 4 or greater, we see noticeable variation accrue in the size of genomes, with the greatest variation commonly occurring at mutation and recombination rates both at 10 3 . Note, however, the greatest length occurs at a recombination rate of 10 3 but mutation rate of 10 4 with the hm target of selection and the random environment with phenotypic perturba- tions (Figure B.48). The second greatest length outside of that treatment occurs with the same environment, target of selection, and mutation rate, but with recombination rate also at 10 4 . Also of note is that generally when chromosome size changes, it initially increases and then levels out and sometimes decreases. Very rarely do we see an immediate drop in chromosomal length. This rise and fall consistently occurs when both mutation and recombination rates are both 10 3 , and is \sharpened" by more stressful environments (especially the random environment with phenotypic perturbation) and the sin and hm targets of selection. 1.5.2.3 Gene Count Figures B.49-B.64 show the trajectories for the number of genes present in the ttest individual for the given simulations over 5; 000 iterations. Variability in the number of genes is most strongly correlated with mutation rate. Interestingly, the mean doesn't change too much across almost any simulation, but there is clearly a lot of variation within a consistent range, indicating that there is regular turnover in genes. 34 1.6 Discussion 1.6.1 Detectable dierences in evolution All of the simulations considered were able to adapt to some degree to their environment and produce an approximation of the phenotype they were selected to produce. These phenotypes are driven by a network of genes, and these genes clearly undergo a fair amount of turnover and selection given the variation in number of genes that occurs over time within runs, as well as the size of chromosomes. Chromosomal size being most strongly aected in populations evolving in \stressful" conditions is interesting. It is likely that the initial increase in chromosome size is associated with increasing the likelihood of generating new genes, either through duplications or simply because mutation rates are per-base and so more bases means more places for genes to arise de-novo. After early adaptation occurs, this extra size may then become a liability, and so selection acts to reduce the amount of \DNA" not involved in producing the adaptive phenotype. Unfortunately, much of this plays out between the rst and thousandth iteration; a time frame for which we did not save population snapshots. Exploring the dynamics of early adaptation in such simulations is an exciting future direction. 1.6.2 Conclusion We have demonstrated that our simulation framework is able to represent a wide array of evolu- tionary situations and reliably produce populations that adapt to the constraints in which they reproduce. This sets the stage for our next chapters where we will use our population snapshots to analyze statistics of the gene regulatory networks that have evolved to determine the degree to 35 which such statistics provide evidence of the mutational and selective forces shaping the evolution- ary history of our simulated populations. 1.7 Code Availability Source code for grn-evo is available at https://bitbucket.org/conow/grn-evo/src/master/ 36 Chapter 2 Characterizing the utility of network statistics as a means of inferring evolutionary forces 2.1 Introduction Patterns of selection in nature are identied by studying distributions of phenotypes and genetic factors associated with them. It stands to reason, then, that selection must also in uence the structure of networks that result from interactions between genetic factors that in turn give rise to phenotypes. Many studies have shown that high-level network statistics found in biological networks dier from what one would expect if those networks had formed completely at random[41, 42, 43], however it is uncertain how many of those features can be attributed to selective forces specic to the circumstances under which a network evolved as opposed to more universal pressures and processes which are applicable to gene networks in general[44, 45], and even when a compelling case can be made for adaptation, determining the adaptive benet of a structure can be dicult[46]. With the simulations described in the previous chapter, we aim to begin to unravel such questions by building networks from the genes possessed by individuals, and generating statistics on those 37 networks. We then explore the degree to which those statistics can be used to classify networks into the appropriate environmental, selective, and mutational contexts in which they evolved. 2.2 Methods 2.2.1 Network Construction Constructing networks from a list of genes is fairly straightforward. First, genes that have zero regulatory impact on their downstream targets are discarded. Next, remaining genes which have either a basal expression level greater than zero or whose expression is in uenced by environmental signal are used to initialize a list of \visited" genes. From these genes, a breadth-rst search is performed over the remaining, unclassied genes, adding their downstream targets to the list of visited genes until the whole graph has been explored, considering phenotypes to be terminal nodes when genes which in uence them are found. Finally, the list of visited genes is pruned by returning to the initial list of genes (i.e. those which are in uenced by environment or have a basal expression level) and enumerating all simple paths from those genes to all in uenced phenotypes, keeping only genes which lie on those paths. The above process results in directed networks that consist only of those genes which receive some input signal and are able to eventually in uence at least one phenotype. Genes which are never activated or cannot ever in uence a phenotype are ignored. 2.2.2 Statistics Collected Once a network is created, the NetworkX Python package[47] was used to calculate the following statistics on the directed version of the graph: number of edges, number of nodes, mean in/out 38 degree, and density. Using the same package, the following statistics were calculated on the undi- rected version of the graph: degree assortivity, mean degree, eciency, density, clustering coecient, modularity, and number of communities. In the case of statistics calculated exclusively on the undi- rected version of a network, this is due to either unreasonable computational complexity required to calculate them on the directed version, or there being no algorithm provided that accepts a directed graph. Additionally the mean pleiotropy of nodes (genes) was calculated for each graph, where pleiotropy is the number of phenotypes reachable from a given node, was calculated as well as the number of genes identied as \used" or \unused" based on the above network construction algorithm. 2.2.3 Analysis and Classication Principal component analysis (PCA) was used to visualize the degree to which these statistics dier between simulation runs, and random forest classiers were trained on the data to determine how well we are able to classify runs into their correct environment, target of selection, or mutational context based on the collected statistics. These were performed in R with PCA using the default routine that comes packaged with R[48], and random forests using the randomForest package[49]. 2.3 Results Due to the computational complexity of calculating network statistics and the sheer number of networks the simulations produce, network data were collected from only 50 individuals in each population snapshot rather than the full population of 5; 000 individuals. These individuals were chosen by sorting the population by number of ospring produced per individual, breaking ties by then comparing their \tness" value at the given iteration as dened in Chapter 1, and then taking 39 Figure 2.1: OOB error estimates as trees are added to a random forest ensemble training to identify targets using data aggregated across whole runs. the top 50 in that sorted list. This ensures that a representative sample of individuals from each population snapshot were collected which possess functional networks that produce the selected phenotype as well as the population was able to at that point in its evolutionary history. To train classiers and interpret their output, data were processed in a few ways. All random forest ensembles were trained with 200 trees, which was found to be sucient to reach a stable level of classication (Figure 2.1 as an example). The default number of variables to randomly sample as candidates at each split was used ( p p wherep is the number of variables), modication of which was found to not signicantly eect the accuracy of the classier. The rst batch of random forest ensembles were used to classify individuals into environment types, targets of selection, mutation 40 rates, and recombination rates by simply randomly sampling 70% of individuals from the collected data to create training sets and with the remaining 30% acting as validation sets. The classiers all perform with roughly 95% accuracy or better (see left side of Table 2.1) and the validated error rate closely match the out-of-bag (OOB) estimates that the classiers generate while training (within 1-2%). There are issues with sub-sampling in this way, however. Since individuals are not all independent observations, this process may be recognizing individuals who come from the same iteration of the same population and thus are extremely similar, eectively testing trees against individuals nearly identical to those were used to build them in the rst place, resulting in extremely low rates of error. In other words, this is a lot like testing the classiers against their own training set with a small amount of noise added. In order to determine whether this is the case, we re-run the process but instead of randomly sampling individuals, we randomly sample by runs. In practice since we have 10 replicate runs per unique treatment we assign a value from 1 to 10 to each run so that no two runs from the same treatment have the same run number. Then we generate seven random numbers and take all individuals from runs whose number matches one of those randomly generated, with the individuals from the remaining runs used as the validation set. This means that no individuals used in the validation set are related to any individuals in the training set except by the environmental, selective, and mutational contexts under which they evolved. The results of testing validation sets on these classiers are shown on the right side of Table 2.1. The resulting OOB estimates of training in this way did not change compared to sampling by individual, indicating that the dramatic change in training accuracy is not due to chance, and the high discordance between OOB estimates and actual testing of the validation set indicate that the classiers are tting to patterns in the training data that don't exist in the validation sets. 41 Subsetting per Individual Run Environment constant 25640 455 565 481 0.055 9377 5610 5768 6245 0.65 cycle 612 25075 672 346 0.061 7125 7663 7363 4849 0.72 rand 571 675 25139 488 0.063 6492 7504 6775 6229 0.75 rand hom 256 224 297 26545 0.028 3218 1768 2328 19686 0.27 Target phenotype hm 26077 205 277 226 0.026 21625 1530 1695 2150 0.20 sin 201 26463 197 134 0.020 4248 17434 2804 2514 0.35 t3 308 261 26091 586 0.042 4495 2670 15432 4403 0.43 t4 244 274 806 25651 0.049 3085 2365 6522 15028 0.44 Mutation rate 10 3 34342 1319 199 0.042 28509 6483 1008 0.21 10 4 1394 34546 337 0.050 14666 17522 3812 0.51 10 5 235 453 35176 0.019 6970 14106 14924 0.59 Recombination rate 10 3 34236 1070 701 0.049 20996 9783 5221 0.42 10 4 1050 33739 1227 0.063 13080 14942 7978 0.59 10 5 435 715 34828 0.032 9511 10834 15655 0.57 Table 2.1: Error matrices when testing validation sets against random forest classiers where test and validation sets are subset in a specic way. In the \Individual" column, we randomly sample 70% of individuals across the whole dataset for the training set, with the remaining 30% being used as the validation set. In the \Run" column, we subset by runs so that all observations from 70% of runs are used in the training set while all observations in the remaining 30% of runs are used in the validation set. The factor being used to train a given classier is shown on the left, with the specic classes that factor has just to the right. This gives us eight classiers shown, where each row before the dashed line has the number of times observations in the test set were classied as a given type (i.e. values in the rst column represent the number of times an observation labeled by the class in a given row was predicted to be the rst class, and so on). Diagonals represent correct classication. The values to the right of the dashed lines represent the error associated with classifying observations who belong to the class in the given row. 42 While random forest ensembles are dicult to over-t, these results could be explained by the classiers over-tting on features that are specic to individual runs and as a result are poor at classifying individuals which come from runs that the training routine never \saw". In order to verify that this is not the case, we calculated summary statistics for entire runs based on the 50 individuals sampled per iteration. For each statistic, described in the methods, we found the maximum, minimum, mean, median, variance, coecient of variation, and range (maximum minus minimum) for the entire run. In this way, we can train the classier using single observations from each run, eliminating the chance that the classier will over-t to specically identify which run an individual comes from. The results of this are in Table 2.2. OOB estimates of error when classifying for environment and target phenotype with these aggregate statistics match very closely with the error found in validating the classiers trained on individuals sub-sampled by run (compare Table 2.2 to the right side of Table 2.1). OOB estimates of error for classiers based on rates of recombination and especially mutation trained on aggregate statistics are less concordant with, and generally lower than, the error found in classifying based on those factors using individuals sub-sampled by run. 2.4 Discussion Table 2.1 shows that there is a strong \run" eect when it comes to network statistics and pre- sumably network structure. As initial adaptation occurs, it seems certain features arise and spread through the population that fall within the space of adaptive network solutions for the problem of producing a phenotype given environmental context. These features are equally adaptive, but have quantitatively dierent structure. As a result, further evolution along each lineage inherits that initial adaptive structure, and all resulting network adaptation and drift act to modify that 43 Environment constant 134 93 88 45 0.63 cycle 88 102 118 52 0.72 rand 101 115 98 46 0.73 rand hom 33 32 31 264 0.27 Target phenotype hm 286 34 22 28 0.21 sin 34 263 25 38 0.27 t3 34 30 206 90 0.43 t4 37 28 91 204 0.43 Mutation rate 10 3 320 16 0 0.048 10 4 38 271 27 0.19 10 5 0 30 306 0.089 Recombination rate 10 3 271 109 100 0.44 10 4 105 233 142 0.51 10 5 64 120 295 0.39 Table 2.2: Confusion matrices estimated by out-of-bag (OOB) events observed during training of a random forest ensemble on summary network statistics on each of the 1440 simulations run (i.e. one observation per run of the simulation). The factor being used to train a given classier is shown on the left, with the specic classes that factor has just to the right. This gives us four classiers shown, where each row before the dashed line has the number of times observations in the automatic OOB tests were classied as a given type (i.e. values in the rst column represent the number of times an individual observation by the class in a given row was predicted to be the rst class, and so on). Diagonals represent correct classication. The values to the right of the dashed lines represent the error associated with classifying observations who belong to the class in the given row. 44 structure. At the same time, these descriptive combination of features are much less strongly tied to the specic environment, phenotypic selection, and mutational forces experienced by a given run. This means that while individual runs are identiable from the combination of network statistics which describe them, that combination of statistics are much less descriptive of the selection that the population experienced (left and right sides of Table 2.1 respectively). Figure 2.2 provides a visual representation of this. Columns show targets of selection and rows show environments. The PCA was calculated with the full data set, so each subplot is shown us- ing the same principle components. Color is specied by unique runs within a given subplot, and are re-used between subplots so they do not indicate any relationship across dierent environmen- tal/selection contexts. We see clusters of colors showing that individual runs occupy a constrained sub-space of the overall distribution that the replicate runs cover in aggregate. However, comparing the two right most columns of this gure, it is clear that the t3 and t4 targets produce combina- tions of network statistics per environment (row) more similar to one another than to the other two targets. This matches up with the fact that the random forest classiers trained to classify by target type \confuse" targetst3 andt4 three times more frequently than with or between any other target. We argue that this lends further evidence to the idea that gene networks occupy a generally \neutral" space of congurations as long as they produce a viable phenotype and speculate that selective pressures aect the boundaries of that space rather than directly acting to cause a dier- ent mean value to be more adaptive. This ties into studying cryptic variation, and is a potentially exciting future direction to explore with this simulation framework. When we trained random forests on aggregate statistics calculated from individuals across a whole run (Table 2.2), we found that the classiers for environment type and target of selection are essentially just as accurate as when we train and test random forests on the individual based data 45 Figure 2.2: Principle component analysis using network statistics of the 50 ttest individuals per iteration per run for the full dataset. The loadings for each subplot are identical and calculated with the full set. The plot is faceted in order to make comparisons between and among target types and environment types more clear. Coloring is by unique run per subplot, and is only indicative of a relationship between points within a given subplot rather than between subplots. set while sub-setting by run (Table 2.1, right side). This means that the error in classication in the individual data set is not due to over-tting on run-specic features. Since the accuracy in classifying targets of selection vary from 60-80%, this supports the idea that high-level network statistics have some, but limited, use as a way to argue that two networks evolved to \solve" similar phenotypic problems. On the other hand, classifying by environment is generally not much better than guessing for three of the environments we used (constant, cycle, and rand), while the rand hom environment is more correctly classied. 46 Mutation and recombination rates, on the other hand, show evidence of over-tting to the training set resulting in lower accuracy in predicting the validation set when training and testing with individual data subset by runs (Table 2.1 right side) when compared to training on aggregate data per run (Table 2.2). Mutation rates in particular are classied with reasonable accuracy (5- 10%) when trained on the aggregate data set as opposed to 20-50% in the classier trained on individual data. We believe that a few things are going on here. First, the highest rate of mutation is always the most accurately classied. This likely has to do with our observations in Chapter 1 that populations with high mutation rate reach a very t solution much more quickly and consistently than with lower mutation rates, and in conjunction with the more accurate classication of network statistics indicates that the space of highly t networks is probably more constrained than the space of somewhat less t networks. As an aside, this also demonstrates the value in having a model framework that allows for variation in the output of a network rather than requiring all networks to produceexactly a single pattern of output given an input in order to successfully reproduce. Another possible hypothesis to explain lower mutation rates being less readily recognized is that there may be some feature of networks that continues to evolve and adapt even after further improvements in tness have slowed down, while at higher mutation rates those features nish adapting much more quickly. Further investigation will be required. Another possible explanation for the dierence between accuracy in classication of mutation rate depending on the form of the data used (individual or aggregate) to train the classier is that over time, mutations are the direct factor which change networks. When we classify by individuals across whole runs, we have no information in terms of how much change individuals in that population have experienced compared to their ancestors, nor how much variation is present within a population. If the space of network statistics is more constrained by environment type and 47 phenotypic selection, then mutation rate does more to aect how much a population drifts within that space than it does to aect where in that space an individual's network statistics lie. In the aggregated statistics used, the classier now has information regarding the distribution of values an individual run explores over evolutionary time, which would explain why such classiers are more accurate as more drift occurs, dening more consistently the boundaries in which evolution is constrained by selection. One last hypothesis is that higher rates of mutation are simply more \harsh", and so networks which experience high rates of mutation evolve to be structured in a way that provides robustness to mutational perturbation. Similar patterns are observed for recombination, though to a much lesser degree. This is likely due to the fact that recombination only has an eect when it occurs in an individual who is heterozygous for some portion of its genome. 2.4.1 Conclusion In this chapter we have shown that network evolution is more complex than can easily be charac- terized by summary statistics, and nd support for previous studies which argue that the structure of networks may not on their own provide denitive evidence of network function and selective pressures[2, 22, 23, 24, 25, 26, 44, 45, 46]. More specically, we have demonstrated that our simula- tions allow for the investigation of how environmental contexts, phenotypic selection, and mutation aect the range and type of network structures that evolve in response to such pressures. Since we see variation in how well random forest ensembles classify data sets dependent on the aforemen- tioned pressures, this simulation framework can be used to study in more detail specic network 48 motifs that underlie higher-level statistics. Such results may be used to generate testable hypothe- ses for patterns we expect to see in nature as new datasets and methodologies are developed to construct and annotate real biological networks. Future work with this dataset to enumerate in more detail the degree to which specic statistics aect the accuracy of classication algorithms, as well as how combinations of environment type, phenotypic selection, and mutational rates aect that accuracy are exciting and readily explored possibilities. 49 Chapter 3 Examining preliminary results and analysis of comparing stable populations against their past selves 3.1 Introduction Though conserved gene network modules and motifs are commonly found between and across many organisms[50, 51, 52, 53], it is not well known in general how quickly and freely real biological networks \re-wire" themselves under stable selection over large timescales[15, 50, 52, 54, 55] though this process contributes to cryptic variation which is known to play a vital role in adaptation[29], nor is it known the degree to which we would expect features to be conserved despite substantial theory and evidence of genetic drift and conservation at the sequence level[56]. The previous chapter demonstrated that while certain network statistics can be conditionally informative as features to use in inferring patterns of selection in uencing a population's evolution, those statistics are even more indicative of the unique evolutionary history of each population. The next logical step is to investigate to what degree those features drift over time, and if the rate of such drift is in some way aected by selective forces. 50 Preliminary data and analysis are presented and discussed in this chapter, demonstrating both that networks continue to evolve and change neutrally in this simulation, and that the rate and nature of these changes can not only be aected by mutation rates, but by environmental pressures and phenotypic selection as well. The populations discussed in the previous chapters were allowed to evolve for an additional 5; 000 iterations in order to characterize the degree to which the random forest ensembles lose their ability to correctly classify individuals as time passes from the generation used to train the random forest ensembles due to network structure drifting over the intervening generations. 3.2 Methods We used nal population snapshots from the simulations discussed in the previous two chapters to \start" new simulations which eectively pick up where the previous simulations left o, and evolve them for another 5; 000 iterations. All parameters and manner of data collected remain the same as in the rst 5; 000 iterations, resulting in populations that have now evolved for a total of 10; 000 iterations. Similar to chapter two, we trained random forest classiers on network statistics calculated from the 50 ttest individuals in each population at each iteration, but in contrast we calculated aggregate statistics on a per iteration basis rather than the entire run. This allows us to train classiers on \early" iterations to see how well they are able to classify later iterations. We expect that as more iterations occurred between the training set and the test set, the performance of a classier will degrade. We trained additional classiers on subsets of the full set of evolutionary contexts such that each pairwise combination of factors (i.e. environment type, target of selection, 51 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 1379 159 168 94 0.23 299 18 26 17 0.17 243 42 40 35 0.33 cycle 188 1264 236 112 0.30 32 275 34 19 0.24 50 204 60 46 0.43 rand 186 222 1267 125 0.30 23 29 284 24 0.21 45 60 211 44 0.41 rand hom 130 122 106 1442 0.20 21 15 15 309 0.14 19 20 28 293 0.19 Target phenotype hm 1570 61 84 85 0.13 304 17 15 24 0.16 276 37 21 26 0.23 sin 75 1558 78 89 0.13 5 339 9 7 0.058 14 324 11 11 0.10 t3 88 76 1455 181 0.19 9 17 297 37 0.18 16 27 274 43 0.24 t4 83 70 232 1415 0.21 20 19 30 291 0.19 28 22 61 249 0.31 Mutation rate 10 3 2209 189 2 0.080 455 25 0 0.052 453 26 1 0.056 10 4 226 1970 204 0.18 71 397 12 0.17 94 383 3 0.20 10 5 0 124 2276 0.052 0 95 385 0.20 4 183 293 0.39 Recombination rate 10 3 1878 320 202 0.22 413 43 24 0.14 365 73 42 0.24 10 4 297 1711 392 0.29 51 366 63 0.24 76 322 82 0.33 10 5 173 363 1864 0.22 23 59 398 0.17 40 97 343 0.29 Table 3.1: Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of a simulation are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the four classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. point mutation rate, and recombination rate) were considered in order to identify whether higher level trends in \ease" of classication are sensitive to more specic contexts. 3.3 Results Classiers trained on iterations 1; 000 to 5; 000 are reasonably accurate at classifying data from iteration 6; 000, generally meeting or even beating the OOB estimate from the training set though there are a couple exceptions, notably for mutation rate of 10 5 (Table 3.1). This aligns with the results of Chapter 2. At 10; 000 iterations, the classiers all lose some accuracy compared to their predictions of pop- ulation data at 6; 000 iterations. At 10; 000 iterations, there are only three classes that are classied better than the OOB estimate of error generated during training on the rst 5; 000 iterations. They are the rand hom environment, sin target phenotype, and mutation rate of 10 3 . Also of note, 52 recombination rates of 10 3 and 10 4 as well as mutation rate of 10 4 have error less than 5% above the OOB estimate. Further subsetting these data reveals more detailed patterns of how dierent simulation pa- rameters interact to produce distinctive patterns of network structure. The results for training random forest ensembles on all subsets of factors are included in Tables A.1-A.14 at the end of this document. The following subsections are all focused on the error associated with classifying data from the 10; 000th iteration using classiers trained on the rst 5; 000 iterations. 3.3.1 Environments With one exception, the rand hom environment is the most consistently identied at 10; 000 itera- tions regardless of how data is subset. The patterns are less clear for other environment types. When data is subset by target, (Tables A.5-A.8), the most noteworthy result is that when populations evolve toward thesin phenotype (Table A.6), rand hom is not the most readily classied environment and instead is classied with similar accuracy as the cycle and rand environments. Interestingly, with this data subset, the constant environment is the most accurately identied. With the hm target (Table A.5), conversely, the rand hom environment is incorrectly classied about 3% of the time, which is by far the most accurate classication of an environment after 10; 000 iterations for any data subset. The next lowest lie in the 10-15% range in a few cases. Another notable result when subsetting by targets is that the cycle environment is incorrectly classied 18% of the time when selecting for the t3 phenotype (Table A.7) and 41% of the time when selecting for the t4 phenotype (Table A.8), while the other environments are classied at roughly the same accuracy between the two targets. 53 When subsetting by point mutation rate and classifying environments, there are two, related, noteworthy results. First, when mutation rate goes from 10 3 to 10 4 (Tables A.9 and A.10), the error associated with a random forest ensemble's classication for the cycle and rand environments go from about 25% to 50%, while the constant environment goes from 27% to 36%. The ability for a random forest ensemble to identify the rand hom environment, however, stays roughly the same at 12%. At mutation rate of 10 5 (Table A.11), the error rates for classifying the constant, cycle, and rand environments are essentially the same as at a mutation rate of 10 4 , but the error associated with classifying the rand hom environment goes from 12% to 23%. Finally, when subsetting by recombination rate (Tables A.12-A.14), there is essentially no dif- ference between recombination rates of 10 3 and 10 4 for any environment. A rate of 10 5 results in slightly more error while classifying the constant, and the cycle environment being much more error prone. 3.3.2 Targets Broadly, the sin target is always the most readily identied, while targets t3 and t4 are often confused with one another across all treatments. hm is more often confused with sin than either t3 or t4, though this is not always the case. A couple interesting outcomes appear when random forest ensembles are trained on data subset by environment. First, classifying the hm target is much more error prone in the rand environment at 37% (Table A.3) compared to training and classifying on data restricted to other environments, which fall in the range 14-22% (Tables A.1, A.2, A.4). Second, t4 is misclassied more often than t3 across all environments, though when considering data from the cycle and rand environments they are close, and the rates of error for t3 are always around 20% regardless of environment. 54 The results of classifying targets of selection through random forest ensembles on populations subset by mutation rate (Tables A.9-A.11) follow a similar pattern as classifying environments on the same data subsets. hm, t3, and t4 are each classied with similar error with mutation rates 10 4 and 10 5 , which are all much higher than with mutation rate 10 3 . sin on the other hand, is classied with error < 1% when mutation rate is 10 3 , 5% when mutation rate is 10 4 , and 12% when mutation rate is 10 5 . In contrast to environments, when considering data subset by recombination rate, classiers trained on higher rates of recombination tend to be more accurate than those trained on lower rates of recombination (Tables A.12-A.14). 3.3.3 Point Mutation Rates Generally a per base pair mutation rate of 10 3 is the most readily classied at 10; 000 iterations. Also, 10 4 is often classied with similar accuracy to either 10 3 or 10 5 . More specically, when data is subset by environment, random forest ensembles trained on the constant, cycle, and rand environments all perform relatively similarly when classifying mutation rates (Tables A.1-A.3). These forests have less than 5% error when classifying data coming from the 10; 000th iteration of simulations experiencing a mutation rate of 10 3 , 24-30% for a mutation rate of 10 4 , and 34-50% for a mutation rate of 10 5 . However, in the rand hom environment, data coming from populations experiencing mutation rates of both 10 3 and 10 4 have similar error at around 10%, and error of 28% for data coming from populations experiencing a mutation rate of 10 5 . When classiers are trained on data subset by target of selection, targets t3 and t4 (Tables A.7 and A.8) follow a pattern wherein populations which experience a mutation rates of 10 3 are 55 improperly classied 1-3% of the time, 10 4 18% of the time, and 10 5 36-41% of the time. The sin environment impacts the classication of mutation rates (Table A.6) in a manner similar to to the rand hom environment where the error rates for classifying data into mutation rates of 10 3 and 10 4 are very close while classifying 10 5 is much more error prone. Last, populations evolving toward the hm phenotype (Table A.5) produce data where random forest ensembles trained to classify based on mutation rate are very accurate at classifying the 10 3 mutation rate with just under 1% error, while mutation rates of 10 4 and 10 5 are classied with 23% and 28% error respectively. Finally, when data is subset by recombination rate (Tables A.12-A.14), the pattern of higher mutation rates being more readily classied than lower holds, and lower rates of recombination result in higher rates of error for the classiers trained on those data sets. 3.3.4 Recombination Rates While overall 10 4 is the least accurately classied recombination rate, this does not hold for all data subsets considered. When training and classifying on data coming from constant environments (Table A.1), 10 5 and 10 4 per base recombination rates are the least easily classied, while the 10 3 rate is classied with about 10% greater accuracy. The cycle and rand hom environments (Tables A.2 and A.4) produce populations in which rates of 10 3 and 10 5 have approximately equal error; 20% each in cycle and 28% in rand hom. Additionally, when classifying the 10 4 recombination rate, the cycle environment data is misclassied 31% of the time and the rand hom environment data 38% of the time, which is about 10% higher within each environment than the other two rates. In the rand 56 environment (Table A.3), the error associated with classifying rates of 10 3 and 10 4 were similar at around 30%, while the error associated with classifying a rate of 10 5 was 10% lower at 20%. Classiers trained on subsets of data where populations evolve toward the hm,t3 andt4 targets (Tables A.5, A.7 and A.8) behave similarly when classifying recombination rates. The rate 10 4 is the least accurately classied, with 31% error for data sets corresponding to target t3 and 36% for data sets corresponding to hm and t4, while the remaining rates are classied with error around 20% aside from 10 5 for the t4 environment, which has 12% error, and 10 3 forhm which has 26% error. Training and classifying on data from populations which were selected to produce the sin phenotype (Table A.6) classify recombination rates with similar error for each rate, in the range of 23% to 26%. For point mutation rates of 10 4 and 10 5 (Tables A.10 and A.11), higher rates of recombination are more readily classied than lower rates of recombination. When the rate of mutation is 10 3 , the highest rate of recombination is very accurately classied at 3%, while recombination rates of 10 4 and 10 5 are classied with error rates of 26% and 17%. 3.4 Discussion Generally, there is evidence that the most \dicult" evolutionary contexts are the most constrained, and so networks which evolve within these constraints possess features which drift more slowly over time. The rand hom environment requires networks to be robust to direct perturbation in initial conditions for all phenotypes, a feature which is not present in the other environments. The sin phenotype has the most complex curves which require more specic network adaptations to express. The highest mutation rate also introduces more structural perturbation to networks, and so may 57 ultimately be more restrictive in adaptive network structure in order to provide robustness to mutation. These pressures and constraints interact with one another in complex ways. The rand hom environment is generally the most disruptive to the ability of individuals to produce their selected phenotype, and so it appears that the ways in which networks evolve to handle the perturbation of initial conditions is distinct versus networks that do not have to account for that possibility. Yet, when the phenotype a simulation is selecting for is the sin target, the rand hom environment does not cause networks to be constrained in a way that is readily identied by the random forest ensembles trained on that data subset. Conversely, the sin target is correctly classied with high accuracy (< 10% error) when data is subset by the rand hom environment. Whether this is because networks which produce the sin phenotype are inherently more robust to perturbations in initial conditions, the selective benet of further adaptation toward the sin phenotype is signicantly greater than adapting to handle perturbations, or some other phenomena is not clear from these preliminary analyses. In contrast, when populations are selected to produce the hm phenotype, classifying based on environment results in an error rate of 3% for the rand hom environment. This is the only case in which an environment is ever classied with less than 10% error. The hm phenotype has initial conditions dependent on environmental signal, which is not the case for any other phenotype considered here and is, in fact, impossible for individuals in these simulations to achieve. Individuals evolving to produce the other phenotypes an reliably \target" their initial conditions independent of what's going on in the environment by relying on their internal state which cannot be perturbed by the rand hom environment to at least be \close" to the correct trajectory near time 0. This is not the case when selection for the hm phenotype is applied, so presumably further perturbation of initial 58 conditions will result in even stronger selection for networks that are able to quickly converge on the correct trajectory for the current environmental signal regardless of what the starting phenotypic signals are, and the network structures which evolve to accomplish this when selecting for hm are unique in the rand hom environment compared to other environments. These results also show that selecting for phenotypest3 andt4 causes the evolutionary process to produce networks that are in many ways similar between the networks since they are the two target phenotypes most often confused for one another and classiers trained on populations selected to produce those two phenotypes behave overall very similarly. However, there are also two specic circumstances in which these patterns do not hold. First, the one context in which t3 or t4 are each confused for a dierent target, hm, by a classier is when the random forest ensemble is trained on data generated by simulations that experienced the rand hom environment. In this case, most of the error for the selection of sin, t3 and t4 phenotypes result from misclassifying those data as hm. This is despite the fact that sin is the most accurately classied target of selection for that data set, as previously mentioned. One possible explanation is that because the hm target is always experiencing situations where its initial conditions are \o", it develops robustness to perturbations more quickly than populations evolving toward other targets. As a result, the random forest ensemble recognizes network statistics associated with this adaptation much more frequently in the populations being selected to produce the hm phenotype within the rst 5; 000 iterations, while populations experiencing selection for another phenotype experience convergent evolution for similar adaptations. As a result, when they are misclassied, they are mistaken for the hm phenotype. The second circumstance in whicht3 andt4 vary is in how random forests trained on data which experienced selection for one of these two phenotypes classies the cycle environment specically. 59 In this case, random forests trained on populations which experienced selection for t4 had twice as much error when classifying the cycle environment than random forests trained on populations which experienced selection for t3. This combination of expected similarity and occasional, sur- prising divergence between two evolutionary contexts that are only subtly dierent is extremely encouraging as it means that the simulation framework we developed is a useful and powerful tool to directly observe how similar but distinct phenotypic pressures will produce networks with only some convergent features. This will allow us to study and understand what will tend to cause evolutionary trajectories to tip towards convergence or divergence. Regarding mutation and recombination rates, the patterns here, while interesting, are overall more consistent across contexts. Since these two rates are what induce and mix variation in our sim- ulated populations over time, and they are each quantitative rates rather than qualitatively varying factors in the way that environments or targets of selection are, the results must be interpreted in that context. We expected mutation rates to be linearly related to rates of error in classication, though the direction of that relationship is at rst surprising. Since most populations have already reached a relatively stable level of tness by 5; 000 iterations, we expected to see higher rates of mutation be correlated with higher rates of error since mutations are fundamentally where genes change and such change networks. However, the opposite is the case. While we don't yet know why, we have two hypotheses that warrant further investigation. The rst is that in fact networks are continuing to adaptively evolve in ways that are not immediately re ected in tness, and that higher rates of mutation result in populations that more quickly reach a point where network structure reaches some local optimum and that selection is strong enough to prevent signicant drift due to mutations. The fact that classifying mutation 60 rates of 10 4 often have very similar error to either of the other two mutation rates is somewhat supportive of this hypothesis, as in some contexts a per base mutation rate of 10 4 may be high enough to reach a stable local optima within 10; 000 iterations. The second hypothesis is that when mutation rate is high, the networks evolve to be robust to mutational perturbations. Each of these hypotheses can be investigated and tested in future work. On the other hand, we had less of a prior expectation regarding how recombination rates would aect outcomes except when interacting with mutation rates. Since recombination only has a meaningful eect at sites that are heterozygous, the impact of recombination rates will depend heavily on other factors which aect the amount of genetic variation present in the population. The primary and most obvious driver of genetic variation in these simulations is point mutation rate. If there is a lot of variation, high rates of recombination will allow for selection to more eciently act on variants that aect tness, either promoting or reducing their presence and reducing the need for networks to evolve in such a way as to buer against functional genetic variation. This is likely why at the highest rate of mutation, we see the lowest error in classifying each rate of recombination compared to other rates of mutation, and why this is the only case of subsetting by mutation rate in which a recombination rate of 10 5 is more accurately classied than a rate of 10 4 . At lower mutation rates, recombination rates can be lower while still eectively allowing selection to act on individual variants. This hypothesis is supported by the fact that random forest ensembles trained on populations subset by mutation rate show that as mutation rate decreases, more and more of the error in classifying the 10 5 recombination rate moves to misclassifying that rate as 10 3 . As for how environmental signals and specic phenotypic selection aect how readily random forest ensembles trained on the rst 5; 000 iterations are able to classify mutation and recombination 61 rates at 10; 000 iterations, there denitely do seem to be distinct patterns of error accumulation, which indicate that networks evolve dierently with respect to these combination of factors. From the results presented in this chapter, however, it is dicult to say exactly what is driving this variation. 3.4.1 Conclusion Here we have shown through some preliminary results that these simulations provide a means to study distinct outcomes of network evolution for populations under stable selection. We can use such results to test and generate hypotheses regarding circumstances which may be more or less permissive regarding drift in network structure over time, leading to valuable insights when studying the evolution of gene networks in nature. 62 Chapter 4 Perspectives and Future Directions 4.1 Overview Through this work we have shown that there is a place for general evolutionary models and sim- ulations that are inspired by biological processes and are more complex and \realistic" than what currently exist. Yes, adding complexity does make both implementation and interpretation more dicult, but the results presented here are compelling evidence that it is possible to build a model that behaves sensibly at a level where existing theory and results are already informative, while producing results that allow for insights that are not readily elucidated by existing results and theory as well as providing a basis for testable hypotheses as biological data sets and technology become more sophisticated. Here we enumerate and expand upon some possible future directions to take this simulation and the data it produces. 63 4.2 Future \experiments" and analysis First, we discuss in more detail future directions and hypotheses mentioned in the previous chapters. 4.2.1 Previously mentioned In the rst chapter, we noted that most of the adaptation to any given environmental context and phenotypic selection occur well within the rst 1; 000 iterations in all the simulations presented here. Since we only collected population snapshots every 1; 000 iterations, we only have high level summary statistics for simulations during this time and no information regarding what is going on at a genetic or network level to promote the bulk of initial adaptation. Re-running simulations over just this time frame but outputting much more granular data will allow us to better understand how early adaptation to novel selective pressures occurs within our simulations, which may provide insight into how this process unfolds in nature. In the second chapter, we identied evidence that selection imposes some boundaries on poten- tial network congurations and that otherwise evolution will produce features that fall within those boundaries, all while continuing to produce a specic phenotype. We allude that this is possibly an avenue to study cryptic variation, which is the phenomena where a population of organisms which are genetically diverse but produce very similar phenotypes in one environment produce dramat- ically variable phenotypes in another. By running new simulations wherein we take a genetically diverse sample individuals well adapted to a given environment and observing their phenotypic ex- pression in a new environment, we should see an increase in phenotypic variation compared to their \natural" environment. We then could investigate the specic network properties and structures which underly cryptic variation in our simulated data to gain insight into the types of gene network 64 features to look for in real populations to predict the degree or nature of cryptic variation those populations express. Additionally, for all random forest ensembles presented, we did not analyze the importance of the specic network statistics used to train the forests, as mentioned at the end of chapter 2. By exploring these features and their importance in more detail, we will be able to better understand which specic statistics are expected to be useful predictors of certain types of evolutionary forces acting on a population. If these statistics are similarly descriptive for variation in gene networks that have a known evolutionary history, then we may be able to infer something about the circumstances through which variation in gene networks that are more dicult to directly observe evolve. Also mentioned at the end of chapter 2, characterizing specic network structures, such as motifs, their function, and their evolution, is an exciting future direction. This will require analysis at an individual level, but, if executed will provide a more detailed view of specic network evolution at a population level over thouands of generations than has ever been explored before, to our knowledge. The last future direction mentioned in chapter 2 is that these simulations are exible enough to combinatorially test a very wide range of parameters against one another to better understand the complex interplay between environment, phenotypic selection, mutation rates, and recombination rates, not to mention other parameters we did not vary here such as strength of selection. Many existing models of gene network evolution are either incredibly specic to a particular real gene network, or they are much simpler than the simulation framework described here, which preclude their ability to explore as wide a range of phenomena as we can using a single method. Since running this type of combinatorial evolution experiment in reality is much more complicated and dicult than through these simulations, we argue that these simulations (or others of similar complexity) oer a valuable tool to generate comparable data sets across many hundreds or thousands of distinct 65 \treatments" with a number of replicates only restricted by disk space, to allow investigation of pairwise, three-way, four-way, or higher order interactions in a way that is not feasible with any other existing method. In chapter 3, we found evidence that even after populations had reached a stable level of tness, their networks continue to evolve. This is not a surprising result, but what is interesting is that those data suggest that dierent environmental and selective contexts aect how quickly that \neutral" evolution occurs. Readily observing this type of behavior in our simulations indicate that they may be useful as an empirical way to understand how natural selection and genetic drift in the context of various environments interact to produce patterns of evolution and variation in the gene networks of nite populations over time. As mentioned in both chapters 2 and 3, we can use our simulation to study the evolution of robustness to mutational perturbations. By taking an existing population and modifying some number of their genes, we can see the degree to which their networks are aected by a given mutational load. We could then characterize how strongly this robustness correlates with increased mutation rate, as well as strength of selection, to determine how much robustness develops due to non-adaptive forces rewiring gene networks over time as opposed to that robustness providing a selective advantage. Another hypothesis mentioned in both chapters 2 and 3 is that it is possible that while the median and best tness achieved by a population remains stable in most cases after a few thousand iterations, it is possible that these populations are continuing to adapt in ways that are not directly identiable by those tness measures. Some examples of what this could mean is that populations in environments with varying signals, networks are evolving to more eciently handle that variation, or as previously mentioned if robustness to mutation is adaptive, then these populations may be 66 evolving networks that are more robust against mutations. Understanding how and if adaptation beyond strictly improving the calculated tness value of individuals in our simulations is occurring will be very important in understanding whether the accumulation and change in variation over time is due to neutral drift or is due to some form of \cryptic" selection. 4.2.2 Further directions Expanding on the idea of studying cryptic variation, it would be very interesting to allow populations to continue evolving after introducing them to a novel environment. This would let us observe the rate at which an already evolved network adapts to a novel environment, as well as the nature of underlying network changes which facilitate that adaptation. Comparing these populations to populations which have adapted to that environment \from scratch", similar to how initial adaptation occurred for all simulations presented here, will additionally provide insight into how much existing network structure is re-purposed to handle the new environment versus how much the network gets re-wired, both in the short and long term. A similar set of experiments could be performed where instead of (or in addition to) the environment changing, the selected phenotype changes as well. Another exciting direction is to study speciation using our simulations. By only periodically crossing otherwise reproductively isolated populations, we could observe the full process of specia- tion rst hand, as well as test various contexts which may eect the speed and underlying nature of the speciation that is observed. Though there are plenty of examples of species and sub-species in various stages of speciation, the entire process has never been observed from start to nish in natural populations. As such, generating simulated data sets that pass through various points 67 of reproductive compatibility until they are completely incompatible has tremendous potential to provide novel insights into the nature and progression of this fundamental evolutionary process. Some of the features that our simulation, grn-evo, possess that we have not varied or used for the results and analysis presented here are the ability to vary the strengths of selection, population size, proportion of the population replaced each iteration, and chromosome number, as well as a very rough implementation of allowing point mutation and recombination rates to evolve and change in lineages. Incorporating some or all of these into future data sets and experiments will allow for many more relevant and interesting biological phenomena to be explored. 4.3 Future extensions and modications to grn-evo Beyond just generating new data sets and analyzing them, rening and expanding the features and behavior of grn-evo is also possible. Brie y, some of the ideas we have are 1. Add new types of mutations, such as duplications and deletions not directly related to recombi- nation, translocations of segments of DNA, or even chromosomal level changes. Implementing some sort of transposable element may also be a worthwhile avenue. 2. Re-work how genes are represented. At the moment each gene has an identical length, and performs exactly one function. By developing a more modular approach to gene representa- tion, we would be able to study the evolution of gene lengths as well as how genes may evolve to have many or few functions. 3. Implement some way for multiple \species" to exist simultaneously in the same environment, and force them to compete for resources. 4. Allow population sizes to change over the course of a simulation. 68 5. Allow individuals to in uence the environment. These are just a few possibilities. Each of them, aside from re-working gene representations, could be implemented in a way so that the populations the simulation subsequently produces are still comparable to all the data and results presented here. And each of them will allow for even more investigation into open questions in genetics and evolution. 4.4 Conclusion We genuinely believe that we have produced an extremely powerful and versatile tool to study and understand the evolution of gene networks in populations over time. The results presented in previous chapters and future directions outlined here are just the proverbial tip of the iceberg when it comes to what can still be done in silico to study evolutionary systems that are not feasible with existing technology on real populations. As available computational power increases and the cost of data storage plummets, this simulation or one similar to it is almost certain to contribute signicantly to our understanding of evolutionary processes and have immediate impacts in elds from agriculture to medicine to synthetic biology. 69 Reference List [1] H. Allen Orr. The genetic theory of adaptation: A brief history. Nature Reviews Genetics, 6(2):119{127, feb 2005. [2] Michael Lynch. The frailty of adaptive hypotheses for the origins of organismal complexity. Proceedings of the National Academy of Sciences, 104(suppl 1):8597 LP { 8604, may 2007. [3] Joanna Masel and Mark L. Siegal. Robustness: mechanisms and consequences. Trends in Genetics, 25(9):395{403, sep 2009. [4] Pierre Alexis Gros, Herv e Le Nagard, and Olivier Tenaillon. The evolution of epistasis and its links with genetic robustness, complexity and drift in a phenotypic model of adaptation. Genetics, 182(1):277{293, 2009. [5] Joshua L. Payne and Andreas Wagner. The robustness and evolvability of transcription factor binding sites. Science, 343(6173):875{877, 2014. [6] Joanna Masel and Meredith V. Trotter. Robustness and evolvability. Trends in Genetics, 26(9):406{414, sep 2010. [7] Tobias Uller, Armin P Moczek, Richard A Watson, and Kevin N Laland. Developmental Bias and Evolution : A Regulatory Network Perspective. Genetics, 209(August):949{966, 2018. [8] Richelle Sopko, Dongqing Huang, Nicolle Preston, Gordon Chua, Bal azs Papp, Kimberly Kafadar, Mike Snyder, Stephen G. Oliver, Martha Cyert, Timothy R. Hughes, Charles Boone, and Brenda Andrews. Mapping pathways and phenotypes by systematic gene overexpression. Molecular Cell, 21(3):319{330, 2006. [9] Santiago F. Elena and Richard E. Lenski. Evolution experiments with microorganisms: The dynamics and genetic bases of adaptation. Nature Reviews Genetics, 4(6):457{469, jun 2003. [10] Roded Sharan and Trey Ideker. Modeling cellular machinery through biological network com- parison. Nature Biotechnology, 24(4):427{433, apr 2006. [11] Guy Karlebach and Ron Shamir. Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology, 9(10):770{780, oct 2008. [12] J. Arjan G M De Visser and Joachim Krug. Empirical tness landscapes and the predictability of evolution. Nature Reviews Genetics, 15(7):480{490, 2014. 70 [13] Ella Preger-Ben Noon, Gonzalo Sabar s, Daniela M. Ortiz, Jonathan Sager, Anna Liebowitz, David L. Stern, and Nicol as Frankel. Comprehensive Analysis of a cis-Regulatory Region Reveals Pleiotropy in Enhancer Function. Cell Reports, 22(11):3021{3031, 2018. [14] Maria Angels De Luis Balaguer, Adam P. Fisher, Natalie M. Clark, Maria Guadalupe Fernandez-Espinosa, Barbara K. M oller, Dolf Weijers, Jan U. Lohmann, Cranos Williams, Oscar Lorenzo, and Rosangela Sozzani. Predicting gene regulatory networks by combining spatial and temporal gene expression data in Arabidopsis root stem cells. Proceedings of the National Academy of Sciences of the United States of America, 114(36):E7632|-E7640, sep 2017. [15] Mark G.F. F Sun and Philip M. Kim. Evolution of biological interaction networks: from models to real data. Genome Biology, 12(12), dec 2011. [16] Charles Ofria and Claus O. Wilke. Avida: A Software Platform for Research in Computational Evolutionary Biology. Articial Life, 10(2):191{229, mar 2004. [17] Guillaume Martin, Santiago F. Elena, and Thomas Lenormand. Distributions of epistasis in microbes t predictions from a tness landscape model. Nature Genetics, 39(4):555{560, 2007. [18] Isaac Salazar-Ciudad and Jukka Jernvall. A computational model of teeth and the develop- mental origins of morphological variation. Nature, 464(7288):583{586, mar 2010. [19] Juan R. Perilla, Boon Chong Goh, C. Keith Cassidy, Bo Liu, Rafael C. Bernardi, Till Rudack, Hang Yu, Zhe Wu, and Klaus Schulten. Molecular dynamics simulations of large macromolec- ular complexes. Current Opinion in Structural Biology, 31:64{74, apr 2015. [20] Antonios Kioukis and Pavlos Pavlidis. Evolution of gene regulatory networks by means of selection and random genetic drift. bioRxiv, page 449645, 2018. [21] Jerey E. Barrick and Richard E. Lenski. Genome dynamics during experimental evolution. Nature Reviews Genetics, 14(12):827{839, dec 2013. [22] Michael Lynch. The evolution of genetic networks by non-adaptive processes. Nature Reviews Genetics, 8(10):803{813, 2007. [23] Ulises Rosas, Nick H. Barton, Lucy Copsey, Pierre Barbier de Reuille, and Enrico Coen. Cryptic variation between species and the basis of hybrid performance. PLoS Biology, 8(7), 2010. [24] Watal M. Iwasaki, Masaki E. Tsuda, and Masakado Kawata. Genetic and environmental factors aecting cryptic variations in gene regulatory networks. BMC Evolutionary Biology, 13(1), 2013. [25] Zsolt Boldogk oi. Gene network polymorphism is the raw material of natural selection: The selsh gene network hypothesis. Journal of Molecular Evolution, 59(3):340{357, sep 2004. [26] Joshua S Schiman and Peter L Ralph. System drift and speciation. bioRxiv, pages 1{24, 2017. [27] R. Lande. Adaptation to an extraordinary environment by evolution of phenotypic plasticity and genetic assimilation. Journal of Evolutionary Biology, 22(7):1435{1446, jul 2009. 71 [28] Marjon G.J. J De Vos, Alexandre Dawid, Vanda Sunderlikova, and Sander J. Tans. Break- ing evolutionary constraint with a tradeo ratchet. Proceedings of the National Academy of Sciences of the United States of America, 112(48):14906{14911, 2015. [29] Annalise B. Paaby and Matthew V. Rockman. Cryptic genetic variation: Evolution's hidden substrate. Nature Reviews Genetics, 15(4):247{258, 2014. [30] Hideki Innan and Fyodor Kondrashov. The evolution of gene duplications: Classifying and distinguishing between models. Nature Reviews Genetics, 11(2):97{108, 2010. [31] Sewall Wright. Evolution in Mendelian Populations. Genetics, 16(2):97{159, mar 1931. [32] Ronald Aylmer Fisher. The genetical theory of natural selection. Clarendon Press, Oxford, 1930. [33] P. A.P. Moran. Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society, 54(1):60{71, 1958. [34] Arend Hintze and Christoph Adami. Evolution of complex modular biological networks. PLoS Computational Biology, 4(2), 2008. [35] Miguel A. Fortuna, Luis Zaman, Charles Ofria, and Andreas Wagner. The genotype-phenotype map of an evolving digital organism. PLoS Computational Biology, 13(2):1{20, 2017. [36] Thomas Labar and Christoph Adami. Evolution of drift robustness in small populations. Nature Communications, 8(1), dec 2017. [37] Aditi Gupta, Thomas LaBar, Michael Miyagi, and Christoph Adami. Evolution of Genome Size in Asexual Digital Organisms. Scientic Reports, 6, may 2016. [38] Pieter Meysman, Julio Collado-Vides, Enrique Morett, Roberto Viola, Kristof Engelen, and Kris Laukens. Structural properties of prokaryotic promoter regions correlate with functional features. PLoS ONE, 9(2), 2014. [39] Patricia J. Wittkopp and Gizem Kalay. Cis-regulatory elements: Molecular mechanisms and evolutionary processes underlying divergence. Nature Reviews Genetics, 13(1):59{69, 2012. [40] Hao Li and Alexander D. Johnson. Evolution of transcription networks-lessons from yeasts. Current Biology, 20(17), sep 2010. [41] Albert L aszl o Barab asi and R eka Albert. Emergence of scaling in random networks. Science, 286(5439):509{512, oct 1999. [42] Raya Khanin and Ernst Wit. How scale-free are biological networks. Journal of Computational Biology, 13(3):810{818, apr 2006. [43] Marko Gosak, Rene Markovi c, Jurij Dolen sek, Marjan Slak Rupnik, Marko Marhl, Andra z Sto zer, and Matja z Perc. Network science of biological systems at dierent scales: A review. Physics of Life Reviews, 24:118{135, mar 2018. [44] Berta Verd, Nicholas AM Monk, and Johannes Jaeger. Modularity, criticality, and evolvability of a developmental gene regulatory network. eLife, 8, jun 2019. 72 [45] Joshua L. Payne and Andreas Wagner. Function does not follow form in gene regulatory circuits. Scientic Reports, 5, aug 2015. [46] Mark Siegal and Jun-Yi Leu. On the Nature and Evolutionary Impact of Phenotypic Ro- bustness Mechanisms. Annual Review of Ecology, Evolution, and Systematics, 5(5):395{404, 2017. [47] A Hagberg, P Swart, and DS Chult. Exploring network structure, dynamics, and function using NetworkX. 2008. [48] R Core Team. R: A Language and Environment for Statistical Computing, 2018. [49] Andy Liaw and Matthew Wiener. Classication and Regression by RandomForest. Technical report, 2001. [50] Jonas Defoort, Yves Van De Peer, and Vanessa Vermeirssen. Function, dynamics and evolution of network motif modules in integrated gene regulatory networks of worm and plant. Nucleic Acids Research, 46(13):6480{6503, 2018. [51] Rebekah M. Charney, Kitt D. Paraiso, Ira L. Blitz, and Ken W.Y. Cho. A gene regulatory program controlling early Xenopus mesendoderm formation: Network conservation and motifs, jun 2017. [52] Peyman Zarrineh, Aminael S anchez-Rodr guez, Nazanin Hosseinkhan, Zahra Narimani, Kath- leen Marchal, and Ali Masoudi-Nejad. Genome-Scale Co-Expression Network Comparison across Escherichia coli and Salmonella enterica Serovar Typhimurium Reveals Signicant Con- servation at the Regulon Level of Local Regulators Despite Their Dissimilar Lifestyles. PLoS ONE, 9(8):e102871, aug 2014. [53] Michael C. Oldham, Steve Horvath, and Daniel H. Geschwind. Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proceedings of the National Academy of Sciences of the United States of America, 103(47):17973{17978, nov 2006. [54] Ricard Albalat and Cristian Ca~ nestro. Evolution by gene loss, jul 2016. [55] Fyodor A. Kondrashov. Gene duplication as a mechanism of genomic adaptation to a changing environment. Proceedings of the Royal Society B: Biological Sciences, 279(1749):5048{5057, 2012. [56] Brian Charlesworth and Deborah Charlesworth. Neutral Variation in the Context of Selection. Molecular Biology and Evolution, 35(6):1359{1361, jun 2018. 73 Appendix A Tables 74 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Target hm 419 13 10 8 0.069 80 4 5 1 0.11 70 11 6 3 0.22 sin 25 384 16 25 0.15 2 83 3 2 0.078 4 79 3 4 0.12 t3 20 8 387 35 0.14 2 0 80 8 0.11 4 3 72 11 0.2 t4 15 14 43 378 0.16 2 2 4 82 0.089 3 8 13 66 0.27 Mutation rate 10 3 545 54 1 0.092 115 4 1 0.042 115 5 0 0.042 10 4 51 496 53 0.17 21 95 4 0.21 36 84 0 0.3 10 5 0 34 566 0.057 0 31 89 0.26 0 61 59 0.51 Recombination rate 10 3 496 64 40 0.17 104 12 4 0.13 94 15 11 0.22 10 4 53 473 74 0.21 6 105 9 0.12 17 87 16 0.28 10 5 43 61 496 0.17 8 12 100 0.17 11 26 83 0.31 Table A.1: (Constant environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with constant environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Target hm 401 11 11 27 0.11 78 4 1 7 0.13 73 6 5 6 0.19 sin 13 406 15 16 0.098 0 87 1 2 0.033 0 84 3 3 0.067 t3 13 21 381 35 0.15 0 8 75 7 0.17 2 7 71 10 0.21 t4 21 15 39 375 0.17 1 3 7 79 0.12 5 5 11 69 0.23 Mutation rate 10 3 548 52 0 0.087 117 3 0 0.025 116 4 0 0.033 10 4 48 491 61 0.18 18 99 3 0.17 30 89 1 0.26 10 5 0 34 566 0.057 0 17 103 0.14 1 40 79 0.34 Recombination rate 10 3 495 60 45 0.17 108 6 6 0.1 96 13 11 0.2 10 4 61 461 78 0.23 11 96 13 0.2 17 83 20 0.31 10 5 28 77 495 0.17 5 11 104 0.13 7 20 93 0.23 Table A.2: (Cycle environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with cycle environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 75 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Target hm 390 12 23 25 0.13 74 2 5 9 0.18 57 16 5 12 0.37 sin 17 380 30 23 0.16 0 86 3 1 0.044 1 82 3 4 0.089 t3 13 22 368 47 0.18 2 5 77 6 0.14 2 9 70 9 0.22 t4 18 13 51 368 0.18 2 4 6 78 0.13 6 5 15 64 0.29 Mutation rate 10 3 550 49 1 0.083 112 8 0 0.067 115 5 0 0.042 10 4 53 487 60 0.19 21 93 6 0.23 27 91 2 0.24 10 5 0 32 568 0.053 0 21 99 0.17 2 47 71 0.41 Recombination rate 10 3 499 58 43 0.17 107 9 4 0.11 87 21 12 0.28 10 4 65 466 69 0.22 8 94 18 0.22 7 84 29 0.3 10 5 37 55 508 0.15 3 7 110 0.083 7 16 97 0.19 Table A.3: (Rand environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with rand environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Target hm 401 26 14 9 0.11 77 9 1 3 0.14 77 9 4 0 0.14 sin 18 409 6 17 0.091 2 88 0 0 0.022 7 82 1 0 0.089 t3 31 13 355 51 0.21 6 4 68 12 0.24 10 6 68 6 0.24 t4 23 10 63 354 0.21 11 4 13 62 0.31 16 4 13 57 0.37 Mutation rate 10 3 555 42 3 0.075 110 10 0 0.083 108 11 1 0.1 10 4 58 498 44 0.17 10 106 4 0.12 10 109 1 0.092 10 5 0 33 567 0.055 0 25 95 0.21 1 33 86 0.28 Recombination rate 10 3 426 99 75 0.29 92 15 13 0.23 87 19 14 0.28 10 4 86 405 109 0.33 16 93 11 0.23 23 75 22 0.38 10 5 42 106 452 0.25 5 16 99 0.17 10 23 87 0.28 Table A.4: (Rand hom environment) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with rand hom environment are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 76 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 361 34 35 20 0.2 75 2 5 8 0.17 53 11 9 17 0.41 cycle 55 323 61 11 0.28 12 62 10 6 0.31 13 50 14 13 0.44 rand 45 75 307 23 0.32 5 10 65 10 0.28 11 19 46 14 0.49 rand hom 31 16 34 369 0.18 3 1 4 82 0.089 2 1 0 87 0.033 Mutation rate 10 3 547 52 1 0.088 113 7 0 0.058 110 10 0 0.083 10 4 45 507 48 0.15 20 94 6 0.22 26 92 2 0.23 10 5 0 36 564 0.06 0 25 95 0.21 0 33 87 0.28 Recombination rate 10 3 478 68 54 0.2 104 10 6 0.13 89 14 17 0.26 10 4 62 454 84 0.24 15 91 14 0.24 29 77 14 0.36 10 5 32 80 488 0.19 4 12 104 0.13 10 13 97 0.19 Table A.5: (hm target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting hm are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 329 45 33 43 0.27 72 6 7 5 0.2 64 11 10 5 0.29 cycle 42 309 52 47 0.31 7 74 4 5 0.18 16 57 8 9 0.37 rand 41 59 298 52 0.34 6 12 68 4 0.24 11 12 55 12 0.39 rand hom 50 39 42 319 0.29 10 6 4 70 0.22 10 20 5 55 0.39 Mutation rate 10 3 544 52 4 0.093 112 8 0 0.067 107 12 1 0.11 10 4 46 503 51 0.16 7 111 2 0.075 17 102 1 0.15 10 5 0 19 581 0.032 2 21 97 0.19 4 58 58 0.52 Recombination rate 10 3 499 55 46 0.17 102 8 10 0.15 89 18 13 0.26 10 4 57 448 95 0.25 11 103 6 0.14 9 93 18 0.23 10 5 42 66 492 0.18 4 12 104 0.13 4 25 91 0.24 Table A.6: (sin target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting sin are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 77 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 375 29 24 22 0.17 74 3 10 3 0.18 67 5 14 4 0.26 cycle 36 339 53 22 0.25 3 80 5 2 0.11 2 74 11 3 0.18 rand 48 43 347 12 0.23 2 7 79 2 0.12 6 18 63 3 0.3 rand hom 15 16 14 405 0.1 1 6 2 81 0.1 1 7 4 78 0.13 Mutation rate 10 3 559 41 0 0.068 115 4 1 0.042 119 1 0 0.0083 10 4 50 491 59 0.18 25 90 5 0.25 20 98 2 0.18 10 5 0 38 562 0.063 0 30 90 0.25 0 49 71 0.41 Recombination rate 10 3 477 79 44 0.2 108 5 7 0.1 93 14 13 0.23 10 4 74 444 82 0.26 9 98 13 0.18 11 83 26 0.31 10 5 63 77 460 0.23 7 10 103 0.14 11 14 95 0.21 Table A.7: (t3 target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting t3 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 380 29 27 14 0.16 79 3 8 0 0.12 66 7 15 2 0.27 cycle 41 349 43 17 0.22 7 74 6 3 0.18 16 53 14 7 0.41 rand 38 38 357 17 0.21 7 2 77 4 0.14 12 10 59 9 0.34 rand hom 14 10 18 408 0.093 1 3 2 84 0.067 3 4 3 80 0.11 Mutation rate 10 3 559 41 0 0.068 119 1 0 0.0083 117 3 0 0.025 10 4 40 497 63 0.17 12 105 3 0.12 22 98 0 0.18 10 5 0 34 566 0.057 0 18 102 0.15 1 42 77 0.36 Recombination rate 10 3 501 66 33 0.17 109 7 4 0.092 97 9 14 0.19 10 4 58 461 81 0.23 9 99 12 0.17 17 77 26 0.36 10 5 49 59 492 0.18 3 9 108 0.1 4 11 105 0.12 Table A.8: (t4 target) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations targetting t4 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 78 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 449 55 62 34 0.25 96 6 14 4 0.2 88 14 12 6 0.27 cycle 57 437 81 25 0.27 7 101 6 6 0.16 12 92 10 6 0.23 rand 76 87 394 43 0.34 14 11 91 4 0.24 12 13 91 4 0.24 rand hom 26 16 24 534 0.11 3 4 3 110 0.083 5 6 4 105 0.12 Target phenotype hm 554 8 19 19 0.077 110 3 4 3 0.083 101 4 7 8 0.16 sin 5 573 7 15 0.045 1 119 0 0 0.0083 1 119 0 0 0.0083 t3 23 8 474 95 0.21 3 1 107 9 0.11 4 1 100 15 0.17 t4 26 13 96 465 0.23 7 2 14 97 0.19 4 3 17 96 0.2 Recombination rate 10 3 656 119 25 0.18 142 18 0 0.11 155 2 3 0.031 10 4 90 593 117 0.26 23 123 14 0.23 24 119 17 0.26 10 5 12 124 664 0.17 0 13 147 0.081 2 26 132 0.17 Table A.9: (Mutation rate 10 3 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with point mutation rate 10 3 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 452 49 50 49 0.25 96 3 13 8 0.2 77 17 10 16 0.36 cycle 59 429 70 42 0.28 7 96 7 10 0.2 12 60 18 30 0.5 rand 46 73 446 35 0.26 4 12 96 8 0.2 14 24 61 21 0.49 rand hom 40 41 38 481 0.2 7 5 6 102 0.15 6 4 5 105 0.12 Target phenotype hm 511 17 30 42 0.15 99 9 6 6 0.17 81 15 12 12 0.33 sin 34 494 40 32 0.18 1 115 2 2 0.042 3 114 1 2 0.05 t3 20 35 461 84 0.23 5 8 93 14 0.23 9 18 76 17 0.37 t4 39 25 73 463 0.23 2 9 6 103 0.14 13 12 26 69 0.42 Recombination rate 10 3 630 88 82 0.21 144 15 1 0.1 134 24 2 0.16 10 4 148 527 125 0.34 27 122 11 0.24 35 116 9 0.28 10 5 63 106 631 0.21 9 32 119 0.26 12 53 95 0.41 Table A.10: (Mutation rate 10 4 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with point mutation rate 10 4 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 79 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 505 37 38 20 0.16 96 8 6 10 0.2 75 9 16 20 0.38 cycle 48 422 79 51 0.3 9 93 10 8 0.23 23 56 21 20 0.53 rand 66 47 448 39 0.25 7 3 102 8 0.15 22 16 62 20 0.48 rand hom 44 53 44 459 0.23 7 4 9 100 0.17 5 8 14 93 0.23 Target phenotype hm 520 32 21 27 0.13 98 10 5 7 0.18 88 22 6 4 0.27 sin 36 506 26 32 0.16 2 114 2 2 0.05 6 106 5 3 0.12 t3 42 30 492 36 0.18 3 9 95 13 0.21 5 21 83 11 0.31 t4 21 42 29 508 0.15 5 6 9 100 0.17 9 23 14 74 0.38 Recombination rate 10 3 602 98 100 0.25 143 10 7 0.11 121 25 14 0.24 10 4 125 538 137 0.33 25 126 9 0.21 50 97 13 0.39 10 5 104 107 589 0.26 27 25 108 0.33 51 40 69 0.57 Table A.11: (Mutation rate 10 5 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with point mutation rate 10 5 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 493 40 44 23 0.18 103 6 9 2 0.14 81 13 17 9 0.33 cycle 55 445 71 29 0.26 7 96 12 5 0.2 16 88 10 6 0.27 rand 48 70 453 29 0.24 8 9 98 5 0.18 16 19 73 12 0.39 rand hom 32 37 34 497 0.17 6 5 2 107 0.11 5 3 5 107 0.11 Target phenotype hm 539 23 17 21 0.1 113 1 3 3 0.058 100 7 7 6 0.17 sin 23 539 18 20 0.1 3 116 1 0 0.033 5 111 3 1 0.075 t3 22 12 515 51 0.14 4 1 108 7 0.1 3 6 100 11 0.17 t4 23 20 56 501 0.17 3 5 7 105 0.12 8 8 16 88 0.27 Mutation rate 10 3 738 59 3 0.077 147 13 0 0.081 152 7 1 0.05 10 4 63 692 45 0.14 19 139 2 0.13 21 139 0 0.13 10 5 0 34 766 0.043 0 31 129 0.19 4 55 101 0.37 Table A.12: (Recombination rate 10 3 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with recombination rate 10 3 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 80 Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 483 41 48 28 0.2 101 9 6 4 0.16 85 12 14 9 0.29 cycle 45 461 64 30 0.23 6 97 9 8 0.19 6 84 12 18 0.3 rand 41 67 453 39 0.24 6 15 95 4 0.21 11 21 78 10 0.35 rand hom 42 47 44 467 0.22 4 0 5 111 0.075 9 3 4 104 0.13 Target phenotype hm 526 28 12 34 0.12 104 6 3 7 0.13 94 10 6 10 0.22 sin 31 538 18 13 0.1 0 114 3 3 0.05 2 112 3 3 0.067 t3 30 19 491 60 0.18 7 6 99 8 0.17 9 8 89 14 0.26 t4 25 17 60 498 0.17 5 1 8 106 0.12 3 6 19 92 0.23 Mutation rate 10 3 735 64 1 0.081 152 8 0 0.05 152 8 0 0.05 10 4 61 677 62 0.15 26 131 3 0.18 36 124 0 0.23 10 5 0 35 765 0.044 0 35 125 0.22 0 62 98 0.39 Table A.13: (Recombination rate 10 4 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with recombination rate 10 4 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. Iterations <= 5; 000 (Training set OOB estimate) 6; 000 10; 000 Environment constant 459 41 56 44 0.23 90 8 15 7 0.25 77 15 16 12 0.36 cycle 69 426 68 37 0.29 8 92 11 9 0.23 17 63 27 13 0.47 rand 62 59 426 53 0.29 8 7 95 10 0.21 12 13 79 16 0.34 rand hom 49 44 35 472 0.21 10 11 7 92 0.23 3 10 10 97 0.19 Target phenotype hm 523 16 41 20 0.13 98 7 7 8 0.18 81 25 4 10 0.33 sin 35 492 39 34 0.18 2 106 6 6 0.12 6 103 4 7 0.14 t3 43 29 458 70 0.24 4 7 95 14 0.21 5 15 83 17 0.31 t4 44 34 87 435 0.28 9 11 14 86 0.28 12 15 16 77 0.36 Mutation rate 10 3 735 65 0 0.081 157 3 0 0.019 157 3 0 0.019 10 4 57 665 78 0.17 22 131 7 0.18 38 121 1 0.24 10 5 0 61 739 0.076 0 31 129 0.19 0 69 91 0.43 Table A.14: (Recombination rate 10 5 ) Confusion matrices when random forest ensembles trained on the rst 5; 000 iterations of simulations with recombination rate 10 5 are used to classify data from later iterations. The factor and classes used to train a classier is shown on the left. Each of the three classiers has three confusion matrices shown. First is the out-of-bag estimates of error generated while training. The next two show the results of validating population data from iterations 6; 000 and 10; 000 against the model. The numbers to the right of the dashed lines for each confusion matrix represent the error associated with classifying observations coming from the class in the given row. 81 Appendix B Figures 82 Figure B.1: Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 83 Figure B.2: Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 84 Figure B.3: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 85 Figure B.4: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 86 Figure B.5: Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 87 Figure B.6: Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by mutation rate and recom- bination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 88 Figure B.7: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recom- bination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 89 Figure B.8: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 90 Figure B.9: Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 91 Figure B.10: Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 92 Figure B.11: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 93 Figure B.12: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 94 Figure B.13: Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 95 Figure B.14: Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 96 Figure B.15: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 97 Figure B.16: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 98 Figure B.17: Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 99 Figure B.18: Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 100 Figure B.19: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 101 Figure B.20: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 102 Figure B.21: Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 103 Figure B.22: Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 104 Figure B.23: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 105 Figure B.24: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 106 Figure B.25: Minimum distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 107 Figure B.26: Minimum distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 108 Figure B.27: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 109 Figure B.28: Minimum distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 110 Figure B.29: Average distance from the target phenotype over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 111 Figure B.30: Average distance from the target phenotype over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 112 Figure B.31: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 113 Figure B.32: Average distance from the target phenotype over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 114 Figure B.33: Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 115 Figure B.34: Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 116 Figure B.35: Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 117 Figure B.36: Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 118 Figure B.37: Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 119 Figure B.38: Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 120 Figure B.39: Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 121 Figure B.40: Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 122 Figure B.41: Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 123 Figure B.42: Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 124 Figure B.43: Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 125 Figure B.44: Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 126 Figure B.45: Maximum chromosome length over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 127 Figure B.46: Maximum chromosome length over 5,000 iterations for simulations with a cycling environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 128 Figure B.47: Maximum chromosome length over 5,000 iterations for simulations with a random environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 129 Figure B.48: Maximum chromosome length over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 130 Figure B.49: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotypet3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 131 Figure B.50: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 132 Figure B.51: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 133 Figure B.52: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t3. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 134 Figure B.53: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotypet4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 135 Figure B.54: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 136 Figure B.55: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 137 Figure B.56: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype t4. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 138 Figure B.57: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 139 Figure B.58: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotypesin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 140 Figure B.59: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotypesin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 141 Figure B.60: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype sin. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 142 Figure B.61: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a constant environment and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 143 Figure B.62: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a cycling environment and selection for phenotypehm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 144 Figure B.63: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment and selection for phenotypehm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 145 Figure B.64: Number of unique genes possessed by the ttest individual over 5,000 iterations for simulations with a random environment that also perturbs phenotypic values and selection for phenotype hm. Data is subset by mutation rate and recombination rate. Each subplot has ten replicate runs, where each run is a dierent color line and the black line is the mean across all runs. 146
Abstract (if available)
Abstract
Studying gene networks and their evolution is hard. Inferring real networks is a daunting task that requires substantial data and sophisticated statistical methods. As such, our ability to study the evolution of biological networks is limited. Computational methods, then, are a compelling avenue explore and understand how gene networks might evolve. Most such methods are eitherentirely analytic or are constructed with a lot of assumptions aimed at making their results easier to characterize. In this work I introduce a new simulation framework called grn-evo that is more sophisticated than existing methods, and argue that the complexity of the resulting simulations is an asset that can be used to more deeply explore open questions in network evolution than existing methods. I additionally discuss possible future directions to take the analysis of the data sets presented, as well as further extensions to grn-evo itself.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
The evolution of gene regulatory networks
PDF
Long term evolution of gene duplicates in arabidopsis polyploids
PDF
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Essays on bioinformatics and social network analysis: statistical and computational methods for complex systems
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Characterizing brain aging with neuroimaging, health, and genetic data
PDF
Inferring gene flow across spatial landscapes
PDF
Applications and improvements of background adjusted alignment-free dissimilarity measures
PDF
Application of machine learning methods in genomic data analysis
PDF
Investigating hematopoietic stem cells via multiscale modeling, single-cell genomics, and gene regulatory network inference
PDF
Mapping epigenetic and epistatic components of heritability in natural population
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
PDF
Bayesian analysis of transcriptomic and genomic data
PDF
Plant genome wide association studies and improvement of the linear mixed model by applying the weighted relationship matrix
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Exploring the genetic basis of complex traits
PDF
Patterning and growth in biological systems: a computational exploration of biological robustness, epithelial growth dynamics and development
Asset Metadata
Creator
Conow, Christopher Ansel
(author)
Core Title
Investigating the evolution of gene networks through simulated populations
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
02/03/2020
Defense Date
12/04/2019
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
adaptation,digital organisms,Evolution,gene networks,gene regulatory networks,genetics,network statistics,networks,OAI-PMH Harvest,population genetics,simulation
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nuzhdin, Sergey (
committee chair
), Haas, Stephan (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
cconow@gmail.com,conow@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-264599
Unique identifier
UC11674755
Identifier
etd-ConowChris-8138.pdf (filename),usctheses-c89-264599 (legacy record id)
Legacy Identifier
etd-ConowChris-8138.pdf
Dmrecord
264599
Document Type
Dissertation
Rights
Conow, Christopher Ansel
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
adaptation
digital organisms
gene networks
gene regulatory networks
genetics
network statistics
networks
population genetics
simulation