Title: EVOLUTIONARY GENOMIC ANALYSES in HETEROGENEOUS POPULATIONS of NON-MODEL and MODEL ORGANISMS
Author: Hosseinali Asgharian
Advisor: Sergey Nuzhdin
Program: Molecular Biology
Degree: Doctor of Philosophy (PhD)
Institution: Faculty of The USC Graduate School
Degree conferral date: December 2016
Table of contents
Preface
Chapter 1. Transcriptomic analysis of male-feminizing Wolbachia infection in the leafhopper Zyginidia pullula
Chapter 2. Evolutionary genomics of the house mosquito Culex pipiens: Population structure and recent adaptive evolution
Chapter 3. Toward high resolution population genomics of archaeological samples: A review
Chapter 4. Inferring the effective number of stem cells, somatic mutation rate and cellular dynamics of regeneration in asexual Planaria
Preface
Population genetics was conceived as a theory-rich data-deficient field by the groundbreaking
works of Fisher, Haldane and Wright. Crow and Kimura’s book “An Introduction to
Population Genetics Theory” (1970), a textbook through which several generations of
population geneticists were trained, or even the more recent “Evolutionary Genetics” by
Maynard Smith (1998), exemplifies the flavor of old-style population genetics: rife with
rigorous mathematical treatment of intricate and sometimes highly ecologically-specialized
scenarios of evolution, but using the charmingly primitive beads-on-a-string model of genes and
often assuming a one-gene-one-character correspondence. The challenge back then was finding
real data to test the predictions made by all those interesting and complicated models. But
molecular data has been gradually sneaking into population genetics making it more
biologically coherent and mechanistically relevant. For that is the nature and appeal of
reductionist science in general: breaking down the phenomena at a higher level of organization
to mechanisms acting at the lower level, be it explaining chemical properties by physical laws or
connecting psychological observation to neuronal activities. Connecting population-level
behavior of living systems to their organismal, cellular and molecular properties, too, has been
an effort in that general direction. Examples are plentiful. Discovery of the molecular structure
of DNA and the details of its replication process facilitated understanding of what mutations
really are, how they may be caused and what factors influence their rate. Availability of a
number of protein sequences led to the emergence of the revolutionary neutral theory, the
concept of molecular clock and subsequently the idea of molecular phylogeny. The most recent
game-changer has been the storm of technological advances instigated by the Human Genome
Project which unleashed an avalanche of data of unprecedented volume. The challenge has
now reversed: novel theories and new methods are required to explore, summarize, visualize,
analyze and interpret the fast-growing volumes of genomic and other -omic data. In light of this,
several statistical methods initially developed for other purposes found uses in genetic data
analysis; for example, principal component analysis (PCA) was invented as a method of
dimensionality reduction in 1901 by Karl Pearson but was first used for the analysis of
population structure by Cavalli-Sforza in 1978 and later gained more attention through
Patterson’s 2006 paper. On the other hand, biological problems have inspired development of
mathematical methods that are later utilized by other fields as well. Study of emergent
properties of systems or finding efficient algorithms for data mining are such examples. In Joel
Cohen’s words: “Mathematics Is Biology’s Next Microscope, Only Better; Biology Is
Mathematics’ Next Physics, Only Better.”
I have focused my research during PhD training on finding novel statistical methods to analyze
-omic data based on population and evolutionary genetic theory. I participated in multiple
projects where I designed and executed the main data analysis tasks while field work, lab work
and initial data pre-processing (e.g. QC and read mapping to call SNPs or calculate raw gene
expression values) were carried out by other members of the team. That is the niche I have
chosen for my future career. In order to equip myself for this role, I took several extra courses
and completed a Master’s in Biostatistics parallel to my PhD in Molecular Biology. This thesis
has been formatted into four chapters each corresponding to one of my projects. The findings
of two of the research projects have been already published. A review paper I contributed to
has been accepted for publication. A manuscript has been submitted presenting the findings of
the last project. In the following paragraphs I give a brief account of each project and give the
respective lists of collaborators.
Chapter 1. Transcriptomic analysis of male-feminizing Wolbachia infection in the leafhopper
Zyginidia pullula. Wolbachia infection is a uniquely curious evolutionary case. It infects
~60-75% of all insect species, many crustaceans and several worm species as an obligate
intracellular symbiont; transmits only through the eggs from mother to offspring; and interferes
with the most basic instrument of continuity of life: reproduction. The phenotypic effect of the
infection varies by Wolbachia strain and host species. The four main phenotypes include
cytoplasmic incompatibility, male feminization, male killing and thelytokous parthenogenesis.
The latter three distort the progeny sex ratio in a female-biased direction and all four ensure
higher rates of bacterial transmission to subsequent generations at the expense of reducing the
host population’s reproductive capacity. It is appealing to find the direct molecular targets of
Wolbachia and many research projects so far have focused on finding the target within the sex
determination pathway.
We started our project with this question: In male-feminizing systems, where does Wolbachia
arrive in the pathways of sexual determination or differentiation and switch the developmental
path from male to female? We compared transcriptomes of infected and uninfected
chromosomal males and females of the leafhopper Zyginidia pullula. Only uninfected
chromosomal males are phenotypically male, whereas the three other groups develop into
functional females. We managed to identify certain components of the sex determination
pathway through homology search against fruit fly and pea aphid reference genomes, but were
not able to pinpoint the Wolbachia entry point. Then we asked a more fundamental question:
Does Wolbachia mainly affect the sex determination process? Through principal component
analysis of transcriptomes of the four groups (infected and uninfected chromosomal males and
females), we were able to show that, surprisingly, this was not the case. The transcriptomes
clustered not according to sexual phenotype (3 phenotypic females vs. 1 phenotypic male), but
based on infection status (2 uninfected samples vs. 2 infected samples). This meant that the
effect of Wolbachia infection at transcriptomic level was widespread and by no means limited
to sexual development and suggested that the search for Wolbachia target molecules should go
beyond the components of sex determination pathway. The findings of this study were
presented in a workshop and later on published in a special issue of Frontiers in Microbiology
on microbial symbioses (the paper is attached as a .pdf file to the electronic version of the
thesis).
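The clustering logic described above can be illustrated with a minimal sketch. It assumes a small samples-by-genes matrix of log expression values. The toy numbers, the sample order and the function name `pca_scores` are hypothetical and are not taken from the study. The sketch only shows how the first principal component can separate samples by infection status rather than by sexual phenotype when infection shifts many genes at once.

```python
import numpy as np

def pca_scores(expr):
    """Return PC scores (samples x components) and explained-variance ratios.

    expr: 2-D array, rows = samples, columns = genes (e.g. log expression).
    """
    x = np.asarray(expr, dtype=float)
    centered = x - x.mean(axis=0)            # center each gene across samples
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                            # sample coordinates on each PC
    var = s ** 2
    return scores, var / var.sum()

# Hypothetical toy data: 4 pooled samples x 6 genes (invented values).
# F and M are uninfected, FW and MW are infected.
samples = ["F", "M", "FW", "MW"]
expr = np.array([
    [5.0, 5.1, 4.9, 5.0, 2.0, 2.1],   # F
    [5.1, 5.0, 5.0, 4.9, 2.1, 3.0],   # M
    [8.0, 8.1, 7.9, 8.0, 2.0, 2.0],   # FW
    [8.1, 7.9, 8.0, 8.1, 2.1, 2.2],   # MW
])

scores, ratio = pca_scores(expr)
pc1 = dict(zip(samples, scores[:, 0]))
for name in samples:
    print(name, round(pc1[name], 3))
print("variance explained by PC1:", round(ratio[0], 3))
```

In this toy matrix the first four genes respond to infection and only the last gene differs in the phenotypic male, so F and M fall on one side of PC1 and FW and MW on the other. Real transcriptome data would also need normalization and filtering steps that are not shown here.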
- Asgharian H., Negri I., Chang P.L., Nuzhdin S.V. Transcriptomic study of Wolbachia-induced male feminization in
the leafhopper Zyginidia pullula. In: Second Symbiosis Workshop, Sierra Nevada Research Institute, Yosemite
National Park, May 19-20, 2012. (Poster)
- Asgharian, H., Chang, P. L., Mazzoglio, P. J., Negri, I. 2014. Wolbachia is not all about sex: male-feminizing
Wolbachia alters the leafhopper Zyginidia pullula transcriptome in a mainly sex-independent manner. Frontiers in
microbiology, 5.
Chapter 2. Evolutionary genomics of the house mosquito Culex pipiens: Population structure
and recent adaptive evolution. Using insights from evolutionary biology and particularly
population genetics for applied purposes is not a new idea. Conservation genetics of species in
danger of extinction or directed evolution of enzymes in the lab to produce superior catalysts
are familiar examples. In a similar way, evolutionary genetics of epidemiologically important
disease vectors can assist in designing more efficient control measures against them. For
instance, if genes known to confer resistance to insecticides in other studied systems appear
among the targets of ongoing positive selection in a vector population, resistance to that agent
can be anticipated before it becomes phenotypically prevalent. Upon accumulation of
knowledge on various functionally important genes, multiple characters (e.g. resistance to
multiple agents) might be detected through a single genomic scan of one pooled sample (at
least as a primary screen).
In our study we sought to answer questions regarding two fundamental population genetic
aspects of C. pipiens: population structure and natural selection. On population structure, we
asked: Does geography, habitat type or biological form mainly determine the organization of
genetic variation in Culex populations? Do genomic data support genetic isolation and
imminent speciation of pipiens and molestus forms, or on the contrary, do we detect considerable
admixture between them? On the matter of natural selection, we asked: What genes and
biological functions are the targets of recent sweeps? To what degree do protein sequence
alterations and gene expression changes contribute to adaptation? What factors are the likely
causes of recent sweeps and do they act congruently or otherwise in different populations?
And finally, does multifunctionality (pleiotropy) constrain a gene’s potential for adaptive
evolution?
To investigate structure and admixture of populations, we used average pairwise Fst values
which reflect allele frequencies as well as a novel haplotype-independent approach for pooled
samples based on phylogenetic analysis of fixed positions in 10kb sliding windows along the
genome. Both methods confirmed isolation of C. pipiens and C. torrentium gene pools but
indicated considerable recent hybridization between certain molestus and pipiens populations
and between C. pipiens and C. quinquefasciatus.
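As a rough illustration of how a pairwise Fst value reflects allele frequencies, the sketch below uses a simple heterozygosity-based definition, Fst = (H_T - H_S) / H_T, combined across loci as a ratio of sums. This estimator and the function name `pairwise_fst` are assumptions made for illustration. The text above does not specify which Fst estimator or pooled-sample correction was used in the study.

```python
import numpy as np

def pairwise_fst(p1, p2):
    """Heterozygosity-based Fst between two populations.

    p1, p2: allele frequencies of the same biallelic sites in each population.
    Sites are combined as a ratio of sums, so monomorphic sites add nothing.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Mean within-population expected heterozygosity at each site.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the two populations combined.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    total = h_t.sum()
    if total == 0:
        return 0.0                     # no variation at all: define Fst as 0
    return float((h_t - h_s).sum() / total)

# Hypothetical allele frequencies at three sites in two pooled samples.
pop_a = [0.20, 0.50, 0.90]
pop_b = [0.80, 0.55, 0.10]
print(round(pairwise_fst(pop_a, pop_b), 3))
```

Identical frequency vectors give 0 and fixed differences at every site give 1, which matches the intuitive range of the statistic. Pooled sequencing data would normally also require corrections for pool size and read depth, which this sketch omits.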
We combined the signals from the Pool-hmm method (which uses allele frequency spectrum
regardless of annotations) and the ratio of nonsynonymous to synonymous polymorphic sites
to detect and confirm genomic targets of recent positive selection, and explored the potential
relevance of the target genes to environmental conditions and physiological traits of Culex. We
established the importance of gene expression regulation in the adaptation process by two
independent observations: high occurrence of non-genic sequences among Pool-hmm hits, and
the enrichment of selection targets for chromatin forming and remodeling factors including
histones, histone-acetyltransferases, histone-deacetylases and histone-methyltransferases. We
investigated the nature of the changes driving the sweeps in histones at the individual
amino acid residue level and showed that they represent a case of parallel evolution in the
two Culex species. Although adaptations are known to be often species- or population- specific,
the tools have been lacking to systematically evaluate the concordance of selection targets, or
alternatively, to point to population-specific genes under selection. We propose to combine
powerful whole genome selection scans in multiple populations with the equally powerful
method of PCA. We performed PCA on Pool-hmm scores and Tajima’s D but it can be as
effectively applied to any selection statistic for comparing selection regimes among
populations, species or taxa. Finally, we presented a pathway-based definition for pleiotropy
and rejected the hypothesis of a negative effect of pleiotropy on the pace of adaptive evolution by
showing that the number of pathways a gene is involved in is correlated neither with the
number of populations it is positively selected in nor the strength of selection when it
exists. The universality of this conclusion needs to be tested in other settings and model
systems in future works.
The findings of this project were presented at a number of conferences and also published in
Proceedings of the Royal Society B:
- Asgharian H., Chang P.L., Lysenkov S., Scobeyeva V., Reisen W.K., Nuzhdin S.V. Recent natural selection in the
Culex pipiens species complex: a comparison across populations. In: 2nd SCalE (Southern California Evolutionary
Genetics and Genomics) meeting, University of Southern California, Los Angeles, CA, March 1st 2014. (Oral
presentation)
- Asgharian H. New methods for the study of population structure and parallel adaptive evolution in populations
from pooled sequencing data. In: 4th International Conference on Proteomics & Bioinformatics, Chicago, Aug 4-6
2014 (accepted abstract).
- Asgharian H., Chang P.L., Lysenkov S., Scobeyeva V., Reisen W.K., Nuzhdin S.V. Evolutionary genomics of Culex
pipiens: Global and local adaptations. In: SMBE 2014 (Society for Molecular Biology and Evolution conference),
Puerto Rico, June 8-12, 2014. (Poster)
- Scobeyeva, V., Asgharian H., Chang, P.L., Reisen, W.K., Lysenkov, S., Nuzhdin S.V. Genomic analysis of life cycle
evolution in Culex mosquitos. In: Cuny, G. G. R. (2014). Evolution and diversity of the Chondrichthyes. 5th Meeting
of the European Society for Evolutionary Developmental Biology, Vienna, Austria. (Poster)
- Asgharian, H., Chang, P. L., Lysenkov, S., Scobeyeva, V., Reisen, W. K., Nuzhdin, S. V. 2015. Evolutionary Genomics
of Culex pipiens: Global and Local Adaptations Associated with Climate, Life History Traits and Anthropogenic
Factors. Proceedings of the Royal Society B. 282 (1810): 20150728.
Chapter 3. Toward high resolution population genomics of archaeological samples: A review.
Tatiana Tatarinova was invited to write this review and she kindly invited me to contribute.
DNA remains are our direct windows to the past, like time capsules. They are the ultimate
evidence that can confirm or reject numerous indirect inferences that evolutionary biologists
routinely make about the past from present data. However, sequencing ancient samples and
analyzing the generated data is more complicated than similar attempts on fresh samples due
to the peculiarities of aged DNA including degradation and fragmentation, chemical
modification and contamination. In this paper, we reviewed the advances of the field of
archaeogenetics and discussed how advances in sequencing technologies are allowing
examination of previously intractable samples such as those recovered from hot climates or
very old ones. We addressed the utility of aDNA to answer several questions including the
issues of human past demography, history and evolution and discussed certain under-reviewed
topics such as human diseases of the past and evolutionary epigenomics. We provide
interesting examples of the value of combining aDNA data with genotype-to-phenotype maps
to make inferences about past characters that do not leave fossils such as behavioral traits or
soft tissue properties. In contrast with the other published reviews on aDNA -which often put
the biggest emphasis on the experimental challenges and solutions- we offer an extensive
discussion of special computational (bioinformatic and data analytic) issues associated with
aDNA studies and provide a relatively comprehensive list of currently available tools. This
review has been accepted for publication in DNA Research.
- Morozova, I., Mikheyev, A., Nikolsky, Y., Ponomarenko, P., Flegontov, P., Asgharian, H., Klyuchnikov, V.,
ArunKumar, G., Prokhorchouk, E., Gankin, Y., Rogaev, E., Tatarinova, T.V. (2016) Toward high-resolution population
genomics using archaeological samples. (Accepted for publication in DNA Research)
Chapter 4. Inferring the effective number of stem cells, somatic mutation rate and cellular
dynamics of regeneration in asexual Planaria. We are applying population genetic theory and
statistical analysis to study the mechanism of body regeneration in Planarian flatworms. Several
species of Planarians have regained attention in recent years owing to their extraordinary
capacity for reconstructing whole bodies from small tissue fragments - promising to be
extremely informative towards the efforts in regenerative medicine. It is estimated that stem
cells comprise about 30% of their body but details of the regeneration process are largely
unknown. For example, it is not clear if all stem cells or only a fraction of them close to the
wound site participate actively in each round of regeneration; whether different stem cells
contribute almost equally to the growing body; or if there are several long-lived genetically
divergent cell lineages coexisting and being co-inherited in a worm body. Due to unavailability
of transgenes for these species, indecisive molecular markers and lack of a high quality
reference genome, many routine molecular and cell biology techniques cannot be applied to
this system yet; so indirect inference may be necessary.
We have modeled each cell as a separate individual and the body of a worm as a population of
cells; and, are attempting to estimate parameters such as number of stem cells, somatic
mutation rate and certain cellular aspects of regeneration such as proportion of active stem
cells and whether they are chosen randomly or in a structured way from the pool of stem cells.
We tried to estimate the effective number of stem cells based on the temporal variance of
allele frequencies across 16 generations sampled every other generation. Preliminary results
proved that sequence data contain sufficient information to make the desired estimates. The
effective number of stem cells estimated by this method varied substantially from generation
to generation with a harmonic mean around 20 which is remarkably smaller than expected
from microscopic observations. Using our estimates for $\pi$ and $N_e$, and the equation $\pi = 2N_e\mu$,
we have calculated a rough lower bound for the somatic mutation rate ($4.552 \times 10^{-6}$). Based on
SNP data we calculated genomic average Tajima’s D to be 2.4. This can be indicative of
balancing selection or a mechanism that guarantees co-inheritance of multiple heterogeneous
cell lineages. This concludes the analytical part of the project.
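The chain of estimates in this paragraph can be sketched as follows. The sketch assumes the commonly used Nei-Tajima Fc statistic and the standard diploid temporal-method formula, Ne = t / (2 (Fc - 1/(2 S0) - 1/(2 St))). The thesis text does not state which variant or sampling correction was actually applied, so the formula choice, the function names and all numeric inputs below are hypothetical. Only the relation between $\pi$, $N_e$ and $\mu$ and the two-generation sampling interval come from the text.

```python
import numpy as np

def temporal_fc(x, y):
    """Nei-Tajima Fc averaged over loci.

    x, y: allele frequencies at the same loci in the earlier and later sample.
    Loci with a zero denominator (fixed in both samples) are skipped.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = (x + y) / 2 - x * y
    keep = denom > 0
    return float(np.mean((x[keep] - y[keep]) ** 2 / denom[keep]))

def temporal_ne(fc, t, s0, st):
    """Effective size from Fc over t generations (assumed diploid form).

    s0, st: number of individuals sampled at the two time points.
    Returns infinity when sampling noise alone explains the observed Fc.
    """
    drift = fc - 1 / (2 * s0) - 1 / (2 * st)
    if drift <= 0:
        return float("inf")
    return t / (2 * drift)

def harmonic_mean(values):
    """Harmonic mean, the usual way to average Ne over time intervals."""
    v = np.asarray(values, dtype=float)
    return float(len(v) / np.sum(1 / v))

def mutation_rate(pi, ne):
    """Rearrangement of pi = 2 * Ne * mu."""
    return pi / (2 * ne)

# Hypothetical Fc values for consecutive samples taken two generations apart,
# with 50 individuals assumed per sample.
ne_per_interval = [temporal_ne(fc, t=2, s0=50, st=50) for fc in (0.07, 0.045, 0.12)]
ne_avg = harmonic_mean(ne_per_interval)
print("Ne per interval:", [round(v, 1) for v in ne_per_interval])
print("harmonic mean Ne:", round(ne_avg, 1))
print("mu from a hypothetical pi of 2e-4:", mutation_rate(2e-4, ne_avg))
```

The harmonic mean is dominated by the smallest per-interval values, which is why a few strong bottlenecks can pull the overall estimate well below the typical interval.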
Some of the estimates we have made cast doubt on the validity of certain assumptions in
our analytical model; for example, a highly positive value of Tajima’s D could mean that the
choice of active (effective) stem cells from the pool of stem cells is not random (an assumption
used for estimation of $N_e$ from temporal variance of allele frequencies). It also might mean
non-neutral evolution, in which case the equation $\pi = 2N_e\mu$ will be incorrect and will distort our
estimation of mutation rate. There are also concerns about biases associated with the
reference-free method of SNP calling used here due to the unexpected allele frequency
spectrum shape with the large excess of intermediate frequency alleles. Therefore, it seems
necessary to verify the validity of our analytical findings using simulations; and, co-estimate the
set of parameters we are interested in instead of the sequential estimation solution method we
used analytically. We are planning to employ Approximate Bayesian Computation to estimate
the parameters based on flexible scenarios regarding body structure and selection between cell
lineages within a body whose plausibility can be ascertained through model goodness-of-fit
diagnostics. This project yields results to improve our understanding of body regeneration
through stem cells, and emphasizes the utility of out-of-the-box approaches to research when
common methods fail.
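To make the Tajima's D argument above concrete, the following sketch computes the statistic from per-site allele counts using the standard textbook formula. It assumes counts taken from a sample of n sequences. The function name and the toy counts are hypothetical, and adapting the statistic to pooled or reference-free SNP calls, as used in this project, would require further corrections that are not shown.

```python
import math

def tajimas_d(allele_counts, n):
    """Tajima's D from allele counts at segregating sites.

    allele_counts: count of one allele (1..n-1) at each segregating site.
    n: number of sampled sequences. Returns NaN when D is undefined.
    """
    counts = [c for c in allele_counts if 0 < c < n]
    s = len(counts)
    if s == 0 or n < 4:
        return float("nan")            # variance term is zero for n < 4
    # Mean pairwise differences summed over sites.
    pi = sum(2 * c * (n - c) / (n * (n - 1)) for c in counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return float("nan")
    return (pi - s / a1) / math.sqrt(variance)

# Toy spectra for 10 sequences and 10 segregating sites (invented values).
print("rare alleles only:      ", round(tajimas_d([1] * 10, 10), 2))
print("intermediate frequency: ", round(tajimas_d([5] * 10, 10), 2))
```

An excess of intermediate-frequency alleles inflates the pairwise diversity term relative to the number of segregating sites and pushes D upward, which is the pattern the paragraph above treats as a warning sign for the neutral, randomly sampled model.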
The findings of the analytical part of this project were presented at The Allied Genetics
Conference (TAGC) 2016:
- Asgharian H., Dunham J., Kitapci T.H., Marjoram, P., Nuzhdin S.V. Estimation of effective number of stem cells in
Dugesia worms using temporal variance of allele frequencies. In: TAGC 2016 meeting, Orlando, FL, July 13th-17th
2016. (Oral presentation)
A manuscript has been submitted to Molecular Biology and Evolution, and is currently under
review.
Note: I am presenting this project as my thesis for the Master’s of Biostatistics program.
However, I thought it to be as relevant to population genetics as it is to biostatistics; and
therefore, decided to include it in this document as well.
Chapter 1. Transcriptomic analysis of male-feminizing Wolbachia infection in the leafhopper Zyginidia pullula
Summary
Wolbachia causes the feminization of chromosomally male embryos in several species of
crustaceans and insects, including the leafhopper Zyginidia pullula. In contrast to the relatively
well-established ecological aspects of male feminization (e.g. sex ratio distortion and its
consequences), the underlying molecular mechanisms remain understudied and unclear. We
embarked on an exploratory study to investigate the extent and nature of Wolbachia’s effect
on gene expression pattern in Z. pullula. We sequenced whole transcriptomes from Wolbachia-infected
and uninfected adults. A total of 18147 loci were assembled de novo, including homologs of
several Drosophila sex determination genes. A number of transcripts were flagged as candidate
Wolbachia sequences. Despite the resemblance of Wolbachia-infected chromosomal males to
uninfected and infected chromosomal females in terms of sexual morphology and behavior,
principal component analysis revealed that gene expression patterns did not follow these
sexual phenotype categories. The principal components generated by differentially expressed
genes specified a strong sex-independent Wolbachia effect, followed by a weaker Wolbachia-
sexual karyotype interaction effect. Approaches to further examine the molecular mechanism
of Wolbachia-host interactions have been suggested based on the presented findings.
Keywords: Wolbachia infection, male feminization, Principal Component Analysis (PCA),
Zyginidia pullula transcriptome, transcriptome de novo assembly, host-symbiont interactions
1.1. Introduction
Wolbachia is an intracellular symbiont alpha-proteobacterium that infects a wide range of
arthropods and nematodes (Werren et al. 2008; Schulenburg et al. 2000). It is often
transmitted vertically from females through the eggs to their future progeny; although,
horizontal transfer between hosts has also been documented (Cordaux et al. 2001; Werren et
al. 1995). Studying the mechanism of Wolbachia-host interactions is fascinating for many
reasons. Wolbachia is capable of inducing several intriguing sex-related phenotypes in its
hosts, including male killing (MK), in which infected males die during embryonic or larval stages;
male feminization (MF), that is, the development of genetic males into females; and thelytokous
parthenogenesis (TP), in which infected virgin females produce daughters. All of these
phenotypes distort the progeny sex ratio in favor of females thus ensuring higher transmission
rate of Wolbachia to the next generation of hosts (Werren et al. 2008; White et al. 2013).
Another fascinating effect of the infection is the cytoplasmic incompatibility between gametes
(CI), which results in an aberrant or considerably reduced offspring production, if uninfected
females mate with infected males, or if the parents are infected with different Wolbachia
strains (Werren et al. 2008; White et al. 2013). In this case, infected females possess a
reproductive advantage compared to uninfected ones, and this again ensures the spreading of
Wolbachia into the host population. Fast transition between the four phenotypes in the course
of the coevolution of Wolbachia and its hosts hints that similar molecular mechanisms might
underlie the apparently different effects (Ma et al. 2014). Due to its enormous host range,
Wolbachia may have played a crucial role in the evolution of sex determination system and
reproductive strategies in arthropods (Cordaux et al. 2011; Awrahman et al. 2014; Ma et al.
2014).
Various approaches have been employed to investigate the Wolbachia-host interactions in
naturally infected and uninfected strains (Negri et al. 2006; Riparbelli et al. 2012; Hoffmann et
al. 1990), experimentally inoculated cell lines (Xi et al. 2008; Noda et al. 2002), and antibiotic
treated specimens (Casiraghi et al. 2002; Hoffmann et al. 1990). Although Wolbachia is an
obligate intracellular symbiont in nature, protocols have been developed to keep it viable in
cell-free media for days; however, no replication occurs in the extracellular phase (Rasgon et al.
2006; Gamston and Rasgon 2007). The experimental/analytical techniques comprised a wide
range including classical crossing and fecundity measurements (e.g. Hoffmann et al. 1990; Dunn
et al. 2006), microscopic approaches (in situ hybridizations, electron microscope and
immunohistochemical techniques for bacterium detection inside host cells and tissues, etc.)
(e.g. Negri et al. 2008; Fischer et al. 2011), gene expression analysis (e.g. Xi et al. 2008; Hughes
et al. 2011; Chevalier et al. 2012; Liu et al. 2014; Darby et al. 2012; Kremer et al. 2012; Kremer
et al. 2009), bioinformatic genome sequence annotation and functional prediction (e.g. Foster
et al. 2005; Wu et al. 2004; Klasson et al. 2008), and mathematical modeling of the ecological
consequences of CI or sex ratio distortion (e.g. Turelli 1994; Taylor 1990). Despite all these
efforts, a coherent mechanistic story of Wolbachia’s effect is still lacking. The picture is
incomplete even for CI which occurs in Drosophila and is the most extensively studied
Wolbachia-induced phenotype; although, cytoskeleton reorganization and asynchrony in
nuclear envelope break down and chromosomal condensation of male and female pronuclei
after fertilization have been implicated in the process (Serbus et al. 2008; Werren et al. 2008).
The other three phenomena are less well understood. TP seems to result from induction of
diploidy in species with a haplodiploid sex determination system by production and
development of diploid eggs; that is achieved by altering meiosis to produce diploid gametes
(Weeks and Breeuwer 2001), the abortion of the first mitotic division after chromosomal
duplication (Pannebakker et al. 2004), or the fusion of the two haploid nuclei after first mitosis
of induced eggs (Gottlieb et al. 2002). The molecular bases of MK and MF are least understood
but they are suspected to share certain components as MK is often the result of a lethal and
incomplete attempt at feminization of genetic male embryos (Werren et al. 2008). The most
direct mechanistic evidence comes from the study of male killing Wolbachia in the moth
Ostrinia scapulalis showing that it overrides the karyotypic signal in genetic males to produce
the female dsx isoform (Sugimoto and Ishikawa 2012). This suggests that Wolbachia impacts
the sex determination pathway at or above dsx. Apart from this direct effect on the pivotal sex
determining gene dsx, MK or MF Wolbachia infection is reported to be accompanied with
defective chromatin remodeling (Riparbelli et al. 2012), induction of host immune response
(Chevalier et al. 2012), and epigenetic reprogramming of the host (Negri, Franchini, et al. 2009).
Zyginidia pullula is a leafhopper with XX/XO male heterogametic sex determination system in
which Wolbachia causes feminization of chromosomal males (Negri et al. 2006). Infected
female leafhoppers are morphologically indistinguishable from uninfected females; but
feminized chromosomal males have an intersex phenotype i.e. they have the upper pygofer
appendages, a typical male secondary sexual feature. These appendages show varying degrees
of development, from being fully developed in some specimens to being a barely recognizable
stump in others (Negri et al. 2006). Feminized males with upper pygofer appendages reduced
to a stump have ovaries morphologically similar to uninfected females, whereas those with
prominent appendages possess malformed and probably less functional ovaries (Negri et al.
2008). The “degree of feminization” has been shown to be correlated with Wolbachia density
in the host tissues in several systems (Jaenike 2009). We have previously reported that
Wolbachia instigates epigenetic reprogramming of Z. pullula (Negri, Franchini, et al. 2009;
Negri, Mazzoglio, et al. 2009) and probably interacts with the insect hormone biosynthesis
pathway to stimulate the production of feminizing hormones (Negri et al. 2010; Negri 2012). In
this study, whole transcriptomes of male and female Zyginidia samples (Wolbachia-infected and
uninfected) were analyzed with Illumina deep sequencing technique, in order to understand
the scope and nature of the Wolbachia-induced change in the host gene expression profile.
Our initial idea was that if male feminization is the main consequence of Wolbachia infection,
transcriptomes from the three female types (uninfected females, infected females and
feminized males) should resemble each other and be different from the only phenotypically
male group (uninfected males). In fact, we decided to test the hypothesis that sex reversal is
Wolbachia’s main effect at the transcriptome level. Were this confirmed, we would proceed to
identify differentially expressed genes between the two sexual phenotype groups.
1.2. Methods
1.2.1. Zyginidia specimens
34 overwintering females of Z. pullula were collected in the same grass field in north Italy; and
were reared individually in the laboratory as described in (Negri et al. 2006). Overwintering
females have often mated with several males (rarely with only one). By carefully examining the
progeny, Wolbachia-infected (i.e. all female brood) and uninfected (i.e. male and female brood)
lines were identified. Wolbachia infection was then confirmed by PCR on the mothers and
randomly chosen samples from the brood as described in (Negri et al. 2006). Morphological
investigation as to the presence or absence of upper pygofer appendages led us to separate
feminized males from genetic females in the all-female (i.e. Wolbachia-infected) lines, and
males and females in the uninfected lines. Males from uninfected lines were mated to the
physiologically female progeny of the infected lines (consisting of genetic females and males) at
each generation to produce the next generation of infected females (and feminized males).
This backcrossing to uninfected males was done for at least three generations in the lab. Fifty
adults from each of the four different categories of uninfected females (F), uninfected males
(M), infected females (FW) and feminized (infected) males (MW) were pooled together for RNA
sequencing.
1.2.2. cDNA library preparation and short-read sequencing
cDNA libraries were made from male and female specimens of infected and uninfected
leafhopper lines. Infected males are phenotypically intersex and exhibit different degrees of
feminization depending on the concentration of Wolbachia, ranging from individuals with
functional ovaries to individuals with female secondary sexual characters, but possessing testes.
We used thoroughly feminized infected males for RNA extraction. RNA purification, cDNA
synthesis and Illumina library construction were performed using the protocols of (Mortazavi et
al. 2008), with the following modifications: total RNA, mRNA and DNA were quantified using a
Qubit fluorometer (Invitrogen); mRNA fragmentation was performed using Fragmentation
Reagent (Ambion) for a 3 minute and 50 second incubation at 70°C and subsequently cleaned
through an RNA cleanup kit (Zymo Research); additional DNA and gel purification steps were
conducted using Clean and Concentrator kits (Zymo Research). Each sample library was
sequenced as paired-end 76-base reads on an Illumina Genome Analyzer II.
1.2.3. de novo transcriptome assembly and expression level calculation
Due to the sensitive nature of de novo assembly, it is critical that the reads used to generate
contigs have the highest sequencing quality. Reads were removed from consideration in the de
novo assembly if they had a terminal phred (Ewing and Green 1998) quality value less than 15,
or contained more than 2 unknown nucleotides (i.e. N). Reads were also filtered due to
similarities to known PCR primer and Illumina Adapter sequences. Using the reads pooled from
all of the four samples that were not filtered out, the de novo assembly program Velvet (version
1.0.15) (Zerbino and Birney 2008) was used in conjunction with a custom post-processing
algorithm capable of retaining information from alternative splices (Sze et al. 2012) to assemble
short reads into contigs, using sequence overlap information until the contigs can no longer be
extended. Velvet was run with a k-mer length of 35 and the following settings: -cov_cutoff
auto -max_branch_length 0 -max_divergence 0 -max_gap_count 0 -read_trkg yes. Read pairs
in which both mates passed filtering were treated as "-shortPaired" with an insert length of
175 bases and a standard deviation of 75 bases. Single-end reads that passed filtering were
treated as "-short".
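The two quality criteria above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline actually used. It assumes that "terminal phred quality" refers to the last base of the read and that quality strings use a phred+33 offset. The function names are hypothetical, and the primer and adapter screen is omitted.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer phred scores.

    Assumption: phred+33 encoding; older Illumina pipelines used other offsets.
    """
    return [ord(ch) - 33 for ch in quality_string]


def passes_filter(seq, quals, min_terminal_q=15, max_n=2):
    """Return True if a read survives the two quality criteria in the text.

    seq   -- read sequence as a string
    quals -- list of integer phred scores, one per base
    """
    if not seq or len(seq) != len(quals):
        return False
    # Criterion 1: the terminal (here: last) base must have phred >= 15.
    if quals[-1] < min_terminal_q:
        return False
    # Criterion 2: the read may contain at most 2 unknown nucleotides (N).
    if seq.upper().count("N") > max_n:
        return False
    return True


# Hypothetical usage: a 4-base read whose last base has phred 20.
print(passes_filter("ACGT", decode_phred33("IIII"[:3] + "5")))
```

A pair would be kept as "-shortPaired" only if both mates pass; a read whose mate fails would fall into the "-short" category.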
With the set of de novo sequences serving as a reference, reads from each of the individual
samples were mapped using the Burrows-Wheeler Aligner (BWA) (Li and Durbin 2009). The
number of reads that mapped to the contigs of each gene was tabulated and normalized to
calculate FPKM (Fragments Per Kilobase Of Exon Per Million Fragments Mapped). Additional
normalization among all samples was performed using the TMM protocol (Trimmed Mean of
M-values) outlined in (Robinson and Oshlack 2010), which takes into account differences in
overall RNA populations across samples and is one of several methods used to evaluate RNA
sequencing data. Normalization was implemented using the edgeR package in R (Robinson et
al. 2010). All statistical analyses and graphs evaluating consistency between samples were
produced using R v2.13.0 (R Development Core team 2011).
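As a worked illustration of the FPKM definition, the sketch below computes FPKM from a fragment count, a locus length and the total number of mapped fragments. It is a minimal example under stated assumptions: the function names and the input dictionaries are hypothetical, the total is taken as the sum of the counts supplied, and the subsequent TMM scaling (available in edgeR through calcNormFactors) is not reproduced here.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments Per Kilobase per Million fragments mapped.

    FPKM = fragments / (length_bp / 1e3) / (total_mapped_fragments / 1e6)
         = fragments * 1e9 / (length_bp * total_mapped_fragments)
    """
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def fpkm_table(counts, lengths):
    """Compute FPKM for every locus of one sample.

    counts  -- dict mapping locus id to number of mapped fragments
    lengths -- dict mapping locus id to consensus length in bp
    """
    total = sum(counts.values())
    return {locus: fpkm(n, lengths[locus], total) for locus, n in counts.items()}


# Hypothetical example: 500 fragments on a 1,000 bp locus, 10 million mapped.
print(fpkm(500, 1000, 10_000_000))
```

Because FPKM divides by both length and library size, a locus twice as long with twice as many fragments receives the same value.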
1.2.4. Gene functional annotation and classification
Blast2GO v.2 (Götz et al. 2008) and WEGO (Ye et al. 2006) were used to obtain Gene Ontology
(GO) annotations. Genes were also annotated using a BLASTX search (Altschul et al. 1990)
(Expected value <1.00e-05) to the nr protein database available from GenBank as well as to the
set of protein sequences available from the Drosophila melanogaster 5.34 and the pea aphid
Acyrthosiphon pisum 2.1 releases. We chose the annotation with the highest BLAST score as
long as the span of the alignment was greater than 80% of the length of the contig under query.
For genes that did not report any hits, we lowered the minimum span to 40% of the length,
choosing the annotation with the highest BLAST score having Expected value <1.00e-05.
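The two-tier selection rule can be written compactly. The following sketch is only an illustration of the rule as stated, not the script that was used. The hit fields (subject, qstart, qend, evalue, bitscore) are hypothetical names, and the alignment span is assumed to be measured in query (contig) coordinates.

```python
def query_span_fraction(hit, query_len):
    """Fraction of the query (contig) covered by the alignment.

    abs() is used because reverse-frame BLASTX hits report qstart > qend.
    """
    return (abs(hit["qend"] - hit["qstart"]) + 1) / query_len


def choose_annotation(hits, query_len, max_evalue=1e-5, min_spans=(0.80, 0.40)):
    """Pick the best-scoring hit, trying the 80% span first, then 40%.

    Returns the chosen hit dictionary, or None if no hit qualifies.
    """
    for min_span in min_spans:
        eligible = [
            h for h in hits
            if h["evalue"] < max_evalue
            and query_span_fraction(h, query_len) > min_span
        ]
        if eligible:
            # Among eligible hits, keep the one with the highest BLAST score.
            return max(eligible, key=lambda h: h["bitscore"])
    return None
```

Note that under this rule a shorter alignment with a higher score is ignored whenever any hit satisfies the 80% span criterion.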
1.2.5. Principal component analysis of gene expression values
Expression values were cleaned of extreme outliers, quantile-normalized and log-transformed
before they were used for PCA. To make sure the results were not artifacts of the data
preparation method, PCA was repeated on the raw (not normalized, not log-transformed)
expression values as well as after several different outlier-filtering and normalization strategies.
These statistical procedures were done in SAS 9.3.
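For readers unfamiliar with the preprocessing, the sketch below shows quantile normalization (the procedure named in the caption of Figure 1.2) followed by a log transformation, in plain Python. It is not the SAS code used for the analysis. The pseudocount of 1, the base-2 logarithm and the tie handling (ties broken by input order rather than averaged) are assumptions made for illustration.

```python
import math


def quantile_normalize(columns):
    """Force every sample to share the same empirical distribution.

    columns -- list of equal-length lists, one list of expression values
               per sample (here there would be four: FW, MW, F and M).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # Indices of this sample's values from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def log_transform(columns, pseudocount=1.0):
    """log2(x + pseudocount), so that zero expression maps to zero."""
    return [[math.log2(v + pseudocount) for v in col] for col in columns]


# Hypothetical toy data: two samples, three loci.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After this step each sample has identical sorted values, so any remaining structure picked up by PCA reflects the rank order of loci within samples rather than differences in overall scale.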
1.3. Results
1.3.1. Short-read sequencing and de novo assembly
The mRNA population was analysed with Illumina deep sequencing of male and female Zyginidia
samples with and without Wolbachia infection. The pooled data from all samples had a total of
50M paired-end reads that were 76 bases long. All Illumina sequences are available for
download at the NCBI Short Read Archive under the BioProject PRJNA171390. After sequences
were filtered based on quality and matches to adapter and primer sequences, the 38M reads
from all four samples were pooled together and run through Velvet and the post-processing
algorithm. Eventually, 18,147 loci and a total of 27,236 transcripts were assembled; multiple
transcripts of a locus pertained often to different splicing isoforms and occasionally to largely
differentiated alleles. The transcripts ranged in length from 291bp to 15,389bp, with mean
and median lengths of 1,006bp and 702bp, respectively. This assembly included a fairly large
number of long transcripts: 25% were longer than 1,250bp and 10% were longer than 2,000bp.
Of the 18,147 loci, 14,068 (77.5%) had a single isoform and the remaining 22.5% had multiple
ones. Transcripts within a locus were subsequently collapsed into a single "representative locus
sequence" by using ClustalW to run a multiple sequence alignment and identifying the locus
consensus sequence. Mean and median lengths of consensus sequences were 900bp and
618bp, respectively. The total length of all loci consensus sequences was 16.3Mb.
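The isoform proportions follow directly from the reported counts, as the short calculation below shows. The derived quantities (number of multi-isoform loci, mean transcripts per locus) are simple arithmetic on the figures given above and are not separately reported results.

```python
total_loci = 18147
single_isoform_loci = 14068
total_transcripts = 27236

# Loci with more than one assembled transcript.
multi_isoform_loci = total_loci - single_isoform_loci

single_percent = 100.0 * single_isoform_loci / total_loci
multi_percent = 100.0 * multi_isoform_loci / total_loci

# Every single-isoform locus contributes exactly one transcript, so the
# remaining transcripts belong to the multi-isoform loci.
multi_isoform_transcripts = total_transcripts - single_isoform_loci
mean_transcripts_per_locus = total_transcripts / total_loci
mean_transcripts_per_multi_locus = multi_isoform_transcripts / multi_isoform_loci

print(round(single_percent, 1), round(multi_percent, 1))
print(round(mean_transcripts_per_locus, 2), round(mean_transcripts_per_multi_locus, 2))
```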
1.3.2. Gene functional annotation and classification
A total of 6,946 loci, corresponding to 38% of the entire dataset, were Gene Ontology annotated with
Blast2GO. The consensus sequences were also aligned using a BLASTX search to the nr protein
database available from GenBank as well as to the set of protein sequences available from
Flybase and the aphid genome. Table 1.1 shows the proportion of cases that resulted in a hit
where the length of the alignment was greater than 80% or 40% of the length of the query
(leafhopper sequence). One might very crudely attribute the 80% alignment span hits to true
genic homology and the 40% alignment span hits to conserved domains.
Table 1.1. Summary statistics of Zyginidia transcripts homology search. 40% homology length indicates that the
length of the homologous segment covers at least 40% of the query (leafhopper) sequence. A correspondingly
similar definition applies to the 80% homology length category. The percent values in the table cells show what
percent out of all loci (18147) fit each criterion when Blasted against the designated dataset.
Dataset | 40% homology length | 80% homology length
Genbank | 60% | 39%
Flybase | 48% | 26%
Pea Aphid | 81% | 32%
A number of genes potentially involved in the leafhopper sex determination were identified
through homology search with the Drosophila sex determination genes. Although pea aphid is
Zyginidia’s closest relative with a reference genome sequence (The International Aphid
Genomics Consortium 2010), the functional annotation for this genome is not as complete as
that of Drosophila. Sex determining genes of pea aphid have been found based on homology
with Drosophila sequences and lack direct experimental verification (The International Aphid
Genomics Consortium 2010). Therefore, we decided to use Drosophila sequences as the
reference set. Fig. 1.1 depicts the canonical sex determination pathway in Drosophila.
Homologs of several Drosophila sex determination genes were identified among the transcripts
including dsx (doublesex), tra-2 (transformer-2), vir (virilizer), fl(2)d (female lethal d), snf (sans
fille) and Ix (intersex). No leafhopper homologs could be identified for tra (transformer), sxl (sex
lethal), fru (fruitless) or her (hermaphrodite). Table 1.2 shows the expression levels for the
identified leafhopper sex determination genes.
Figure 1.1. Sex determination pathway in Drosophila, from (Sánchez, 2008). SxlF and SxlM refer to functional
female and nonfunctional male isoforms of the Sxl protein, respectively. TraF is the functional female form of
the Transformer protein, which in conjunction with the constitutive gene product Tra-2 controls female-specific
splicing of Dsx and Fru. Snf, Vir and FL(2)D are required for late female-specific splicing of Sxl but play no part in
determining early Sxl splicing pattern. The genes for which Z. pullula homologs have been identified in this
study, are boxed in grey. For more details on the regulation and function of these genes, refer to (Gempe and
Beye, 2011; Sánchez, 2008) or other similar resources.
Table 1.2. Homologs of Drosophila sex determination genes in the Zyginidia transcriptome and their normalized
expression levels. FW: infected female, MW: infected feminized male (intersex female), F: uninfected female, M:
uninfected male.
Locus | Fly homolog | Multiple isoforms? | FW | MW | F | M
5652 | Dsx | Y | 11.6 | 49.4 | 11.7 | 19.7
8743 | tra-2 | Y | 5.9 | 13.2 | 8.8 | 26.1
5015 | Vir | N | 20.2 | 9.9 | 16.1 | 23.2
10229 | fl(2)d | N | 17.4 | 32.9 | 40.9 | 63.3
21060 | Snf | Y | 0 | 3.3 | 20.5 | 10.9
18743 | Ix | N | 5.9 | 11.6 | 81.8 | 43.9
Seventeen genes in our dataset were flagged as likely Wolbachia sequences according to the
Blast results against the NCBI dataset. Bacterial origin seems very probable for a number of
these transcripts based on the expression levels in infected and uninfected lines, plus high
similarity to known Wolbachia sequences (Table 1.3). These sequences were Blasted against
the aphid genome to check if there was an indication of horizontal transfer; they were also
Blasted against the Drosophila genome as a distant outgroup (Table 1.3).
1.3.3. Principal component analysis of transcriptomes
Principal component analysis on transcriptomes of the four leafhopper samples surprisingly
revealed that Wolbachia infection changes the host transcriptome extensively and the effect is
by no means limited to sex-reversal. As evident in Fig. 1.2, the first PC (explaining 66.46% of
variance) is highly correlated with all of the samples indicating that the expression of most
genes is not significantly altered by Wolbachia and is similar across all samples. By the second
PC (explaining 20.36% of variance), Wolbachia infected male and female samples cluster
together and uninfected male and female cluster together. This PC is generated by genes
whose expression is changed by Wolbachia consistently regardless of sexual karyotype or
phenotype. The third PC (explaining 7.97% of variance) indicates an interaction term: F and M
are similar and stand in the middle of the scale, with MW and FW occupying the opposite sides
of them. This PC is generated by genes that are expressed similarly in uninfected males and
females, and Wolbachia infection changes their expression in opposite ways in chromosomal
males and females. Overall, sex inversion does not seem to be the only or even the biggest
effect of Wolbachia on gene expression patterns in Zyginidia, even if it is the most conspicuous
phenotypic consequence; otherwise, we would expect the three phenotypically female groups
(F, FW and MW) to cluster together and the only male group (M) to stand separate from them.
None of the PCs show such a pattern. PCA was repeated on expression values without the
initial outlier filtering, and applying several different normalization and transformation
strategies; they all yielded the same picture as described above: the main effect was invariably
the presence or absence of Wolbachia regardless of sex (details not shown).
Figure 1.2. Principal component analysis of gene expression levels in four Zyginidia samples: infected females
(FW), infected males or intersex females (MW), uninfected females (F) and uninfected males (M). The first three
PCs are depicted here. PC1 is correlated with all sample labels but PC2 separates the infected and uninfected
samples conclusively. PC3 reveals an interaction term between Wolbachia status and sexual karyotype. The
“_ql” suffix after line names means that the expression values were quantile-normalized and log transformed.
Table 1.3. Expression levels and aphid and fly homologs of loci whose best Genbank hit was a Wolbachia sequence. Aphid and fly homologs were
identified by Blasting the sequence on Genbank using the species inclusion option. Aphid hits were double-checked by Blasting on Aphidbase.com.
Accession numbers for aphid homologs are from Aphidbase. Other accession numbers are from Genbank. A hit is reported only if the Expected value is <1E-5.
Q%: Query coverage percent; Id%: Identity percent; E: Expected value. FW, MW, F and M are expression levels. Aphid: Acyrthosiphon pisum homolog; Fly: Drosophila melanogaster homolog.
Locus | FW | MW | F | M | Genbank best hit | Genbank accession | Genbank Q% | Genbank Id% | Genbank E | Aphid accession | Aphid Q% | Aphid Id% | Aphid E | Fly accession | Fly Q% | Fly Id% | Fly E
1053 | 146.6 | 138.2 | 0 | 0.2 | GroEL [Wolbachia endosymbiont of Bemisia tabaci] | AFQ62607.1 | 62 | 93 | 5E-142 | ACYPI009253 | 57 | 37 | 2E-35 | CAA67720.1 | 57 | 39 | 1E-37
1097 | 51.8 | 37.9 | 1.5 | 0.6 | Outer surface protein precursor [Wolbachia endosymbiont of Nasonia vitripennis] | ABF61215.1 | 88 | 83 | 8E-47 | - | - | - | - | - | - | - | -
1331 | 60.4 | 36.2 | 0 | 0 | Molecular chaperone GroeL [Wolbachia endosymbiont of Nasonia vitripennis] | WP_010401204.1 | 99 | 99 | 9E-75 | ACYPI009253 | 99 | 44 | 7E-24 | CAA67720.1 | 97 | 48 | 4E-26
13961 | 34.6 | 29.6 | 1.5 | 1.0 | 30S ribosomal protein S12 [Wolbachia endosymbiont of Culex quinquefasciatus] | WP_007301994.1 | 93 | 100 | 2E-59 | ACYPI007219 | 83 | 53 | 2E-24 | NP_525050.1 | 75 | 52 | 4E-20
22635 | 0 | 8.3 | 40.9 | 55.1 | Ankyrin motif protein [Wolbachia endosymbiont of Cadra cautella] | BAH22252.1 | 62 | 49 | 6E-92 | ACYPI21453 | 31 | 33 | 1E-09 | NP_648366.2 | 24 | 31 | 3E-09
23163 | 0 | 3.3 | 0 | 17.4 | Ankyrin domain protein [Wolbachia pipientis] | AEX55220.1 | 93 | 36 | 2E-26 | ACYPI38268 | 88 | 33 | 9E-21 | NP_787123.1 | 87 | 31 | 8E-23
2326 | 11.6 | 14.8 | 1.5 | 4.1 | SD27140p [Wolbachia endosymbiont of Drosophila ananassae]* | WP_007550719.1 | 53 | 58 | 2E-26 | ACYPI56754 | 98 | 43 | 1E-29 | CAA09069.1 | 93 | 40 | 1E-22
2673 | 86.3 | 39.5 | 0 | 0.3 | Hypothetical protein [Wolbachia pipientis] | WP_006013682.1 | 98 | 68 | 4E-55 | - | - | - | - | - | - | - | -
26930 | 0 | 9.9 | 22.0 | 5.0 | Ankyrin repeat domain protein [Wolbachia endosymbiont of Nasonia vitripennis] | WP_010405254.1 | 61 | 44 | 1E-13 | ACYPI38268 | 65 | 44 | 9E-14 | NP_001246787.1 | 60 | 42 | 3E-14
276 | 175.4 | 187.6 | 17.6 | 7.6 | Unnamed protein product [Wolbachia endosymbiont of Callosobruchus chinensis] | BAC22720.1 | 65 | 41 | 4E-49 | ACYPI28967 | 19 | 40 | 4E-12 | CAC16870.1 | 64 | 29 | 3E-31
2876 | 0 | 6.6 | 7.3 | 7.9 | MULTISPECIES: ankyrin [Wolbachia]** | WP_007302786.1 | 99 | 35 | 2E-19 | ACYPI000387 | 99 | 35 | 1E-19 | AHN57996.1 | 99 | 33 | 5E-19
295 | 0 | 19.8 | 0 | 0 | Hypothetical protein [Wolbachia pipientis]*** | WP_019236989.1 | 46 | 99 | 6E-103 | - | - | - | - | - | - | - | -
3433 | 69.1 | 23.1 | 0 | 0 | Hypothetical protein [Wolbachia endosymbiont of Cadra cautella] | BAH22204.1 | 66 | 84 | 1E-24 | - | - | - | - | - | - | - | -
345 | 71.9 | 13.2 | 3.0 | 0 | Ankyrin motif protein [Wolbachia endosymbiont of Cadra cautella] | BAH22317.1 | 100 | 47 | 5E-17 | ACYPI001311 | 95 | 35 | 1E-08 | NP_608900.3 | 78 | 34 | 9E-08
4382 | 51.8 | 39.5 | 0 | 0 | Cold-shock domain family protein [Wolbachia endosymbiont of Drosophila simulans] | WP_015588193.1 | 50 | 99 | 4E-40 | ACYPI000791 | 40 | 45 | 2E-11 | AAL28370.1 | 40 | 47 | 1E-10
5403 | 51.8 | 32.9 | 7.3 | 1.2 | Hypothetical protein [Wolbachia endosymbiont of Drosophila simulans] | WP_015588107.1 | 99 | 97 | 1E-62 | - | - | - | - | - | - | - | -
5490 | 26.0 | 46.1 | 0 | 0 | Heme biosynthesis protein HemY [Wolbachia pipientis] | WP_006014258.1 | 99 | 96 | 5E-60 | - | - | - | - | NP_651267.1 | 41 | 35 | 1E-07
* The second best hit is the reported Wolbachia sequence. The first hit is >ref|XP_005192252.1| PREDICTED: uncharacterized protein LOC101899042, partial [Musca domestica].
** The second best hit is the reported Wolbachia sequence. The first hit is >ref|WP_012472427.1| hypothetical protein [Candidatus Amoebophilus asiaticus].
*** Although the best hit is the reported hypothetical protein, the second best hit and many after that are annotated as (putative) transposases.
1.4. Discussion
We assembled the Z. pullula transcriptome de novo and produced 18,147 loci and 27,236
transcripts with a total consensus sequence length of 16.3Mb. These numbers were well within
the expected range based on the aphid genome information. The aphid genome was reported
to contain 11,089 highly supported RefSeq gene models with a total exonic length of 21.6Mb;
adding the gene models from six other gene prediction programs, a total of 34,604 non-
redundant gene models with the total exonic length of 35.7Mb were described (The
International Aphid Genomics Consortium 2010). The true number of genes purportedly lies
between those two estimates. Hence, our de novo assembly of the transcriptome
seems to have captured a reasonable proportion of the expressed genes.
The results of sequence homology search (Table 1.1) confirm the closer relatedness of Z. pullula
to A. pisum (the aphid) than to Drosophila. A caveat to this analysis is the extensive set of
duplications in the aphid genome (The International Aphid Genomics Consortium 2010).
Without a leafhopper reference genome, we do not know if the same wave of duplications has
affected Z. pullula or not; however, there was an indication in our data that it might have. By
visual inspection of the sequences that were annotated as isoforms of a single locus
computationally, we realized that some of them did not show signatures of known alternative
splicing patterns; but looked like highly differentiated alleles (details not shown). These may
indeed be paralogous sequences in the process of divergence. Further investigation, including
the sequencing of single individuals rather than pools of them, will be required to separate
paralogy from allelic variation.
A number of leafhopper sex determination genes were identified based on homology with fly
sequences (Table 1.2). Insect sex determination machinery has evolved around the
transformer-doublesex axis (Sánchez 2008); tra is the fast evolving component responsible for
receiving the signal (sometimes through mediators) from the upstream sex determining factors
(chromosomal constituent, incubation temperature, etc.), and dsx is the conserved switch that
relays this signal down to the developmental processes (Verhulst et al. 2010; Sánchez 2008). It
is, therefore, not surprising that we found a homolog for dsx and not for tra in our dataset. The
short length of the aligned segments prevented reliable assignment of male and female
isoforms; but these initial results can be used to design primers to extract the whole genes from
the leafhopper genome. Future experiments can then follow the flow of the signal in the sex
determination pathway to identify where the cascade is diverted to female development in
Wolbachia-infected genetic males. In the moth O. scapulalis, the impact point is somewhere
above the level of dsx (Sugimoto and Ishikawa 2012). Having the sequences of dsx male and
female isoforms, one could check whether this is also true in leafhoppers. Unfortunately, the
lack of replicates in our preliminary data makes it impossible to assess the significance of
differential expression of genes across our four groups (FW, MW, F and M). This is another task
that remains to be done in future projects. In addition, development of X-linked sequence
markers will enable early sexing of the embryos (based on the female XX / male XO karyotypes)
through quantitative PCR; and facilitate the study of early developmental processes in infected
and uninfected specimens.
We found a number of Wolbachia-related transcripts in the sequenced cDNA libraries (Table
1.3). The loci expressed mainly in infected lines with great similarity to known Wolbachia
sequences are likely to have Wolbachia origin (e.g. loci 1053, 1097, 1331 and 13961). Curiously,
a couple of loci are expressed primarily in the uninfected lines (e.g. locus 22635). At this point,
we do not have a hypothesis as to the reason behind this observation. Repeating the
experiments with replicates and higher sequencing depth would be the first step to confirm the
reproducibility of these patterns. Our protocol of mRNA purification for creation of cDNA
libraries involved a hybridization step with oligo-T ligands, which targets the eukaryotic mRNA
poly-A tails; therefore, it will be necessary to employ a different purification strategy in order to
capture most of the poly-A lacking bacterial mRNAs. Table 1.3 shows that several of the
Wolbachia-related sequences code for Ankyrin-repeat proteins. Wolbachia genomes are well
known for containing an extraordinarily high number of these genes (Wu et al. 2004; Iturbe-
Ormaetxe et al. 2005). Gene transfer between Wolbachia and mosquito hosts has been
previously reported (Woolfit et al. 2009). PCR experiments and phylogenetic analyses have
confirmed horizontal gene transfer from bacterial endosymbionts to the aphid genome
(The International Aphid Genomics Consortium 2010). Similar approaches will be required to confirm bacterial or insect origin for
the transcripts listed in Table 1.3. We tried to check for possible aphid lineage-specific
horizontal transfers by asking whether a likely Wolbachia transcript shows high sequence
similarity to an aphid sequence, but not a fly sequence; none of the loci in Table 1.3 showed
such a pattern. One of the Wolbachia-related transcripts showed a degree of homology with
the aphid vasa gene (locus 4382). Almost identical homologs of this sequence exist in the three
published Wolbachia genomes (Blast results not shown); its homologs in fly, leafhopper and the
published Wolbachia genomes are characterized or predicted ATP-dependent RNA helicases.
vasa has been implicated in transmission of maternal effects and sex determination in clams
(Milani et al. 2011). It will be very interesting to check if products of host-homologous genes
are actually exported out by Wolbachia into the host cell.
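The expression-based part of this reasoning (loci expressed mainly in infected lines are more likely to be of Wolbachia origin) can be made explicit with a simple log ratio. The sketch below uses four rows of Table 1.3. The pseudocount of 1 and the two-fold threshold are arbitrary choices made for illustration, and because the data lack replicates the output is descriptive only and carries no statistical significance.

```python
import math


def infection_log_ratio(fw, mw, f, m, pseudocount=1.0):
    """log2 of mean infected (FW, MW) over mean uninfected (F, M) expression."""
    infected = (fw + mw) / 2.0
    uninfected = (f + m) / 2.0
    return math.log2((infected + pseudocount) / (uninfected + pseudocount))


def classify(fw, mw, f, m, threshold=1.0):
    """Label a locus by the direction of its infection-associated bias."""
    ratio = infection_log_ratio(fw, mw, f, m)
    if ratio >= threshold:
        return "infected-biased"
    if ratio <= -threshold:
        return "uninfected-biased"
    return "unbiased"


# Expression values (FW, MW, F, M) copied from Table 1.3.
table_1_3_subset = {
    1053: (146.6, 138.2, 0.0, 0.2),
    1331: (60.4, 36.2, 0.0, 0.0),
    13961: (34.6, 29.6, 1.5, 1.0),
    22635: (0.0, 8.3, 40.9, 55.1),
}

for locus, values in table_1_3_subset.items():
    print(locus, classify(*values))
```

Under these assumptions loci 1053, 1331 and 13961 are labelled infected-biased, whereas locus 22635 is labelled uninfected-biased, in line with the description above.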
We used natural isolates of infected and uninfected leafhoppers for our comparisons with no
antibiotic treatment. This relieved our comparisons from the confounding effects of antibiotic
treatments on the host physiology. The rationale behind the traditional use of antibiotics to
cure the infected lines from Wolbachia is to obtain infected and uninfected lines with the same
genetic background. However, antibiotics can change the host physiology substantially, and
quite remarkably, their effect can perpetuate through several generations of unexposed
progeny (Zeh et al. 2012; Ballard and Melvin 2007; Fridmann-Sirkis et al. 2014). We avoided the
use of antibiotics completely and achieved homogeneous genetic backgrounds among samples
by taking advantage of repeated backcrossing of infected females to uninfected males. We
collected all of our founder specimens from the same leafhopper population in a grass field. In
the sampled population, the sex-ratio was only moderately female biased, with a moderate
prevalence of the infection (~1:1.8 male:female, Wolbachia infection rate ~30% of the collected
females; Negri I., unpublished data). As uninfected males are the only physiological males in
existence, all the “egg-laying females” (in the field and in the lab, including the females used in
this study) always mate with (and only with) uninfected males. Thus, all of our infected and
uninfected lines come from the same genetic background. We carried out three further
generations of backcrossing of infected females to uninfected males in the lab to effectively
remove any residual genetic variation between the two groups. Details of rearing conditions
are described in (Negri et al. 2006). The natural pattern of sexual reproduction and the
additional backcrossing done in the lab ensure the similarity of nuclear genetic backgrounds.
We also tested mitochondrial gene sequences in Zyginidia samples from different Italian
localities, both infected and uninfected, and they were all nearly identical (Negri I., unpublished
data).
Through principal component analysis, we have shown that Wolbachia-induced changes in the
host transcriptome are mainly sex-independent, and cannot be explained only by the sex
reversal of genetic males. Previous transcriptomic studies on Wolbachia have reported changes
in the expression of genes unrelated to the reproductive phenotype. For instance, Wolbachia
infection in Armadillidium vulgare triggered the overexpression of immune-related genes
(Chevalier et al. 2012). In the parasitoid wasp Asobara tabida, endosymbiont infection or lack
thereof was associated with changes in expression of genes related to female reproductive
development, iron and oxidative stress regulation, and immune recognition (Kremer et al. 2012;
Kremer et al. 2009). Artificial infection of Anopheles cell cultures by Wolbachia surprisingly
caused down-regulation of immune, stress response and detoxification genes (Hughes et al.
2011). Wolbachia-inoculated Drosophila cell lines exhibited differential expression of several
GO categories not directly related to reproduction, including antimicrobial humoral response,
ion homeostasis, response to unfolded protein and response to chemical stimulus (Xi et al.
2008). In Aedes aegypti, Wolbachia was shown to manipulate the expression of a
metalloprotease gene through induction of a specific host miRNA (Hussain et al. 2011). Apart
from such direct evidence, the observation of various forms of fitness cost in the feminized
males is consistent with the idea that sex reversal is not the sole effect of feminizing Wolbachia
(Rigaud and Moreau 2004; Moreau et al. 2001). Nevertheless, our study is the first one to
quantitatively demonstrate that infection itself has a larger effect than that of sex reversal,
through PCA of all of the available gene expression levels.
Lack of replicates meant that we could not quantitatively identify differentially expressed genes
between the lines because we could not calculate variances. Instead, we focused on the global
patterns of gene expression by applying PCA to gene expression values. Thousands of loci (each
acting as one observation point) were used to generate the PCs. Antibiotic treatment and
different genetic backgrounds could have been two potential sources of systematic bias in this
type of analysis; they could have generated similar clustering patterns and confounded the
interpretation of results. However, through the single-population sampling and the repeated
backcrossing scheme, we avoided both sources of confusion.
Based on the PCA results, we encourage the use of biochemical bottom-up approaches focusing
on the whole Wolbachia effect rather than the specific sex inversion event. Wolbachia’s effect
is presumably mediated by molecules secreted into the host cell or expressed on the outer
membrane surface of the bacterium-containing vesicles. Wolbachia cannot be maintained in
cell-free cultures indefinitely; but there are protocols to keep them alive in synthetic media for
several hours (Rasgon et al. 2006; Gamston and Rasgon 2007). In such a setting, the molecules
released into the medium can be detected and purified using chromatographic and/or mass
spectrometric approaches. Appropriate methods can be used, too, for isolation and
characterization of surface molecules from the bacterium-containing vesicles. Pull-down
experiments on the host proteins by these Wolbachia released or surface molecules might
reveal the initial cellular targets of the endosymbiont-host interaction.
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic Local
Alignment Search Tool. Journal of Molecular Biology, 215, 403–410.
Awrahman, Z. A., Champion de Crespigny, F., and Wedell, N. (2014). The impact of Wolbachia,
male age and mating history on cytoplasmic incompatibility and sperm transfer in
Drosophila simulans. Journal of Evolutionary Biology, 27, 1–10. doi:10.1111/jeb.12270.
Ballard, J. W. O., and Melvin, R. G. (2007). Tetracycline treatment influences mitochondrial
metabolism and mtDNA density two generations after treatment in Drosophila. Insect
Molecular Biology, 16(6), 799–802. doi:10.1111/j.1365-2583.2007.00760.x.
Casiraghi, M., McCall, J. W., Simoncini, L., Kramer, L. H., Sacchi, L., Genchi, C., et al. (2002).
Tetracycline treatment and sex-ratio distortion: a role for Wolbachia in the moulting of
filarial nematodes? International Journal for Parasitology, 32(12), 1457–68.
Chevalier, F., Herbinière-Gaboreau, J., Charif, D., Mitta, G., Gavory, F., Wincker, P., et al. (2012).
Feminizing Wolbachia: a transcriptomics approach with insights on the immune response
genes in Armadillidium vulgare. BMC Microbiology, 12(Suppl 1), S1.
doi:10.1186/1471-2180-12-S1-S1.
Cordaux, R., Bouchon, D., and Grève, P. (2011). The impact of endosymbionts on the evolution
of host sex-determination mechanisms. Trends in Genetics, 27(8), 332–41.
doi:10.1016/j.tig.2011.05.002.
Cordaux, R., Michel-Salzat, A., and Bouchon, D. (2001). Wolbachia infection in crustaceans:
novel hosts and potential routes for horizontal transmission. Journal of Evolutionary
Biology, 14(2), 237–243. doi:10.1046/j.1420-9101.2001.00279.x.
Darby, A. C., Armstrong, S. D., Bah, G. S., Kaur, G., Hughes, M. A., Kay, S. M., et al. (2012).
Analysis of gene expression from the Wolbachia genome of a filarial nematode supports
both metabolic and defensive roles within the symbiosis. Genome Research, 22(12), 2467–
77. doi:10.1101/gr.138420.112.
Dunn, A. M., Andrews, T., Ingrey, H., Riley, J., and Wedell, N. (2006). Strategic sperm allocation
under parasitic sex-ratio distortion. Biology Letters, 2(1), 78–80.
doi:10.1098/rsbl.2005.0402.
Ewing, B., and Green, P. (1998). Base-calling of automated sequencer traces using phred. II.
Error probabilities. Genome Research, 8(3), 186–94.
Fischer, K., Beatty, W. L., Jiang, D., Weil, G. J., and Fischer, P. U. (2011). Tissue and stage-specific
distribution of Wolbachia in Brugia malayi. PLoS Neglected Tropical Diseases, 5(5), e1174.
doi:10.1371/journal.pntd.0001174.
Foster, J., Ganatra, M., Kamal, I., Ware, J., Makarova, K., Ivanova, N., et al. (2005). The
Wolbachia genome of Brugia malayi: endosymbiont evolution within a human pathogenic
nematode. PLoS Biology, 3(4), e121. doi:10.1371/journal.pbio.0030121.
Fridmann-Sirkis, Y., Stern, S., Elgart, M., Galili, M., Zeisel, A., Shental, N., et al. (2014). Delayed
development induced by toxicity to the host can be inherited by a bacterial-dependent,
transgenerational effect. Frontiers in Genetics, 5, 27. doi:10.3389/fgene.2014.00027.
Gamston, C., and Rasgon, J. (2007). Maintaining Wolbachia in cell-free medium. Journal of
Visualized Experiments, 5, 223. doi:10.3791/223.
Gempe, T., and Beye, M. (2011). Function and evolution of sex determination mechanisms,
genes and pathways in insects. BioEssays, 33(1), 52–60. doi:10.1002/bies.201000043.
Gottlieb, Y., Zchori-Fein, E., Werren, J. H., and Karr, T. L. (2002). Diploidy restoration in
Wolbachia-infected Muscidifurax uniraptor (Hymenoptera: Pteromalidae). Journal of
Invertebrate Pathology, 81(3), 166–74.
Götz, S., García-Gómez, J. M., Terol, J., Williams, T. D., Nagaraj, S. H., Nueda, M. J., et al. (2008).
High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic
Acids Research, 36(10), 3420–35. doi:10.1093/nar/gkn176.
Hoffmann, A. A., Turelli, M., and Harshman, L. G. (1990). Factors affecting the distribution of
cytoplasmic incompatibility in Drosophila simulans. Genetics, 126(4), 933–48.
Hughes, G. L., Ren, X., Ramirez, J. L., Sakamoto, J. M., Bailey, J. A., Jedlicka, A. E., et al. (2011).
Wolbachia infections in Anopheles gambiae cells: transcriptomic characterization of a
novel host-symbiont interaction. PLoS Pathogens, 7(2), e1001296.
doi:10.1371/journal.ppat.1001296.
Hussain, M., Frentiu, F. D., Moreira, L. A., O’Neill, S. L., and Asgari, S. (2011). Wolbachia uses host
microRNAs to manipulate host gene expression and facilitate colonization of the dengue
vector Aedes aegypti. Proceedings of the National Academy of Sciences of the United
States of America, 108(22), 9250–5. doi:10.1073/pnas.1105469108.
Iturbe-Ormaetxe, I., Burke, G. R., Riegler, M., and O’Neill, S. L. (2005). Distribution, Expression,
and Motif Variability of Ankyrin Domain Genes in Wolbachia pipientis. Journal of
Bacteriology, 187(15), 5136–5145. doi:10.1128/JB.187.15.5136.
Jaenike, J. (2009). Coupled population dynamics of endosymbionts within and between hosts.
Oikos, 118(3), 353–362. doi:10.1111/j.1600-0706.2008.17110.x.
Klasson, L., Walker, T., Sebaihia, M., Sanders, M. J., Quail, M. A., Lord, A., et al. (2008). Genome
evolution of Wolbachia strain wPip from the Culex pipiens group. Molecular Biology and
Evolution, 25(9), 1877–87. doi:10.1093/molbev/msn133.
Kremer, N., Charif, D., Henri, H., Gavory, F., Wincker, P., Mavingui, P., et al. (2012). Influence of
Wolbachia on host gene expression in an obligatory symbiosis. BMC Microbiology,
12(Suppl 1), S7. doi:10.1186/1471-2180-12-S1-S7.
Kremer, N., Voronin, D., Charif, D., Mavingui, P., Mollereau, B., and Vavre, F. (2009). Wolbachia
interferes with ferritin expression and iron metabolism in insects. PLoS Pathogens, 5(10),
e1000630. doi:10.1371/journal.ppat.1000630.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics, 25(14), 1754–60. doi:10.1093/bioinformatics/btp324.
Liu, C., Wang, J. L., Zheng, Y., Xiong, E. J., Li, J. J., Yuan, L. L., et al. (2014). Wolbachia-Induced
Paternal Defect in Drosophila is likely by Interaction with the Juvenile Hormone Pathway.
Insect Biochemistry and Molecular Biology, 49, 49–58. doi:10.1016/j.ibmb.2014.03.014.
Ma, W.J., Vavre, F., and Beukeboom, L. W. (2014). Manipulation of arthropod sex
determination by endosymbionts: diversity and molecular mechanisms. Sexual
Development, 8(1-3), 59–73. doi:10.1159/000357024.
Milani, L., Ghiselli, F., Maurizii, M. G., and Passamonti, M. (2011). Doubly Uniparental
Inheritance of Mitochondria As a Model System for Studying Germ Line Formation. PloS
One, 6(11), e28194. doi:10.1371/journal.pone.0028194.
Minelli, A., Boxshall, G., and Fusco, G. (Eds.). (2013). Arthropod Biology and Evolution. Berlin,
Heidelberg: Springer-Verlag.
Moreau, J., Bertin, A., Caubet, Y., and Rigaud, T. (2001). Sexual selection in an isopod with
Wolbachia-induced sex reversal: males prefer real females. Journal of Evolutionary
Biology, 14, 388–394.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7), 621–628.
doi:10.1038/NMETH.1226.
Negri, I. (2012). Wolbachia as an “infectious” extrinsic factor manipulating host signaling
pathways. Frontiers in Endocrinology, 2, 115. doi:10.3389/fendo.2011.00115.
Negri, I., Franchini, A., Gonella, E., Daffonchio, D., Mazzoglio, P. J., Mandrioli, M., et al. (2009).
Unravelling the Wolbachia evolutionary role: the reprogramming of the host genomic
imprinting. Proceedings of the Royal Society B: Biological Sciences, 276(1666), 2485–2491.
doi:10.1098/rspb.2009.0324.
Negri, I., Franchini, A., Mandrioli, M., Mazzoglio, P. J., and Alma, A. (2008). The gonads of
Zyginidia pullula males feminized by Wolbachia pipientis. Bulletin of Insectology, 61(1),
213–214.
Negri, I., Mazzoglio, P. J., Franchini, A., Mandrioli, M., and Alma, A. (2009). Male or female? The
epigenetic conflict between a feminizing bacterium and its insect host. Communicative and
Integrative Biology, 2(6), 1–2.
Negri, I., Pellecchia, M., Grève, P., Daffonchio, D., Bandi, C., and Alma, A. (2010). Sex and
Stripping: The key to the intimate relationship between Wolbachia and host?
Communicative and Integrative Biology, 3(2), 1–6.
Negri, I., Pellecchia, M., Mazzoglio, P. J., Patetta, A., and Alma, A. (2006). Feminizing Wolbachia
in Zyginidia pullula (Insecta, Hemiptera), a leafhopper with an XX/X0 sex-determination
system. Proceedings of the Royal Society B: Biological Sciences, 273(1599), 2409–16.
doi:10.1098/rspb.2006.3592.
Noda, H., Miyoshi, T., and Koizumi, Y. (2002). In vitro cultivation of Wolbachia in insect and
mammalian cell lines. In Vitro Cellular and Developmental Biology-Animal, 38(7), 423–427.
Pannebakker, B. A., Pijnacker, L. P., Zwaan, B. J., and Beukeboom, L. W. (2004). Cytology of
Wolbachia-induced parthenogenesis in Leptopilina clavipes (Hymenoptera: Figitidae).
Genome, 47(2), 299–303. doi:10.1139/g03-137.
Rasgon, J. L., Gamston, C. E., and Ren, X. (2006). Survival of Wolbachia pipientis in cell-free
medium. Applied and Environmental Microbiology, 72(11), 6934–7.
doi:10.1128/AEM.01673-06.
Rigaud, T., and Moreau, J. (2004). A cost of Wolbachia-induced sex reversal and female-biased
sex ratios: decrease in female fertility after sperm depletion in a terrestrial isopod.
Proceedings of the Royal Society B: Biological Sciences, 271(1551), 1941–6.
doi:10.1098/rspb.2004.2804.
Riparbelli, M. G., Giordano, R., Ueyama, M., and Callaini, G. (2012). Wolbachia-mediated male
killing is associated with defective chromatin remodeling. PloS One, 7(1), e30045.
doi:10.1371/journal.pone.0030045.
Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–40.
doi:10.1093/bioinformatics/btp616.
Robinson, M. D., and Oshlack, A. (2010). A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biology, 11(3), R25.
doi:10.1186/gb-2010-11-3-r25.
Sánchez, L. (2008). Sex-determining mechanisms in insects. The International Journal of
Developmental Biology, 52(7), 837–56. doi:10.1387/ijdb.072396ls.
Schulenburg, J. H., Hurst, G. D., Huigens, T. M., van Meer, M. M., Jiggins, F. M., and Majerus, M.
E. (2000). Molecular evolution and phylogenetic utility of Wolbachia ftsZ and wsp gene
sequences with special reference to the origin of male-killing. Molecular Biology and
Evolution, 17(4), 584–600.
Serbus, L. R., Casper-Lindley, C., Landmann, F., and Sullivan, W. (2008). The genetics and cell
biology of Wolbachia-host interactions. Annual Review of Genetics, 42, 683–707.
doi:10.1146/annurev.genet.41.110306.130354.
Sugimoto, T. N., and Ishikawa, Y. (2012). A male-killing Wolbachia carries a feminizing factor
and is associated with degradation of the sex-determining system of its host. Biology
Letters, 8(3), 412–5. doi:10.1098/rsbl.2011.1114.
Sze, S-H., Dunham, J. P., Carey, B., Chang, P. L., Li, F., Edman, R. M., et al. (2012). A de novo
transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted
alternative splices, single nucleotide polymorphisms and transcript expression estimates.
Insect Molecular Biology, 21(2), 205–21. doi:10.1111/j.1365-2583.2011.01127.x.
Taylor, D. R. (1990). Evolutionary consequences of cytoplasmic sex ratio distorters. Evolutionary
Ecology, 4(3), 235–248.
The International Aphid Genomics Consortium (2010). Genome sequence of the pea aphid
Acyrthosiphon pisum. PLoS Biology, 8(2), e1000313. doi:10.1371/journal.pbio.1000313.
Turelli, M. (1994). Evolution of Incompatibility-Inducing Microbes and Their Hosts. Evolution,
48(5), 1500–1513.
Verhulst, E. C., van de Zande, L., and Beukeboom, L. W. (2010). Insect sex determination: it all
evolves around transformer. Current Opinion in Genetics and Development, 20(4), 376–83.
doi:10.1016/j.gde.2010.05.001.
Weeks, A. R., and Breeuwer, J. A. J. (2001). Wolbachia-induced parthenogenesis in a genus of
phytophagous mites. Proceedings of the Royal Society B: Biological Sciences, 268(1482),
2245–51. doi:10.1098/rspb.2001.1797.
Werren, J. H., Baldo, L., and Clark, M. E. (2008). Wolbachia: master manipulators of
invertebrate biology. Nature Reviews Microbiology, 6(10), 741–51.
doi:10.1038/nrmicro1969.
Werren, J. H., Zhang, W., and Guo, L. R. (1995). Evolution and phylogeny of Wolbachia:
reproductive parasites of arthropods. Proceedings of the Royal Society B: Biological
Sciences, 261(1360), 55–63. doi:10.1098/rspb.1995.0117.
Woolfit, M., Iturbe-Ormaetxe, I., McGraw, E. A., and O’Neill, S. L. (2009). An ancient horizontal
gene transfer between mosquito and the endosymbiotic bacterium Wolbachia pipientis.
Molecular Biology and Evolution, 26(2), 367–74. doi:10.1093/molbev/msn253.
Wu, M., Sun, L. V., Vamathevan, J., Riegler, M., Deboy, R., Brownlie, J. C., et al. (2004).
Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined
genome overrun by mobile genetic elements. PLoS Biology, 2(3), E69.
doi:10.1371/journal.pbio.0020069.
Xi, Z., Gavotte, L., Xie, Y., and Dobson, S. L. (2008). Genome-wide analysis of the interaction
between the endosymbiotic bacterium Wolbachia and its Drosophila host. BMC Genomics,
9, 1. doi:10.1186/1471-2164-9-1.
Ye, J., Fang, L., Zheng, H., Zhang, Y., Chen, J., Zhang, Z., et al. (2006). WEGO: a web tool for
plotting GO annotations. Nucleic Acids Research, 34(Web Server issue), W293–7.
doi:10.1093/nar/gkl031.
Zeh, J. A., Bonilla, M. M., Adrian, A. J., Mesfin, S., and Zeh, D. W. (2012). From father to son:
transgenerational effect of tetracycline on sperm viability. Scientific Reports, 2, 375.
doi:10.1038/srep00375.
Zerbino, D. R., and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using
de Bruijn graphs. Genome Research, 18(5), 821–9. doi:10.1101/gr.074492.107.
Chapter 2. Evolutionary genomics of the
house mosquito Culex pipiens: Population
structure and recent adaptive evolution
Summary
We present the first genome-wide study of recent evolution in the Culex pipiens species complex,
focusing on the genomic extent, functional targets and likely causes of global and local
adaptations. We resequenced pooled samples of six populations of C. pipiens and two
populations of the outgroup C. torrentium. We used Principal Component Analysis to
systematically study differential natural selection across populations and developed a
phylogenetic scanning method to analyze admixture without haplotype data. We found
evidence for the prominent role of geographical distribution in shaping population structure
and specifying patterns of genomic selection. Multiple adaptive events, involving genes
implicated in autogeny, diapause and insecticide resistance, were limited to specific
populations. We estimate that about 5-20% of the genes (including several histone genes) and
almost half of the annotated pathways were undergoing selective sweeps in each population.
The high occurrence of sweeps in nongenic regions and in chromatin-remodeling genes
indicated the adaptive importance of gene expression changes. We hypothesize that global
adaptive processes in the Culex pipiens complex are potentially associated with South to North
range expansion, requiring adjustments in chromatin conformation. The strong local signature of
adaptation and the emergence of hybrid bridge vectors necessitate genomic assessment of
populations before specifying control agents.
Key words: Culex pipiens, selective sweeps, histones, principal component analysis, population
structure, differential selection
2.1. Introduction
Arms races between antagonistic species have been of longstanding interest to evolutionary
biologists [1,2]. Humans have developed unique ways of fighting against competitors, parasites
and their vectors through the use of scientific and technological innovations such as application
of antibiotics and pesticides, living in artificially-designed cities with high hygienic standards and
avoidance of high disease-transmission-risk behaviors. The race is, however, far from over, as
emphasized by the emergence of new virulent pathogenic strains, evolution of insecticide-
resistant vectors and the adaptations of several pathogenic or competitor species for living in
cities. Notably, any such adaptation takes place locally at first, even if it does spread to gain
global significance eventually.
Here, we present the results of the first genome-wide analysis of population structure and
recent natural selection in the Culex pipiens complex, members of which are notorious vectors
of West Nile Virus, St. Louis Encephalitis Virus and filarial worms in the US and world-wide
[3–5]. This complex consists of Culex quinquefasciatus (the southern house mosquito) and C.
pipiens (the northern house mosquito). Two biological forms (biotypes) C. pipiens f. pipiens and
C. pipiens f. molestus have been described within the C. pipiens species based on physiological
and ecological differences including the choice of host species, seasonal activity, mating
behavior and preferred habitat [3,6,7]. Hybridization between these two forms and also with C.
quinquefasciatus has been reported in certain areas leading to the rise of bridge vectors
transmitting pathogens between birds and humans [3,6,8–10].
We studied six C. pipiens populations and two populations of the closely related species C.
torrentium as an outgroup living within or close to human-inhabited areas in Europe and North
America (Moscow and Aleksin, Russia and Sacramento, California, US), aiming to investigate
two fundamental population genetic aspects of C. pipiens: population structure and natural
selection. On population structure, we asked: (a) Does geography, habitat type or biological
form mainly determine the organization of genetic variation in Culex? (b) Do genomic data
support genetic isolation and imminent speciation of the pipiens and molestus forms, or, on the
contrary, do we detect considerable admixture between them? On the matter of natural selection
we asked: (a) Which genes and biological functions are the targets of recent selective sweeps?
(b) To what degree do protein sequence alterations and gene expression changes contribute to
adaptation? (c) What factors are the likely causes of recent sweeps? (d) Are recent adaptations
happening congruently or otherwise in different populations?
2.2. Methods
2.2.1. Mosquito samples
Mosquito samples were taken from urban and suburban areas in Sacramento (California, US),
Moscow and Aleksin (Central Russia) (Table 2.1).
Table 2.1. Samples used in this study and their average genome-wide variability statistics.
Sample  Location    Habitat             Taxonomical identification                    # Pooled individuals  Average π*  Average θ*
A1      Aleksin     Urban               C. pipiens f. molestus                        224                   0.01937     0.02030
A4      Aleksin     Suburban            C. pipiens f. pipiens                         132                   0.02403     0.02531
M1      Moscow      Urban               C. pipiens f. molestus                        26                    0.01821     0.01905
M2**    Moscow      Suburban            C. torrentium                                 28                    0.01933     0.02070
M4**    Moscow      Suburban            C. torrentium                                 195                   0.01740     0.01820
S1      Sacramento  Urban (males)       C. pipiens f. molestus                        15                    0.02291     0.02354
S2**    Sacramento  Suburban (males)    C. pipiens, mixed molestus and pipiens forms  13                    0.02347     0.02438
S3**    Sacramento  Suburban (females)  C. pipiens, mixed molestus and pipiens forms  64                    0.02276     0.02365
* Average of 10kb sliding windows. Only positions covered 4-40X were included.
** The two Moscow suburban samples and the two Sacramento suburban samples were each
caught independently at different sites, and represent different populations.
2.2.2. Sequencing and mapping to the reference
Genomic DNA was extracted from the pool of mosquitoes collected from each of the eight
populations, prepared into separate libraries, and sequenced as paired-end 101bp reads on an
Illumina HiSeq machine. Sequenced reads were aligned as pairs using BWA 0.5.7 [11] to the
complete Culex quinquefasciatus draft genome downloaded from the Broad Institute. Up to 12
mismatches were allowed per 101bp read end, and reads that did not map uniquely to the genome
were filtered out. All other BWA alignment parameters were set to default values.
2.2.3. Population genetic analyses
The reads mapping to the mitochondrial cytochrome oxidase subunit I gene (COI) and to the
CQ11 microsatellite locus were used to ascertain species and biotype identities of the
populations, respectively [12,13].
Fst was calculated for 10kb sliding windows between each pair of populations according to the
methods in [14,15] and averaged across the genome. Maximum likelihood phylogenetic trees
were constructed from non-overlapping 10kb windows using RAxML [16]. A custom Python script
was used to calculate the percentage of trees in which each pair of populations were nearest
neighbors. Principal component analysis (PCA) of allele frequencies was done on
biallelic positions with coverage 4-40X.
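As a rough illustration of the window-based differentiation scan, the sketch below computes a heterozygosity-based Fst for one window from per-site allele frequencies in two populations and then averages over windows. The estimator (one minus the ratio of mean within-population to mean total heterozygosity) and the function names are assumptions made for illustration; they are not necessarily the estimators of [14,15].

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def window_fst(freqs_a, freqs_b):
    """Illustrative Fst for one window: 1 - mean(H_S) / mean(H_T).

    freqs_a, freqs_b: allele frequencies of the same allele at the same
    sites in populations A and B (sites already filtered for coverage).
    """
    h_within = 0.0
    h_total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        h_within += 0.5 * (heterozygosity(pa) + heterozygosity(pb))
        h_total += heterozygosity(0.5 * (pa + pb))
    if h_total == 0.0:
        return None  # no polymorphic site in this window
    return 1.0 - h_within / h_total


def genome_average_fst(windows):
    """Average per-window Fst, skipping windows without polymorphism."""
    values = [window_fst(a, b) for a, b in windows]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

Fixed differences between the two populations give a window Fst of 1, and identical allele frequencies give 0.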
The software package Popoolation v1.2.2 was used to estimate measures of variation (π and θ)
and Tajima’s D from the pooled sequence data [17]. Only positions with coverage 4-40X were
used and the minimal legitimate count for the minor allele was set to 2. Synonymous (syn) and
nonsynonymous (nsyn) polymorphisms were assigned using the same software and the .gff file
downloaded from the Broad Institute website.
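For orientation, the two diversity measures can be written in their classical form for a sample of n chromosomes. The sketch below is only that textbook version; Popoolation applies additional corrections for pooled sequencing and for the minimum minor-allele count, which are not reproduced here, and the function names are hypothetical.

```python
def site_pi(minor_count, n):
    """Pairwise diversity contributed by one biallelic site among n chromosomes."""
    return 2.0 * minor_count * (n - minor_count) / (n * (n - 1))


def harmonic_number(n):
    """a_n = sum of 1/i for i = 1..n-1, the normalising constant in Watterson's theta."""
    return sum(1.0 / i for i in range(1, n))


def window_pi_theta(minor_counts, n, window_length):
    """Per-site pi and Watterson's theta for one window.

    minor_counts: minor-allele counts at the segregating sites of the window.
    window_length: number of sites in the window that passed the filters.
    """
    pi = sum(site_pi(k, n) for k in minor_counts) / window_length
    theta = len(minor_counts) / harmonic_number(n) / window_length
    return pi, theta
```

Tajima's D is built on the difference π − θ, so a window dominated by rare alleles (π below θ) yields a negative value.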
To detect selective sweeps, we first obtained the allele frequency spectrum (AFS) from the
whole genome as the neutral background, and then tried to identify a certain form of skewness
in AFS of linked sites, typically associated with selective sweeps [18,19]. The approach in [18,19]
has been modified to apply to pooled sequence data [20] and incorporated into the software
package Pool-hmm [21]. We ran Pool-hmm in two steps. First, AFS was built based on the
whole genome for each sample with coverage 4-40X, theta=0.02 (based on the Popoolation
output, see results) and sampling ratio of 20 (5% of positions were used for estimation of AFS).
Second, sweep regions were detected separately for each supercontig with the same coverage
range as above and transition probability of k=1e-6 based on the AFS created in the previous
step. PCA of Pool-hmm sweep scores was done to compare the broad patterns of genomic
selection among the studied populations. Each gene was treated as an observation point and
each sample label as an initial variable. We used linear regression to check for potential biases
introduced by sequencing coverage variation in calculation of Tajima’s D and Pool-hmm scores.
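The coverage check can be sketched as a simple linear regression of a per-gene statistic on per-gene coverage, reporting the fraction of variance explained (R²). The analysis itself was run in SAS and JMP; the function below is a minimal stand-in with hypothetical names and inputs.

```python
def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("x and y must both vary")
    return sxy * sxy / (sxx * syy)
```

Here `x` would hold the mean coverage of each gene and `y` its Tajima's D or Pool-hmm score; a small R² indicates that coverage explains little of the variation in the statistic.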
To understand the nature of mutations associated with the sweep events, we did a case study
focusing on a 137kb block (C. quinquefasciatus genome supercontig 3.392: 626-137726)
consisting exclusively of 80 histone genes including multiple paralogs of each histone type (H1,
H2A, H2B, H3 and H4). The large number of polymorphic sites in H1 genes allowed statistical
analysis of the association of different types of amino acid changes with certain structural
features of the protein. For the polymorphic positions, we cross-examined three structural attributes
(domain, secondary structure and solvent accessibility) with three biochemical aspects
(addition or removal of proline, the charge difference between the two amino acids, and
addition or removal of serine or threonine) by independence tests (χ² and Fisher’s exact).
Electrostatic interactions are important in forming the 3D structure of the protein, as well as its
DNA-binding function. We included the Ser/Thr category because these residues are targets of
phosphorylation, which is the best studied epigenetic modification of H1 [22], although several
other types of modification have also been reported [23]. We included Pro because, apart from its
structural peculiarities, we found a large excess of conversions from other amino acids in the
reference genome to Pro at polymorphic positions (Suppl. 2.1).
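To make the independence tests concrete, here is a minimal two-sided Fisher's exact test for a 2x2 table, for example exposed vs. buried residues against Ser/Thr-changing vs. other substitutions. The table layout and function name are illustrative assumptions; the actual tests were run in SAS/JMP.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1.0 + 1e-9)  # tolerance for float ties
    )
```

The exact test is the safer choice when some cells of the table are small, which is common when a structural category such as buried residues contains few polymorphic sites.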
We examined the functional significance of the sweep genes using three different approaches:
Gene Ontology (GO) enrichment analysis, KEGG pathway annotations, and checking the sweep
status of genes with experimentally verified functions reported in the literature. Gene Ontology
(GO) enrichment analysis on targets of selection was performed using the online software
GOEAST [24]. GO annotations for Culex genes were downloaded from vectorbase.org/biomart.
The complete annotation file was used as the default background set. For each population two
enrichment tests were run with different selected gene sets: 1) 200 genes (~1% of the total
number of genes in the genome) with highest Pool-hmm scores, and 2) 200 genes with lowest
Tajima’s D values. Enrichments with FDR<0.1 were considered significant. An ANOVA test was
done to make sure that gene length did not bias GO enrichment results (Suppl. 2.1). The
annotated pathways for C. quinquefasciatus were downloaded from the KEGG Pathway
database [25]. For each pathway, maximum and average sweep scores were determined and
number of genes with sweep score>4 was counted. For each gene, the number of pathways it
functioned in was counted as a proxy for multifunctionality (related to the concept of
pleiotropy).
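The pathway summaries described above can be sketched as follows. The gene and pathway identifiers are hypothetical, and treating genes without a detected sweep as having score 0 is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter


def summarise_pathways(pathway_genes, sweep_scores, cutoff=4.0):
    """Per pathway: maximum score, mean score, and number of genes with score > cutoff.

    pathway_genes: dict mapping pathway ID -> list of gene IDs.
    sweep_scores: dict mapping gene ID -> sweep score; missing genes count as 0
    (an assumption of this sketch).
    """
    summary = {}
    for pathway, genes in pathway_genes.items():
        if not genes:
            continue
        scores = [sweep_scores.get(gene, 0.0) for gene in genes]
        summary[pathway] = {
            "max": max(scores),
            "mean": sum(scores) / len(scores),
            "n_above_cutoff": sum(1 for s in scores if s > cutoff),
        }
    return summary


def pathways_per_gene(pathway_genes):
    """Number of pathways each gene belongs to (proxy for multifunctionality)."""
    counts = Counter()
    for genes in pathway_genes.values():
        counts.update(set(genes))
    return counts
```

A pathway counts as affected by selection in a population when its `n_above_cutoff` is at least 1.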
2.2.4. Statistical procedures
All statistical analyses including calculation of descriptive statistics, correlations, independence
tests, regressions, ANOVA and principal component analysis (PCA) were done using SAS v9.3
and SAS JMP Pro 10.0.0.
More details on the methods can be found in the Suppl. 2.1.
2.3. Results
2.3.1. Diversity depends on biotype but population structure is shaped by geography
Whole-genome Illumina resequencing of the eight individually pooled populations resulted in
42X total coverage of the C. quinquefasciatus draft genome, identifying 6.7M segregating sites
among these populations. With more than 461Mb covered by Illumina sequences, this equated
to roughly one segregating site every 69 nucleotides. Table 2.1 shows average π and θ for
10kb sliding windows across the genomes of the eight samples. The divergence of our C. pipiens
populations from the C. quinquefasciatus reference genome ranged from 0.6% to 1.8%. As expected, C.
torrentium populations were more divergent (2.8-3.3%). In accordance with this, about 60-75%
and 41-46% of total Illumina reads from C. pipiens and C. torrentium samples mapped onto the
reference genome, respectively. The average sequencing depth across the whole covered
segments of the genomes was 2-8X for our samples (the lowest ones belonging to C.
torrentium). However, we only included the positions with coverages 4-40X in estimation of
diversity and selection metrics. The average coverage across those positions in genic regions
(whose output was fed into the PCA) ranged from 8X to 19X. Only 7.3% and 5.4% of the variance in the
observed values of Tajima’s D and Pool-hmm score for genes, respectively, could be explained by
coverage differences (linear regression R², p<0.0001). PCA on gene coverages did not
produce any patterns similar to the geography-dependent clustering observed for the selection
metrics. Thus, coverage did not seem to bias the calculation of population parameters
significantly.
Differentiation among populations depended on geographical distance, as demonstrated by 10kb
sliding window genomic scans of Fst and phylogenetic tree structure (Table 2.2). As expected, the
largest distances belonged to C. torrentium vs. C. pipiens comparisons. Within the C. pipiens
complex, population structure corresponded to geographical proximity for both Russian and
American samples. Neither Fst nor phylogenies suggested clustering based on habitat type
(urban vs. suburban) or biological form (molestus vs. pipiens). PCA of allele frequencies
mirrored this image (Suppl. 2.2). The reference sequence (C. quinquefasciatus) clustered closely
with two C. pipiens samples only: A1 and S1 (Table 2.2). The shared genomic regions
contributing to this closeness, therefore, seem likely to have originated from recent local
admixture rather than ancestral shared polymorphisms between C. pipiens and C.
quinquefasciatus.
Table 2.2. Population structure in the eight samples demonstrated through pairwise Fst values
and phylogenetic frequency of neighborhood. The lower half of the table reports the average
Fst of 10kb sliding windows in pairwise comparisons. The upper half shows the percentage of
phylogenetic trees based on 10kb windows in which each pair of populations are nearest neighbors.
A1 A4 M1 M2 M4 S1 S2 S3 Ref
A1 - 15.60 12.16 0.64 0.80 7.68 7.83 8.54 8.07
A4 0.166 - 12.90 2.24 1.50 5.19 6.05 6.66 3.90
M1 0.211 0.211 - 1.32 1.10 8.66 10.59 13.05 3.41
M2 0.487 0.457 0.497 - 78.23 0.58 0.77 0.58 2.55
M4 0.502 0.461 0.499 0.144 - 1.14 0.71 0.58 4.97
S1 0.193 0.219 0.241 0.451 0.460 - 15.29 15.72 9.18
S2 0.176 0.201 0.228 0.430 0.427 0.143 - 18.61 5.86
S3 0.160 0.187 0.217 0.414 0.402 0.143 0.138 - 3.56
2.3.2. Positive selection acts on noncoding and coding regions with nonsynonymous
mutations playing an important role
About 50-65% of the regions targeted by Pool-hmm with high confidence scores in C. pipiens
populations coincided with coding sequences of annotated genes. Scanning the genome in 10kb
sliding windows, we found that the windows containing genic sequences were generally more
likely to overlap with a sweep region (Odds Ratio=1.19, independence χ² test, p<0.0001). In all
of the C. pipiens samples, the total number of coding sequence polymorphisms per gene, as
well as the number of either syn or nsyn polymorphisms was smaller in sweep regions
compared to the rest of the genome, compatible with the purported reduced variation around
the sweep site (Table 2.3). On the other hand, the ratio of nsyn/syn sites was always higher in
the sweep regions. In C. torrentium samples, the trends were not exactly similar. The higher
nsyn/syn ratio was still true; however, the correlation of each type of polymorphism with
sweep status was either nonexistent or slightly positive.
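The odds ratio and χ² independence test for the window-level comparison of genic content and sweep overlap can be sketched as below. The counts in the usage note are hypothetical (the underlying window counts are not given in the text), and the function names are illustrative.

```python
from math import erfc, sqrt


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]].

    a: genic windows overlapping a sweep      b: genic windows without a sweep
    c: nongenic windows overlapping a sweep   d: nongenic windows without a sweep
    """
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction) and p-value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value
```

For instance, hypothetical counts of 119, 100, 100 and 100 give an odds ratio of 1.19. An odds ratio that modest can still be highly significant when the number of windows is in the tens of thousands.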
Table 2.3. Coding sequence polymorphisms within and outside sweep regions. Sweep status 0: gene
resides in a region not detected by Pool-hmm, or detected with a score <4; sweep status 1: gene
resides in a region detected by Pool-hmm with a score ≥4; N: number of genes; Total: genewise
average of all polymorphisms in the cds; Syn: genewise average of synonymous polymorphisms;
Nsyn: genewise average of nonsynonymous polymorphisms; values in the correlations columns
represent Spearman partial correlation coefficients controlled for gene length; N.S.: correlation not
significant at p=0.05; figures in parentheses: p-value of the correlation (p-value<0.0001 where not
stated). Repeating the analysis with sweep scores >8 and >2 as the cutoff yielded very similar results
(not shown). In calculation of the ratio, 0.5 was added to both Syn and Nsyn counts to avoid division
by zero.
        Genes with sweep status 0    Genes with sweep status 1    Correlation with sweep status 1
Sample  N      Total  Syn    Nsyn    N     Total  Syn    Nsyn     Total    Syn              Nsyn             Nsyn/Syn ratio
A1      19120  43.52  30.00  13.52   1186  17.50  7.94   9.56     -0.1574  -0.1985          -0.0743          0.2147
A4      19129  40.91  29.27  11.64   1177  22.51  11.81  10.70    -0.1210  -0.1634          -0.0263          0.1894
M1      16114  44.54  27.42  17.12   4192  29.90  14.93  14.97    -0.1896  -0.2407          -0.0990          0.2441
M2      18376  26.48  15.03  11.44   1930  34.75  18.72  16.03    0.0330   0.0157 (0.0252)  0.0563           0.0622
M4      19260  9.70   5.77   3.93    1046  10.23  5.81   4.42     N.S.     N.S.             0.0256 (0.0003)  0.0202 (0.0040)
S1      17713  39.48  28.62  10.87   2593  19.95  11.06  8.89     -0.2221  -0.2748          -0.0934          0.2655
S2      18055  62.80  42.74  20.06   2251  30.11  16.29  13.81    -0.2390  -0.2828          -0.1336          0.2647
S3      17510  61.82  40.12  21.70   2796  30.39  16.45  13.94    -0.2766  -0.3107          -0.1852          0.2478
A summary of Tajima’s D values and Pool-hmm scores can be found in Suppl. 2.3.
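The pseudocount mentioned in the caption of Table 2.3 amounts to the following per-gene calculation. This is a minimal sketch with a hypothetical function name.

```python
def nsyn_syn_ratio(nsyn, syn, pseudocount=0.5):
    """Per-gene Nsyn/Syn ratio with a pseudocount added to both counts.

    The pseudocount keeps the ratio defined for genes with no synonymous
    polymorphisms.
    """
    return (nsyn + pseudocount) / (syn + pseudocount)
```

A gene with no polymorphisms of either type gets a ratio of 1, and a gene with three nonsynonymous and no synonymous polymorphisms gets 3.5/0.5 = 7. Because the ratio is computed gene by gene, it cannot be recovered from the genewise averages listed in the table.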
2.3.3. Positive selection acts on a wide variety of biological functions in the Culex genome
from chromatin organization to insecticide resistance
Table 2.3 shows the number of genes detected as sweep targets in each population. Based on
both Pool-hmm and Tajima’s D data, the most commonly enriched GO terms were related to
chromatin and nucleosome structure and modification (Suppl. 2.4). The list of genes
contributing to these terms included histones and chromatin remodeling factors (not shown).
This was surprising given the well-known conservation of histone sequences, and motivated us
to investigate the likely causes of selective sweeps in histones from a structure-function
perspective in more detail (sections 2.3.5 and 2.4.5). Gene length proved not to be a significant
confounder in the GO enrichment analysis (Suppl. 2.4).
Examination of the Pool-hmm results in the context of the functional pathways annotated for C.
quinquefasciatus in the KEGG database demonstrated two important points: First, in every one
of the eight populations more than half of the 129 pathways were affected by positive selection
as they contained at least one gene with a sweep score>4. These pathways encompassed a
large variety of functions including but not limited to amino acid biosynthesis, glycosphingolipid
biosynthesis, signaling pathways (Notch, Jak-STAT and MAPK) and dorso-ventral axis formation.
Second, the number of pathways a gene functioned in was not correlated with the number of
populations it was selected in or the strength of selection when it happened.
Finally, we compiled a list of genes that have been shown by gene expression or mutant
phenotyping studies to affect specific life history traits of Culex (such as diapause, autogeny and
mating behavior), confer insecticide resistance or facilitate adaptation to temperature
fluctuations. Histones and chromatin remodeling factors, ribosomal proteins, members of the
P450 family, chaperonins and heat-shock proteins, vitellogenins and vitellogenin convertase,
cadherins, superoxide dismutases and salivary proteins were noticeable genes with
experimentally verified functional roles that were undergoing sweeps in multiple populations
(Suppl. 2.4, p. 10).
2.3.4. Many specific adaptations in the C. pipiens genome happen locally
For both Pool-hmm and Tajima’s D, we performed PCA once with the 6 C. pipiens samples only,
and a second time with all 8 Culex samples included. The first Principal Component (PC) was
always highly correlated with all of the sample labels and did not separate the samples from
each other decisively (Fig. 2.1a,c). In the case of the 6 C. pipiens samples, the second and third
PCs demonstrated the local nature of adaptation in the most conspicuous way, producing three
distinct clusters containing the samples from Moscow, Aleksin and Sacramento (Fig. 2.1b).
Including all 8 samples, the combination of the second and the third components produced four
clusters: one for the C. torrentium samples and one for each of the three C. pipiens locations (Fig.
2.1d). The results of PCA on Tajima’s D values were very similar (not shown).
2.3.5. Evolution of the conserved: parallel adaptation of histones in C. pipiens and C.
torrentium
The 137kb histone block we examined closely showed reasonably high sweep scores in all of
our eight populations. Compared with the non-sweep portions of the genome, this block had
lower polymorphism but an increased nsyn/syn ratio (Suppl. 2.5), a trend similar to the one
demonstrated in Table 2.3 for sweep genes in general. Most of the nsyn polymorphisms
occurred at positions where there was no variation among paralogs in the reference, indicating
they were bona fide polymorphic sites as opposed to artifacts of mis-mapping onto paralogs
(data not shown).
Figure 2.1. The first three principal components of Pool-hmm scores of genes from C. pipiens samples (a, b) or C.
pipiens and C. torrentium together (c, d).
We found the linker histone (H1) to be the most polymorphic among all histone genes
consistent with the fact that it is the least conserved among histones (Suppl. 2.5). The basic
structure of histone H1 consists of three main domains: a lysine-rich C-terminal domain which
binds to linker DNA (the C domain), a central globular domain with a winged helix motif which
binds to the nucleosomal DNA (the G domain), and an N-terminal domain whose function is not
very well understood (the N domain) [26]. Generally, the globular domain is the most
evolutionarily conserved (across taxa and among paralogs) and the N-terminal domain is the
most variable [27].
Examination of polymorphic sites in H1 genes confirmed our expectations based on the known
evolutionary patterns. The G and N domains showed the lowest and highest propensity for nsyn
mutations, respectively (Suppl. 2.6, p. 2). Nsyn mutations were quite uncommon in regular
secondary structures and in buried residues (Suppl. 2.6, pp. 3-4), although these two states
tended to coincide with the globular domain, confounding the analysis. Charge-altering
mutations were also exceedingly rare in buried residues (Suppl. 2.6, p. 7). About 90% of the
changes adding or removing Ser/Thr occurred in exposed residues (Suppl. 2.6, p. 8) making
them potential targets for epigenetic modification.
Polymorphisms converting other amino acids to Pro were vastly overrepresented in the histone
block in all of our 8 samples (p<0.0001). We did not test for other amino acid conversions, so
there may well be other cases of overrepresentation or underrepresentation in the data that
we did not capture. Remarkably, the positions of Pro-permissive mutations in the conserved G
domain were much more consistent across populations compared with the N domain. Among
the Pro-permissive residues in the G domain, 78.6% showed Pro mutations in multiple
populations. In contrast, only 17.5% of Pro-permissive residues in the N domain showed Pro
mutations in more than one population (Suppl. 2.6, p. 10). This suggested that Pro mutations in
the N domain were mostly neutral or semi-neutral segregating variants in random positions
whereas at least some of the Pro mutations in the G domain happened at specific positions and
were probably favored by selection. Visual inspection of the Pro polymorphisms in G and C
domains suggested that almost all of them occurred in irregular parts that connected the
regular secondary structures or were located on the domain boundaries.
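The cross-population consistency figures quoted above (the share of Pro-permissive residues that carry a Pro polymorphism in more than one population) can be illustrated with a minimal sketch. The residue positions below are hypothetical toy values, not data from Suppl. 2.6, and the function name is an assumption made for illustration.

```python
from collections import Counter

def fraction_shared(pro_sites_by_population):
    """Fraction of Pro-permissive residues seen in more than one population.

    pro_sites_by_population maps a population label to the set of residue
    positions that carry a polymorphism to Pro in that population.
    """
    counts = Counter()
    for sites in pro_sites_by_population.values():
        counts.update(set(sites))
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)

# Toy data: hypothetical residue positions, for illustration only
g_domain = {"A1": {45, 61, 88}, "A4": {45, 61}, "M1": {45, 88}, "S1": {61, 90}}
n_domain = {"A1": {3, 7}, "A4": {12}, "M1": {7, 20}, "S1": {25}}

g_shared = fraction_shared(g_domain)
n_shared = fraction_shared(n_domain)
print(f"G domain: {g_shared:.1%} shared; N domain: {n_shared:.1%} shared")
```

A high shared fraction in one domain and a low one in another is the kind of contrast the text interprets as position-specific versus random placement of Pro polymorphisms.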
Generally, nsyn polymorphisms and specifically those involving Pro occurred at similar residues
in same-species populations, but were independently positioned when two populations of
different species were compared (Suppl. 2.7). This is consistent with the expected efficient
isolation of the C. pipiens and C. torrentium gene pools. Therefore, whatever the evolutionary force
behind the overabundance of conversions to Pro might be, it seems to be acting
independently and in parallel in C. pipiens and C. torrentium.
2.4. Discussion
2.4.1. Diversity level and population structure
The three pure biotype molestus populations (A1, M1 and S1) showed reduced variation
compared to the one pure biotype pipiens population (A4) (Table 2.1) consistent with previous
findings suggesting founder effects during the establishment of molestus populations [28–30].
The dependence of population structure on localities (Table 2.2) agrees with previous reports
on US populations [8,30,31] (but also see [32]) but contrasts with the distinct f. molestus vs. f.
pipiens dichotomy in northern and central Europe [8,29,33].
2.4.2. Mapping efficiency and coverage effects
Theoretical and computational tools are still being developed for pool-seq data analysis [34–
36]. It has been proposed that estimates of allele frequencies and population genetic
parameters can be improved with increased sequencing depth (up to 20-30X, but not above
that [34,36]) and pooling large enough numbers of individuals (about 25 diploid individuals or
more [36]). The recommendation of coverage threshold of 20-30X in [34,36] was made to
ensure faithful estimates of “single-site” allele frequencies. When diversity or selection
parameters are calculated by averaging over 10kb windows or the length of genes (which are
typically several hundred bases long), even lower coverages ought to yield satisfactory results.
Accordingly, in a study using simulated data of up to 2% divergence from reference sequence, it
has been suggested that direct estimation of population genetic parameters without SNP and
genotype calling yields reasonably good results even at low coverages (2-4X) [37]. The methods
we used for estimation of diversity and selection strength worked directly on base counts from
sequencing reads without any intermediate SNP or genotype calling steps. The coverage in our
included genic positions ranged from 8X to 19X, which lies between the two recommendations above.
Nevertheless, the relatively low sequencing depth in some regions could have resulted in false
negatives in the detection of selection targets, because positions outside our 4-40X coverage
filter were disregarded.
2.4.3. Signature of directional selection in coding sequences
It is well known that some AFS-based metrics of positive selection are sensitive to demography
or behave similarly under purifying selection and positive selection (e.g. Tajima’s D) [38]. Some
false positives might exist among the Pool-hmm hits, too, but we expect that most of
the detected sweeps are true positives. Pool-hmm identifies sweep regions based
on changes in the AFS regardless of the functional annotation of the alleles; thus, the
combination of lower levels of variation and higher ratios of nsyn/syn (Table 2.3) provides
independent support for the action of positive selection. Purifying selection would reduce total
variation but would also decrease the nsyn/syn ratio. Relaxation of selection would increase
nsyn/syn ratio but would also elevate total variation. Still, alternative scenarios can be
envisaged that would produce the kind of pattern we see in our data; for example, a severe
decline in population size (a bottleneck) could result in reduced total variation and selection
relaxation at the same time. The mathematical models underlying the Pool-hmm method
were shown through simulations and tests on real data to be relatively robust to
several types of demographic changes [18,19]. Unfortunately, in the absence of ecological
data on our Culex populations, the results of such simulations cannot be confidently extended
to them. Further work will be required to disentangle true sweep signals from potential
confounders.
A closer look at Table 2.3 reveals that the higher nsyn/syn ratio in sweep regions resulted
mainly from depletion of syn mutations. The reason is that in general most of the neutrally
segregating variation is syn; so when linked variation is removed from around a sweep site, the
reduction in syn variation will be larger. A possible explanation for more abundant nsyn
mutations in sweep genes in M2 and M4 when no or a weaker correlation is observed for syn
mutations is that a larger proportion of sweeps in C. torrentium were fostered by new
mutations (hard sweeps) whereas most of C. pipiens sweeps depended on standing variation
(soft sweeps). This scenario seems particularly likely in the case of the M4 population, which has
very low levels of standing variation (reflected by the smallest number of polymorphic sites in
non-swept genes in Table 2.3 and the smallest π and θ values in Table 2.1), providing little raw
material for positive selection. Accordingly, M4 shows the lowest number of detected sweep events
(Table 2.3). In the Moscow region, populations of C. torrentium have expanded rapidly in the past
10 years [39], indicating that the low genetic diversity within M2 and M4 may be due to a founder effect.
An excess of nsyn to syn “fixed” differences (divergence) among multiple taxa is often used as a
basis for inference of recurrent positive selection [40–42]. The sites identified by those tests are
likely to be the direct targets of selection, and emerge due to positive selection on the nsyn
mutations. In contrast, what we present in Table 2.3 is the ratio of nsyn/syn “segregating
polymorphisms” (not fixed differences) averaged over the ~20k genes in individual populations,
not for single nucleotide positions across populations or taxa. What we have shown here is that
an excess of nsyn/syn “segregating polymorphisms” concurs with selective sweeps.
2.4.4. Principal Component Analysis on selection metrics as a method of detecting differential
selection
PCA across samples on Pool-hmm scores compares the strength of recent adaptive evolution on
genes, whereas PCA on Tajima’s D captures the contrast between balancing selection (D>0) and
positive or purifying selection (D<0) against neutral evolution (D=0) [43]. In either case, the
strong correlation of PC1 with all sample labels meant that most of the genes performed
similar functions and were thus selected congruently across the tested populations. This would
be expected since our populations all belonged to the same species or closely related ones. On
the other hand, small fractions of genes were expected to underlie adaptations unique to each
specific population or groups of populations and were supposed to contribute to creation of
second, third, fourth and further principal components. In the PCA on all 8 samples, the second
component separated C. pipiens samples from C. torrentium ones indicating that the
differential selection of genes was driven primarily by species differences (Fig. 2.1c). The
portion of variance explained by the second component in the 8 sample analysis (17.47%) was
higher than that explained by any other second or third component, implying that interspecific
differences were greater than those caused by geographical isolation of conspecifics.
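The idea of running PCA across samples on per-gene selection scores can be sketched as follows. This is a minimal illustration on simulated scores, not the analysis pipeline used in this chapter: the standardization of each sample (PCA on the correlation matrix), the simulated effect sizes, and the function name are all assumptions. Six samples of one species and two of another are simulated, mirroring the sample layout described in the text.

```python
import numpy as np

def pca_on_scores(scores):
    """PCA with samples as variables and genes as observations.

    scores: array of shape (n_genes, n_samples) of per-gene selection scores.
    Returns (explained_variance_ratio, loadings) with components sorted by
    decreasing variance; loadings[:, k] holds the sample weights of PC k+1.
    """
    x = np.asarray(scores, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)  # standardize each sample
    corr = (x.T @ x) / (x.shape[0] - 1)               # sample-by-sample correlation
    eigvals, eigvecs = np.linalg.eigh(corr)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs

# Simulated toy scores (assumption: not real Pool-hmm output)
rng = np.random.default_rng(0)
n_genes = 2000
shared = rng.normal(size=(n_genes, 1))        # signal common to all samples
species = rng.normal(size=(n_genes, 1))       # signal that differs between species
sign = np.array([1, 1, 1, 1, 1, 1, -1, -1])   # six samples of one species, two of the other
scores = shared + 0.5 * species * sign + 0.3 * rng.normal(size=(n_genes, 8))

ratio, loadings = pca_on_scores(scores)
pc1, pc2 = loadings[:, 0], loadings[:, 1]
print("explained variance:", np.round(ratio[:3], 3))
print("PC1 loadings:", np.round(pc1, 2))
print("PC2 loadings:", np.round(pc2, 2))
```

In this toy setup the first component loads with the same sign on every sample (the congruent signal), while the second component takes opposite signs in the two simulated species, which is the qualitative pattern described for the 8-sample analysis.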
PC patterns of sweep scores and allele frequencies share a common feature: they are shaped
first by species identity and then by geographical distance. Migration between geographically
close populations may have contributed to the similarity of allele frequencies and
consequently, the detected targets of selection; so do the PCs of Pool-hmm merely reflect
population structure? Interestingly, the answer is no. First, we need to point out a key
difference: in contrast with sweep scores, the PCA of allele frequencies does not yield a PC1
that correlates highly with all of the population labels. The reason is that we did the PCA only on the
polymorphic positions to save on computation time. The majority of genomic positions were
fixed for the same allele across all populations and were filtered out. So, for allele frequencies
significant differentiation among populations starts with PC1. PC1 and PC2 of allele frequencies
are qualitatively comparable to PC2 and PC3 of sweep scores. Comparing PCA of allele
frequencies and Pool-hmm score shows that the order of clustering among the 6 C. pipiens
populations is different between them. PC1 of allele frequencies separates Moscow and Aleksin
from Sacramento (Suppl. 2.2). Data in Table 2.2 confirm that Moscow and Aleksin populations
have generally more similar allele frequencies than either have with Sacramento populations.
On the contrary, PC1 and particularly PC2 of sweep scores put Aleksin populations closer to
Sacramento than Moscow (Fig. 2.1). This means that within the C. pipiens species, sweep status
does not follow population structure. Besides, Fst between collocal populations is just slightly
smaller than Fst between different localities (Table 2.2); for example, Fst between A1 and A4 is
0.166, while Fst between A1 and Sacramento populations is in the range of 0.160-0.193. This
makes it unlikely that gene flow between collocal populations is strong enough to create similar AFS
in them and lead to corresponding sweep hits. Finally, it should be noted that allele
frequency PCA plots were created from biallelic positions (a fraction of polymorphic positions)
regardless of gene content; so, presumably most of them came from nongenic parts because
only ~110MB out of the ~579MB of the reference genome is genic sequence (including introns),
not to mention that polymorphism is expected to be lower in genic sequences on average. On
the other hand, Pool-hmm PCA plots used the sweep scores from genic sequences, consisting of
monomorphic as well as polymorphic sites. So the two sets of PCA plots represent two
potentially overlapping but completely different subsets of genomic positions. Without a formal
significance test, it is not possible to statistically disprove or quantitatively evaluate the
proposition that demography affects our ‘detected’ selective status. What we can say with
confidence is that Pool-hmm and Tajima’s D do not strictly follow Fst, phylogenetic
neighborhood or allele frequency PCs. In other words, the variation in neither of the former
two can be completely explained by any of the latter three (a contribution to the signal of
selection is possible, but it is never 100%).
We performed PCA on Pool-hmm scores and Tajima’s D, but it can be applied as effectively to
any other selection statistic. There are a great number of indicators of natural selection
(including those based on amino acid substitutions, length of haplotype homozygosity, etc.),
each optimized to identify certain types of selection and within certain time depths (reviewed,
for example in [44]). PCA will make it possible to use any of them comparatively across
populations or taxa to characterize the patterns of differential selection.
From an epidemiological perspective, the dependence of population structure on geographical
distribution and strong local signature of adaptations suggest that vector control schemes
should be informed by population-specific data rather than presumed global properties of the
Culex species complex. For instance, rapid evolution of many insecticide resistance genes
indicates that the efficacy of insecticides on each Culex population will have to be tested
frequently and on specimens from the same locality, sampled at as fine a spatial scale as possible.
2.4.5. The special case of histones
Histones are known to be among the most evolutionarily conserved genes, although they have
been reported to have responded to recent directional selection [45,46]. However, the
selective pressures that drive their evolution are not well understood.
Analysis of the biochemical vs. structural properties of amino acid residues at polymorphic sites
suggested that Pro mutations in the G and C domains of H1 probably acted to modify the
orientation of regular structures with respect to each other in space or adjust the
rigidity/flexibility of the existing structures without disrupting the basic fold of the protein or its
function.
Because seasonal variation in temperature is more substantial in temperate climates compared
to tropical or subtropical zones, subfunctionalization or neofunctionalization of duplicated
genes [47] to accommodate these new environmental conditions seems like a possible scenario
for Culex histone evolution. This scenario is corroborated by the heterogeneous distribution of
polymorphic sites among paralogs in all of the studied populations (χ² test of independence,
p<0.0001; caution: the test statistic might have been inflated due to the small number of
expected polymorphisms at some loci).
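The test of independence mentioned above, together with the caveat about small expected counts, can be sketched in a few lines. The table below uses hypothetical counts of polymorphic and monomorphic sites per paralog, not the counts from this study, and the function name is an assumption. A common rule of thumb treats the chi-square approximation as unreliable when some expected counts fall below about 5.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic for an r x c table of counts.

    Returns (statistic, degrees_of_freedom, smallest_expected_count).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            min_expected = min(min_expected, expected)
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof, min_expected

# Hypothetical counts: [polymorphic sites, monomorphic sites] for three paralogs
table = [[30, 570], [4, 596], [12, 588]]
stat, dof, min_expected = chi_square_independence(table)
print(f"chi-square = {stat:.2f}, dof = {dof}, smallest expected count = {min_expected:.2f}")
if min_expected < 5:
    print("warning: small expected counts; the statistic may be inflated")
```

Reporting the smallest expected count alongside the statistic makes it easy to see when the caution stated in the text applies.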
2.4.6. The marks of south to north range expansion
C. pipiens originated in North Africa and then spread out to colonize other parts of the world
[48]. We found certain sweep events that might have helped them adapt to the new
environmental conditions (Suppl. 2.4).
Heat shock proteins and chaperonins are crucial for proper protein folding in the cells and also
contribute to adaptation to living at high and low temperatures [49,50]. Interestingly,
expression of a specific chaperonin component has been reported to be crucial for cold
resistance in diapausing members of the onion maggot Delia antiqua [51,52]. Positive selection
on chaperonins may be a general response to colder climate or bigger seasonal fluctuations in
temperature; however, the significantly higher number of sweep genes in C. pipiens compared
to C. torrentium makes it more likely to be a specialized adaptation to winter diapause in the
colder habitats. Biotype molestus mosquitoes do not undergo diapause during winter, but they
have branched off from the pipiens form very recently, so it should not be surprising that they
still bear the genomic signatures of diapause-related adaptations.
Chromatin-related factors including histones showed signals of strong sweep in all of the tested
populations, and constituted the most significantly enriched GO terms. As major modulators of
gene expression, they are known to contribute substantially to adaptation to new
environmental challenges [53,54]. Positive selection in several chromatin remodeling genes has
been associated with range expansion from tropical to temperate environments in Drosophila
[55,56]; it is then plausible to suggest that these genes may have played an equally important
adaptive role during the spread of Culex from tropical North Africa to temperate and cold
habitats. Interestingly, many regulatory sequences and unannotated sequences have been
reported to be highly differentiated between tropical and temperate Drosophila populations,
presumably contributing to adaptation to new environmental conditions [56]. Sweeps in
histones and chromatin modifiers along with the large proportion of sweeps occurring in
noncoding regions emphasize the significance of gene expression regulation as a mechanism of
adaptive evolution in Culex.
Data accessibility
The Illumina data is available at NCBI under the BioProject PRJNA284197.
References
1. Dawkins, R. & Krebs, J. R. 1979 Arms races between and within species. Proc. R. Soc. B
Biol. Sci. 205, 489–511.
2. Gandon, S. & Michalakis, Y. 2002 Local adaptation, evolutionary potential and host-
parasite coevolution: interactions between migration, mutation, population size and
generation time. J. Evol. Biol. 15, 451–462. (doi:10.1046/j.1420-9101.2002.00402.x)
3. Nelms, B. M., Macedo, P. A., Kothera, L., Savage, H. M. & Reisen, W. K. 2013
Overwintering biology of Culex (Diptera: Culicidae) mosquitoes in the Sacramento Valley
of California. J. Med. Entomol. 50, 773–90.
4. Arensburger, P. et al. 2010 Sequencing of Culex quinquefasciatus establishes a platform
for mosquito comparative genomics. Science 330, 86–8.
(doi:10.1126/science.1191864)
5. Farajollahi, A., Fonseca, D. M., Kramer, L. D. & Kilpatrick, A. M. 2011 “Bird biting”
mosquitoes and human disease: a review of the role of Culex pipiens complex
mosquitoes in epidemiology. Infect. Genet. Evol. 11, 1577–1585.
(doi:10.1016/j.meegid.2011.08.013)
6. Strickman, D. & Fonseca, D. M. 2012 Autogeny in Culex pipiens complex mosquitoes from
the San Francisco Bay Area. Am. J. Trop. Med. Hyg. 87, 719–26.
(doi:10.4269/ajtmh.2012.12-0079)
7. Spielman, A. 2001 Structure and seasonality of nearctic Culex pipiens populations. Ann.
N. Y. Acad. Sci. 951, 220–34.
8. Fonseca, D. M., Keyghobadi, N., Malcolm, C. A., Mehmet, C., Schaffner, F., Mogi, M.,
Fleischer, R. C. & Wilkerson, R. C. 2004 Emerging vectors in the Culex pipiens complex.
Science 303, 1535–8. (doi:10.1126/science.1094247)
9. Gomes, B. et al. 2009 Asymmetric introgression between sympatric molestus and pipiens
forms of Culex pipiens (Diptera: Culicidae) in the Comporta region, Portugal. BMC Evol.
Biol. 9, 262. (doi:10.1186/1471-2148-9-262)
10. Cornel, A., Lee, Y., Fryxell, R. T., Siefert, S., Nieman, C. & Lanzaro, G. 2012 Culex pipiens
Sensu Lato in California : A Complex Within a Complex? J. Am. Mosq. Control Assoc. 28,
113–121.
11. Li, H. & Durbin, R. 2009 Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754–60. (doi:10.1093/bioinformatics/btp324)
12. Bahnck, C. M. & Fonseca, D. M. 2006 Rapid assay to identify the two genetic forms of
Culex (Culex) pipiens L. (Diptera: Culicidae) and hybrid populations. Am. J. Trop. Med.
Hyg. 75, 251–5.
13. Hesson, J. C., Lundström, J. O., Halvarsson, P., Erixon, P. & Collado, A. 2010 A sensitive
and reliable restriction enzyme assay to distinguish between the mosquitoes Culex
torrentium and Culex pipiens. Med. Vet. Entomol. 24, 142–9. (doi:10.1111/j.1365-
2915.2010.00871.x)
14. Remolina, S. C., Chang, P. L., Leips, J., Nuzhdin, S. V & Hughes, K. A. 2012 GENOMIC BASIS
OF AGING AND LIFE-HISTORY EVOLUTION IN DROSOPHILA MELANOGASTER. Evolution (N.
Y). 66, 3390–3403. (doi:10.5061/dryad.94pv0)
15. Jalvingh, K. M., Chang, P. L., Nuzhdin, S. V & Wertheim, B. 2014 Genomic changes under
rapid evolution: selection for parasitoid resistance. Proc. Biol. Sci. 281, 20132303.
(doi:10.1098/rspb.2013.2303)
16. Stamatakis, A. 2006 RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses
with thousands of taxa and mixed models. Bioinformatics 22, 2688–90.
(doi:10.1093/bioinformatics/btl446)
17. Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., Kosiol,
C. & Schlötterer, C. 2011 PoPoolation: a toolbox for population genetic analysis of next
generation sequencing data from pooled individuals. PLoS One 6, e15925.
(doi:10.1371/journal.pone.0015925)
18. Nielsen, R., Williamson, S., Kim, Y., Hubisz, M. J., Clark, A. G. & Bustamante, C. 2005
Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566–75.
(doi:10.1101/gr.4252305)
19. Boitard, S., Schlötterer, C. & Futschik, A. 2009 Detecting selective sweeps: a new
approach based on hidden markov models. Genetics 181, 1567–78.
(doi:10.1534/genetics.108.100032)
20. Boitard, S., Schlötterer, C., Nolte, V., Pandey, R. V. & Futschik, A. 2012 Detecting selective
sweeps from pooled next-generation sequencing samples. Mol. Biol. Evol. 29, 2177–86.
(doi:10.1093/molbev/mss090)
21. Boitard, S., Kofler, R., Françoise, P., Robelin, D., Schlötterer, C. & Futschik, A. 2013 Pool-
hmm: a Python program for estimating the allele frequency spectrum and detecting
selective sweeps from next generation sequencing of pooled samples. Mol. Ecol. Resour.
13, 337–40. (doi:10.1111/1755-0998.12063)
22. Zheng, Y. et al. 2010 Histone H1 phosphorylation is associated with transcription by RNA
polymerases I and II. J. Cell Biol. 189, 407–15. (doi:10.1083/jcb.201001148)
23. Harshman, S. W., Young, N. L., Parthun, M. R. & Freitas, M. A. 2013 H1 histones: current
perspectives and challenges. Nucleic Acids Res. , 1–17. (doi:10.1093/nar/gkt700)
24. Zheng, Q. & Wang, X.-J. 2008 GOEAST: a web-based software toolkit for Gene Ontology
enrichment analysis. Nucleic Acids Res. 36, W358–63. (doi:10.1093/nar/gkn276)
25. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. 2012 KEGG for integration
and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–14.
(doi:10.1093/nar/gkr988)
26. Brennan, R. G. 1993 The Winged-Helix DNA-Binding Motif: Another Helix-Turn-Helix
Takeoff. Cell 74, 773–776.
27. Kasinsky, H. E., Lewis, J. D., Dacks, J. B. & Ausió, J. 2001 Origin of H1 linker histones.
FASEB J. 15, 34–42. (doi:10.1096/fj.00-0237rev)
28. Byrne, K. & Nichols, R. A. 1999 Culex pipiens in London Underground tunnels:
differentiation between surface and subterranean populations. Heredity (Edinb). 82 (Part
1), 7–15.
29. Becker, N., Jöst, A. & Weitzel, T. 2012 The Culex pipiens complex in Europe. J. Am. Mosq.
Control Assoc. 28, 53–67.
30. Kothera, L., Godsey, M., Mutebi, J.-P. & Savage, H. M. 2012 A comparison of above-
ground and below-ground populations of Culex pipiens pipiens in Chicago, Illinois, and
New York City, New York, using 2 microsatellite assays. J. Am. Mosq. Control Assoc. 28,
106–12.
31. Kothera, L., Godsey, M., Mutebi, J.-P. & Savage, H. M. 2010 A Comparison of
Aboveground and Belowground Populations of Culex pipiens (Diptera: Culicidae)
Mosquitoes in Chicago, Illinois, and New York City, New York, Using Microsatellites. J.
Med. Entomol. 47, 805–813. (doi:10.1603/ME10031)
32. Kent, R. J., Harrington, L. C. & Norris, D. E. 2007 Genetic Differences Between Culex
pipiens f . molestus and Culex pipiens pipiens (Diptera : Culicidae) in New York. J. Med.
Entomol. 44, 50–59.
33. Weitzel, T., Collado, A., Jöst, A., Pietsch, K., Storch, V. & Becker, N. 2009 Genetic
Differentiation of Populations within the Culex pipiens Complex and Phylogeny of Related
Species. J. Am. Mosq. Control Assoc. 25, 6–17.
34. Zhu, Y., Bergland, A. O., González, J. & Petrov, D. A. 2012 Empirical validation of pooled
whole genome population re-sequencing in Drosophila melanogaster. PLoS One 7,
e41901. (doi:10.1371/journal.pone.0041901)
35. Gautier, M. et al. 2013 Estimation of population allele frequencies from next-generation
sequencing data: pool-versus individual-based genotyping. Mol. Ecol. 22, 3766–79.
(doi:10.1111/mec.12360)
36. Ferretti, L., Ramos-Onsins, S. E. & Pérez-Enciso, M. 2013 Population genomics from pool
sequencing. Mol. Ecol. 22, 5561–76. (doi:10.1111/mec.12522)
37. Nevado, B., Ramos-Onsins, S. E. & Perez-Enciso, M. 2014 Resequencing studies of
nonmodel organisms using closely related reference genomes: optimal experimental
designs and bioinformatics approaches for population genomics. Mol. Ecol. 23, 1764–
1779. (doi:10.1111/mec.12693)
38. Nielsen, R. 2005 Molecular signatures of natural selection. Annu. Rev. Genet. 39,
197–218. (doi:10.1146/annurev.genet.39.073003.112420)
39. Vinogradova, E. B., Shaikevich, E. V & Ivanitsky, A. V 2007 A study of the distribution of
the Culex pipiens complex (Insecta : Diptera : Culicidae) mosquitoes in the European part
of Russia by molecular methods of identification. Comp. Cytogenet. 1, 129–138.
40. Hudson, R. R., Kreitman, M. & Aguade, M. 1987 A test of neutral molecular evolution
based on nucleotide data. Genetics 116, 153–159.
41. Kreitman, M. & Hudson, R. R. 1991 Inferring the evolutionary histories of the Adh and
Adh-dup loci in Drosophila melanogaster from patterns of polymorphism and divergence.
Genetics 127, 565–82.
42. Macpherson, J. M., Sella, G., Davis, J. C. & Petrov, D. A. 2007 Genomewide spatial
correspondence between nonsynonymous divergence and neutral polymorphism reveals
extensive adaptation in Drosophila. Genetics 177, 2083–99.
(doi:10.1534/genetics.107.080226)
43. Tajima, F. 1989 Statistical method for testing the neutral mutation hypothesis by DNA
polymorphism. Genetics 123, 585–95.
44. Sabeti, P. C. et al. 2006 Positive natural selection in the human lineage. Science
312, 1614–20. (doi:10.1126/science.1124309)
45. Malik, H. S. & Henikoff, S. 2001 Adaptive evolution of Cid, a centromere-specific histone
in Drosophila. Genetics 157, 1293–8.
46. Berdnikov, V. A., Bogdanova, V. S., Rozov, S. M. & Kosterin, E. 1993 Geographic patterns
of histone H1 allelic frequencies formed in the course of Pisum sativum L. (pea)
cultivation. Heredity (Edinb). 71, 199–209.
47. Kondrashov, F. A., Rogozin, I. B., Wolf, Y. I. & Koonin, E. V 2002 Selection in the evolution
of gene duplications. Genome Biol. 3, research0008.1–0008.9.
48. Harbach, R. E. 2011 Classification within the cosmopolitan genus Culex (Diptera:
Culicidae): the foundation for molecular systematics and phylogenetic research. Acta
Trop. 120, 1–14. (doi:10.1016/j.actatropica.2011.06.005)
49. Fujiwara, S., Aki, R., Yoshida, M., Higashibata, H., Imanaka, T. & Fukuda, W. 2008
Expression profiles and physiological roles of two types of molecular chaperonins from
the hyperthermophilic archaeon Thermococcus kodakarensis. Appl. Environ. Microbiol.
74, 7306–12. (doi:10.1128/AEM.01245-08)
50. Somer, L., Shmulman, O., Dror, T., Hashmueli, S. & Kashi, Y. 2002 The eukaryote
chaperonin CCT is a cold shock protein in Saccharomyces cerevisiae. Cell Stress
Chaperones 7, 47–54.
51. Kayukawa, T., Chen, B., Miyazaki, S., Itoyama, K., Shinoda, T. & Ishikawa, Y. 2005
Expression of mRNA for the t-complex polypeptide-1, a subunit of chaperonin CCT, is
upregulated in association with increased cold hardiness in Delia antiqua. Cell Stress
Chaperones 10, 204–10.
52. Kayukawa, T. & Ishikawa, Y. 2009 Chaperonin contributes to cold hardiness of the onion
maggot Delia antiqua through repression of depolymerization of actin at low
temperatures. PLoS One 4, e8277. (doi:10.1371/journal.pone.0008277)
53. Pecinka, A. & Mittelsten Scheid, O. 2012 Stress-induced chromatin changes: a critical
view on their heritability. Plant Cell Physiol. 53, 801–8. (doi:10.1093/pcp/pcs044)
54. Levine, M. T., Eckert, M. L. & Begun, D. J. 2011 Whole-genome expression plasticity
across tropical and temperate Drosophila melanogaster populations from Eastern
Australia. Mol. Biol. Evol. 28, 249–56. (doi:10.1093/molbev/msq197)
55. Levine, M. T. & Begun, D. J. 2008 Evidence of spatially varying selection acting on four
chromatin-remodeling loci in Drosophila melanogaster. Genetics 179, 475–85.
(doi:10.1534/genetics.107.085423)
56. Kolaczkowski, B., Kern, A. D., Holloway, A. K. & Begun, D. J. 2011 Genomic differentiation
between temperate and tropical Australian populations of Drosophila melanogaster.
Genetics 187, 245–60. (doi:10.1534/genetics.110.123059)
Supplementary 2.1: Methods details
Mosquito samples
Mosquito samples were taken from urban and suburban areas in Sacramento (California, US),
Moscow and Aleksin (Central Russia). This strategy was based on the general supposition that
urban samples would correspond to the molestus form and the suburban samples to the
pipiens form of Culex pipiens.
Aleksin samples. The Aleksin urban sample (A1) consisted of mosquito larvae collected from an
above-ground pond in the water-purification station located at center of Aleksin, surrounded by
a small forested park. Samples were collected in August 2011. The suburban sample from
Aleksin (A4) consisted of larvae collected from barrels at the garden plots in the holiday village
Bogucharovo, 11 km southeast of Aleksin. The location is surrounded by forests and farmlands.
There was no visible constraint except distance which could limit gene flow between natural
and urban populations from Aleksin.
Moscow samples. The Moscow urban sample (M1) consisted of larvae taken in autumn 2011
from a laboratory culture of Cx. p. f. molestus maintained at the Biological Evolution
Department of Moscow State University. Initial individuals for the culture had been caught in
autumn 2008 in the basement puddle of a Moscow house. Two C. torrentium samples were
taken from natural habitats around Moscow. M2 sample consisted of larvae and imagoes
collected in July 2011 in the Semkhoz train station near Sergiev Posad situated 60 km to the
north from Moscow. Larvae were taken from a wash-basin located near the forest specifically
for mosquito collection. M4 were collected at MSU Zvenigorod Biological station which is
situated 50 km to the west from Moscow. This station has for a century played the role of a
relatively undisturbed near-Moscow territory for biological studies with many relict habitats
and ecosystems. Larvae were collected in July 2011.
Sacramento samples. The Sacramento urban sample consisted of male specimens caught by
vacuum aspiration at a utility hole in Old Sacramento in August 2011. The suburban samples
were taken from catch basins at Sacramento zoo (August 2011) and Bartley Cavanaugh golf
course outside the urban district (April 2011). All sites are known to support autogenous
populations throughout the year.
Sequencing, SNP calling and Annotation
Sequencing. Genomic DNA was prepared from mosquitoes collected for each of the eight
populations into eight libraries and sequenced as paired-end 101bp reads on an Illumina
HiSeq. This process generated 407 million reads and 41 billion base pairs of nucleotide
sequences. Sequenced reads were aligned as pairs using BWA 0.5.7 [1] to the complete Culex
quinquefasciatus draft genome downloaded from the Broad Institute. Reads were allowed up
to 12 mismatches throughout the 101bp per end, and unique reads were mapped to the
genome. Unique reads were defined as those that mapped to only one position in the
reference and were identified as having the "XT:A:U" tag. All other BWA alignment parameters
were set to default values. Approximately 61% of the sequenced reads mapped uniquely to the
C. quinquefasciatus draft genome, resulting in 42X total coverage of 461MB of the genome.
Median and average coverage ranged 3-6X and 2-8X across the eight samples, respectively.
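The definition of unique reads given above (alignments carrying the "XT:A:U" tag) can be sketched as a simple filter over SAM records. This is a minimal illustration, not the script used for this study: the function names, read names, and contig name are hypothetical, and the toy records use "*" for the sequence and quality columns.

```python
def is_unique_bwa_hit(sam_line):
    """True if a SAM alignment record carries the BWA optional tag XT:A:U."""
    if sam_line.startswith("@"):  # header line, not an alignment
        return False
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional tags follow them.
    return "XT:A:U" in fields[11:]

def filter_unique(sam_lines):
    """Yield header lines and uniquely mapped alignment records."""
    for line in sam_lines:
        if line.startswith("@") or is_unique_bwa_hit(line):
            yield line

# Toy SAM content (hypothetical contig and read names)
sam_lines = [
    "@SQ\tSN:contig1\tLN:5000",
    "r1\t99\tcontig1\t100\t37\t101M\t=\t300\t301\t*\t*\tXT:A:U\tNM:i:2",
    "r2\t99\tcontig1\t500\t0\t101M\t=\t700\t301\t*\t*\tXT:A:R\tNM:i:0",
]
kept = list(filter_unique(sam_lines))
print(len(kept), "lines kept")
```

Checking only the optional-tag columns avoids false matches against read names or other mandatory fields.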
SNP-calling and Annotation. SNP calling was done using the GATK Unified Genotyper [2] after
base quality score recalibration, indel realignment, and duplicate removal across all eight
samples simultaneously [3]. We detected 6,685,360 segregating sites that contained more than
one allele among the samples, of which all alleles were called either fixed in one population or
existing in more than one population. Polymorphic and monomorphic sites identified by this
method were subsequently used for calculation of Fst and creation of phylogenetic trees.
Population genetic analyses
Species and form designation. Reads mapping to the Barcode region of the Cytochrome Oxidase
subunit I gene were used to check species identities [4,5]. The CQ11 microsatellite locus was
likewise analyzed to identify the samples as pipiens or molestus form [6,7].
Genome-wide distribution of variation. The two polymorphism parameters π and θ were
calculated for 10kb non-overlapping sliding windows using the software PoPoolation version
1.2.2 [8]. This software incorporates methods to correct for biases due to pooled sequencing in
estimation of the aforementioned parameters. Only positions with coverage in the range of 4-
40X were used and the minimal legitimate count for the minor allele was set to 2. Synonymous
and nonsynonymous polymorphisms were assigned using the same software and the .gff file
downloaded from the Broad Institute website.
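As a rough illustration of the two filters above (coverage of 4-40X and a minimum minor allele count of 2), the sketch below classifies a single pooled site from its per-allele read counts. The function name and the three-way classification are assumptions made for illustration; how PoPoolation applies these thresholds internally is not reproduced here.

```python
def classify_site(allele_counts, min_cov=4, max_cov=40, min_count=2):
    """Classify one pooled site from a dict of allele -> read count.

    Returns "excluded" (coverage outside the accepted range),
    "polymorphic" (at least two alleles each supported by min_count reads),
    or "monomorphic" (otherwise).
    """
    coverage = sum(allele_counts.values())
    if not (min_cov <= coverage <= max_cov):
        return "excluded"
    supported = [a for a, c in allele_counts.items() if c >= min_count]
    return "polymorphic" if len(supported) >= 2 else "monomorphic"
```

Under these assumptions, a site with counts A=10 and C=1 is treated as monomorphic, because a single read of the minor allele is indistinguishable from a sequencing error.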
Population differentiation and admixture. Fst was calculated for 10kb sliding windows between
each pair of populations according to the methods used in [9,10] and averaged across the
genome. Maximum likelihood phylogenetic trees were constructed from non-overlapping 10kb
windows using RAxML [11]. Only positions monomorphic within every sample were
included. The resulting trees were subsequently examined to find the neighborhood status of
samples across the windows. Fst is calculated using information from both polymorphic
and monomorphic sites. By using only monomorphic sites to build the
phylogenetic trees, we attempted to capture deeper differentiation events that have resulted
in fixed differences among populations.
PCA on allele frequencies was also performed to examine population structure, once with all
the 8 populations and once only with the 6 C. pipiens ones. In either case, only biallelic
positions with coverage 4-40X were included in the analysis. Allele frequencies were calculated
for the wild type allele at each position (not necessarily the reference allele). Individual allele
frequencies at each position were centered on the mean of frequencies of that position across
populations. PCA was done on these mean-centered frequencies of wild type alleles.
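The centering step that precedes the PCA can be sketched as follows. This is an illustration only: the input layout (one row per position, one frequency per population) and the function name are assumptions, and the PCA itself, which was run in SAS/JMP, is not reproduced.

```python
def center_frequencies(freqs):
    """Center allele frequencies on the per-position mean.

    freqs: list of rows, one per genomic position; each row holds the
    frequency of the tracked allele in each population.
    Returns rows of the same shape whose values sum to zero per position.
    """
    centered = []
    for row in freqs:
        mean = sum(row) / len(row)
        centered.append([f - mean for f in row])
    return centered
```

After centering, each position contributes only the deviation of each population from the across-population mean at that position.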
Divergence (proportion of “fixed” differences) of each of the populations from the reference
sequence was calculated as the fraction of sites with coverage 6 ≤ n ≤ 35 where at least n-1 bases
were one type of derived (non-reference) base. In other words, only one base other than the
major derived allele was allowed for a site to be considered fixed for the derived allele. The
reason for choosing the coverage range of [6-35] was that in this interval, with one mismatch,
the null hypothesis of fixation for a derived allele and sequencing error rate of 1% (typical of
Illumina) would not be rejected with a one-tailed binomial test at α=0.05.
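The coverage bounds can be checked numerically. The sketch below assumes that "not rejected" means the one-tailed binomial probability of observing at least one erroneous base is at least α. As a further assumption not stated in the text, it takes the upper bound to be the largest coverage at which two mismatches would still be rejected, so that exactly one mismatch is tolerated. Function names are illustrative.

```python
from math import comb


def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def coverage_range_one_mismatch(p_err=0.01, alpha=0.05, max_n=100):
    """Coverages where one mismatch is compatible with fixation but two are not."""
    ok = [
        n
        for n in range(2, max_n + 1)
        # one mismatch: fixation not rejected; two mismatches: rejected
        if upper_tail(n, 1, p_err) >= alpha and upper_tail(n, 2, p_err) < alpha
    ]
    return ok[0], ok[-1]


print(coverage_range_one_mismatch())  # (6, 35)
```

With a 1% error rate and α=0.05, this reading reproduces the interval [6, 35]: at n=5 even a single mismatch is significant (P≈0.049), and at n=36 two mismatches are no longer significant (P≈0.0503).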
Genomic signatures of natural selection. Two different methods were used to identify regions
under selection. Tajima’s D was obtained for genes and for 10kb non-overlapping sliding
windows using PoPoolation 1.2.2. The same coverage and minor allele count filters as for the
calculation of π and θ (above) were applied. Alternatively, the hidden Markov model (HMM)-based
method implemented in the software package Pool-hmm was used to identify the selective
sweep regions [12]. To parallelize the Pool-hmm process, it was run in two steps. First, the allele
frequency spectrum (AFS) was built based on the whole genome for each sample with an
acceptable coverage range of 4-40X, theta=0.02 (based on the PoPoolation output, see results)
and a sampling ratio of 20 (5% of positions were used for estimation of the AFS). Second, sweep
regions were detected separately for each supercontig (parallelized) with the same coverage
range as above and transition probability of k=1e-6 based on the AFS created in the previous
step.
Gene Ontology (GO) enrichment analysis on targets of selection was performed using the online
software GOEAST [13]. GO annotations for Culex genes were downloaded from
vectorbase.org/biomart. The annotation file was slightly reformatted in Python to be usable by
GOEAST, and was used as the default background set. For each population two enrichment
tests were run with different selected gene sets: 1) 200 genes (~1% of the total number of
genes in the genome) with highest pool-hmm scores, and 2) 200 genes with lowest Tajima’s D.
Enrichments with FDR<0.1 were regarded as significant. There is a body of literature on gene
length bias in GO analysis of RNA-seq results [14,15]. The bias with RNA-seq data stems
from higher power for detection of differential expression in longer genes. We set out to
investigate whether our GO analysis of the genomic hits for selection also suffered from this
bias by comparing the length of the genes in each selected group of 200 with that of the
remaining genes. We created a flag variable indicating whether a gene belonged to the group of
200 highest Pool-hmm scores in each of the 8 populations. Then, we performed a two-way ANOVA
to test the association of gene length with the state of this flag (0 or 1) and the population
it came from. We repeated the same test on the genes with lowest Tajima’s D.
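The construction of the flag variable can be sketched as below. The dictionaries and function names are hypothetical, and only a descriptive comparison of mean gene length is shown; the two-way ANOVA itself was run in SAS/JMP and is not reproduced here.

```python
def flag_top_genes(scores, n=200, highest=True):
    """Return {gene: 0 or 1}; 1 marks the n genes with the most extreme scores.

    highest=True selects the largest scores (e.g. Pool-hmm),
    highest=False the smallest (e.g. most negative Tajima's D).
    """
    ranked = sorted(scores, key=scores.get, reverse=highest)
    selected = set(ranked[:n])
    return {gene: int(gene in selected) for gene in scores}


def mean_length_by_flag(flags, lengths):
    """Mean gene length within the flagged (1) and unflagged (0) groups."""
    groups = {0: [], 1: []}
    for gene, flag in flags.items():
        groups[flag].append(lengths[gene])
    return {flag: sum(v) / len(v) for flag, v in groups.items() if v}
```

One flag dictionary per population, combined with gene length, gives the long-format table (gene, population, flag, length) that a two-way ANOVA would take as input.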
Case study of histones
Secondary structure prediction and calculation of solvent accessibility were done via the online
Jpred server [16]. Delineation of domain boundaries was achieved through multiple sequence
alignment of Culex histones with similar sequences from human, chicken, the midge
Chironomus pallidivittatus and fruit fly – for which domain annotations were available on
Uniprot.
As a matter of interest, we checked for overrepresented amino acid conversions among the
polymorphic positions. To avoid loss of power due to issues of multiple-comparison testing, we
subsampled the dataset for initial hypothesis generation. Two H1 paralogs from two
populations were selected at random and nonsynonymous substitutions were visually scanned
(no statistical tests). An unusually high number of conversions to proline was the most
conspicuous observation. Because of the well-known structural peculiarities of proline and its
indirect role in epigenetic modification of histones [17–19], we decided to test the hypothesis
of excess conversions to Pro in the main dataset formally. To do this, first we extracted the
composition of the reference codons that had converted to Pro in the histone genes. Then, we
subsampled the same compositions of codons 50 times from polymorphic positions elsewhere
in the genome and compared the number of conversions to Pro with the number observed for
the histone block by a t test.
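The resampling comparison can be sketched as follows, under several assumptions that the text does not specify: the codon composition is represented as a codon-to-count mapping, background positions are drawn without replacement, and the t test is a one-sample test of the 50 resampled counts against the single observed histone count. All names and data structures are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def resample_pro_counts(background, composition, n_rep=50, seed=0):
    """Resample genome-wide positions matching the histone codon composition.

    background: dict codon -> list of 0/1 flags, one per polymorphic position
        elsewhere in the genome with that reference codon (1 = change to Pro).
    composition: dict codon -> number of codons of that type to draw.
    Returns n_rep counts of conversions to Pro.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_rep):
        total = 0
        for codon, k in composition.items():
            total += sum(rng.sample(background[codon], k))
        counts.append(total)
    return counts


def one_sample_t(samples, observed):
    """t statistic for H0: mean(samples) == observed (df = len(samples) - 1)."""
    sd = stdev(samples)
    if sd == 0:
        raise ValueError("resampled counts have zero variance")
    return (mean(samples) - observed) / (sd / sqrt(len(samples)))
```

A strongly negative t statistic would indicate that the genomic background yields fewer conversions to Pro than observed in the histone block for the same codon composition.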
Statistical procedures
All statistical analyses including calculation of descriptive statistics, correlation tests and
principal component analysis (PCA) were done using SAS v9.3 and SAS JMP Pro 10.0.0.
References:
1. Li, H. & Durbin, R. 2009 Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754–60. (doi:10.1093/bioinformatics/btp324)
2. McKenna, A. et al. 2010 The Genome Analysis Toolkit: a MapReduce framework for
analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–303.
(doi:10.1101/gr.107524.110)
3. DePristo, M. A. et al. 2011 A framework for variation discovery and genotyping using
next-generation DNA sequencing data. Nat. Genet. 43, 491–8. (doi:10.1038/ng.806)
4. Hesson, J. C., Lundström, J. O., Halvarsson, P., Erixon, P. & Collado, A. 2010 A sensitive
and reliable restriction enzyme assay to distinguish between the mosquitoes Culex
torrentium and Culex pipiens. Med. Vet. Entomol. 24, 142–9.
(doi:10.1111/j.1365-2915.2010.00871.x)
5. Hebert, P. D. N., Cywinska, A., Ball, S. L. & deWaard, J. R. 2003 Biological identifications
through DNA barcodes. Proc. Biol. Sci. 270, 313–21. (doi:10.1098/rspb.2002.2218)
6. Bahnck, C. M. & Fonseca, D. M. 2006 Rapid assay to identify the two genetic forms of
Culex (Culex) pipiens L. (Diptera: Culicidae) and hybrid populations. Am. J. Trop. Med.
Hyg. 75, 251–5.
7. Huang, S., Molaei, G. & Andreadis, T. G. 2008 Genetic insights into the population
structure of Culex pipiens (Diptera: Culicidae) in the Northeastern United States by using
microsatellite analysis. Am. J. Trop. Med. Hyg. 79, 518–27.
8. Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., Kosiol,
C. & Schlötterer, C. 2011 PoPoolation: a toolbox for population genetic analysis of next
generation sequencing data from pooled individuals. PLoS One 6, e15925.
(doi:10.1371/journal.pone.0015925)
9. Remolina, S. C., Chang, P. L., Leips, J., Nuzhdin, S. V & Hughes, K. A. 2012 Genomic basis
of aging and life-history evolution in Drosophila melanogaster. Evolution (N.
Y). 66, 3390–3403. (doi:10.5061/dryad.94pv0)
10. Jalvingh, K. M., Chang, P. L., Nuzhdin, S. V & Wertheim, B. 2014 Genomic changes under
rapid evolution: selection for parasitoid resistance. Proc. Biol. Sci. 281, 20132303.
(doi:10.1098/rspb.2013.2303)
11. Stamatakis, A. 2006 RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses
with thousands of taxa and mixed models. Bioinformatics 22, 2688–90.
(doi:10.1093/bioinformatics/btl446)
12. Boitard, S., Kofler, R., Françoise, P., Robelin, D., Schlötterer, C. & Futschik, A. 2013 Pool-
hmm: a Python program for estimating the allele frequency spectrum and detecting
selective sweeps from next generation sequencing of pooled samples. Mol. Ecol. Resour.
13, 337–40. (doi:10.1111/1755-0998.12063)
13. Zheng, Q. & Wang, X.-J. 2008 GOEAST: a web-based software toolkit for Gene Ontology
enrichment analysis. Nucleic Acids Res. 36, W358–63. (doi:10.1093/nar/gkn276)
14. Young, M. D., Wakefield, M. J., Smyth, G. K. & Oshlack, A. 2010 Gene ontology analysis
for RNA-seq: accounting for selection bias. Genome Biol. 11, R14.
(doi:10.1186/gb-2010-11-2-r14)
15. Mi, G., Di, Y., Emerson, S., Cumbie, J. S. & Chang, J. H. 2012 Length Bias Correction in
Gene Ontology Enrichment Analysis Using Logistic Regression. PLoS One 7.
(doi:10.1371/journal.pone.0046128)
16. Cole, C., Barber, J. D. & Barton, G. J. 2008 The Jpred 3 secondary structure prediction
server. Nucleic Acids Res. 36, W197–201. (doi:10.1093/nar/gkn238)
17. Raghuram, N., Strickfaden, H., McDonald, D., Williams, K., Fang, H., Mizzen, C., Hayes, J.
J., Th’ng, J. & Hendzel, M. J. 2013 Pin1 promotes histone H1 dephosphorylation and
stabilizes its binding to chromatin. J. Cell Biol. 203, 57–71. (doi:10.1083/jcb.201305159)
18. Nelson, C. J., Santos-Rosa, H. & Kouzarides, T. 2006 Proline isomerization of histone H3
regulates lysine methylation and gene expression. Cell 126, 905–16.
(doi:10.1016/j.cell.2006.07.026)
19. Harshman, S. W., Young, N. L., Parthun, M. R. & Freitas, M. A. 2013 H1 histones: current
perspectives and challenges. Nucleic Acids Res. , 1–17. (doi:10.1093/nar/gkt700)
Supplementary 2.2: PCA of allele frequencies
Figure Suppl. 2. PCA results on allele frequencies at biallelic positions within the six C. pipiens populations (a,c) and the eight Culex
populations (b,d). The “fwc” prefix stands for frequency of wild type allele, centered on the mean. Satisfying the conditions of being
biallelic and having 4-40X coverage in all of the tested populations, 1,890,722 positions were used for (a,c) and 814,469 positions were
used for (b,d).
Supplementary 2.3: Summary of Tajima’s D and Pool-hmm
Histograms and average Tajima’s D values for the 8 populations
A1; average=-0.1896
A4; average=-0.1650
M1; average=-0.2576
M2; average=-0.4509
M4; average=-0.0873
S1; average=-0.0744
S2; average=-0.2083
S3; average=-0.2129
Distribution of lengths and scores of the hits detected by Pool-hmm
[Figure panels, one per population: A1, A4, M1, M2, M4, S1, S2, S3]
Supplementary 2.4: Functional analysis of sweep genes
Gene Ontology (GO) enrichment analysis
For each population, 200 genes (~1% of the number of annotated genes in the genome) with
the highest Pool-hmm scores or the most negative Tajima’s D values were analyzed for
enrichment of GO terms. Only terms with False Discovery Rate (FDR)<0.1 are listed.
The ANOVA test for the association of gene length with Pool-hmm was significant, but the
effect was extremely small (p=0.0002, R²=0.000256). A very similar result was obtained with the
genes having the lowest Tajima’s D (p<0.0001, R²=0.000899). These tests indicate that
gene length was not a substantial confounder in our GO analyses of selection targets
detected by either the Pool-hmm or the Tajima’s D method. Details of how the ANOVA test was
performed are provided in Suppl. 2.1.
GO enrichment of top 200 Pool-hmm hits for each population:
A1
GOID Ontology Term Log odds-ratio FDR
A4
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 5.299 6.14E-99
GO:0034728 biological_process nucleosome organization 5.299 6.14E-99
GO:0071824 biological_process protein-DNA complex subunit organization 5.299 6.14E-99
GO:0006325 biological_process chromatin organization 4.685 3.98E-84
GO:0044427 cellular_component chromosomal part 4.575 1.86E-81
GO:0005694 cellular_component chromosome 4.426 4.78E-79
GO:0051276 biological_process chromosome organization 4.207 6.72E-75
GO:0043933 biological_process macromolecular complex subunit organization 4.095 4.15E-71
GO:0006996 biological_process organelle organization 2.977 4.83E-51
GO:0043228 cellular_component non-membrane-bounded organelle 2.778 2.09E-46
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 2.778 2.09E-46
GO:0044446 cellular_component intracellular organelle part 2.442 1.80E-42
GO:0044422 cellular_component organelle part 2.388 8.46E-42
GO:0071840 biological_process cellular component organization or biogenesis 2.395 1.64E-39
GO:0016043 biological_process cellular component organization 2.439 2.44E-39
GO:0003676 molecular_function nucleic acid binding 1.993 2.11E-28
GO:0043231 cellular_component intracellular membrane-bounded organelle 1.51 3.16E-23
GO:0043227 cellular_component membrane-bounded organelle 1.476 7.70E-23
GO:0005622 cellular_component intracellular 1.119 5.24E-22
GO:0044424 cellular_component intracellular part 1.128 6.73E-22
GO:0043229 cellular_component intracellular organelle 1.314 1.54E-21
GO:0044763 biological_process single-organism cellular process 1.418 3.16E-21
GO:0043226 cellular_component organelle 1.299 3.32E-21
GO:1901363 molecular_function heterocyclic compound binding 1.325 1.80E-19
GO:0097159 molecular_function organic cyclic compound binding 1.319 2.26E-19
GO:0044699 biological_process single-organism process 1.107 1.56E-15
GO:0044444 cellular_component cytoplasmic part 1.935 4.66E-14
GO:0005623 cellular_component cell 0.776 1.01E-13
GO:0044464 cellular_component cell part 0.776 1.01E-13
GO:0005737 cellular_component cytoplasm 1.893 1.28E-13
GO:0009987 biological_process cellular process 0.806 8.51E-13
GO:0043189 cellular_component H4/H2A histone acetyltransferase complex 5.379 1.55E-08
GO:0005575 cellular_component cellular_component 0.498 6.74E-08
GO:0005488 molecular_function binding 0.456 1.58E-07
GO:0000123 cellular_component histone acetyltransferase complex 4.478 2.31E-06
GO:0008150 biological_process biological_process 0.302 4.87E-04
GO:0044428 cellular_component nuclear part 1.428 5.47E-02
GO:0005654 cellular_component nucleoplasm 2.241 7.78E-02
GO:0044451 cellular_component nucleoplasm part 2.241 7.78E-02
GO:0005634 cellular_component nucleus 1.357 8.39E-02
M1
GOID Ontology Term Log odds-ratio FDR
M2
GOID Ontology Term Log odds-ratio FDR
M4
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 5.362 3.82E-99
GO:0034728 biological_process nucleosome organization 5.362 3.82E-99
GO:0071824 biological_process protein-DNA complex subunit organization 5.362 3.82E-99
GO:0006325 biological_process chromatin organization 4.768 1.74E-86
GO:0044427 cellular_component chromosomal part 4.638 5.20E-82
GO:0005694 cellular_component chromosome 4.488 1.08E-79
GO:0051276 biological_process chromosome organization 4.27 1.17E-75
GO:0043933 biological_process macromolecular complex subunit organization 4.118 1.15E-68
GO:0006996 biological_process organelle organization 2.987 1.14E-48
GO:0043228 cellular_component non-membrane-bounded organelle 2.768 3.90E-43
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 2.768 3.90E-43
GO:0016043 biological_process cellular component organization 2.486 1.33E-39
GO:0071840 biological_process cellular component organization or biogenesis 2.425 6.88E-39
GO:0044446 cellular_component intracellular organelle part 2.386 8.34E-37
GO:0044422 cellular_component organelle part 2.333 2.87E-36
GO:0003676 molecular_function nucleic acid binding 2.146 2.24E-34
GO:0043227 cellular_component membrane-bounded organelle 1.604 6.86E-28
GO:0043231 cellular_component intracellular membrane-bounded organelle 1.623 1.73E-27
GO:0044763 biological_process single-organism cellular process 1.547 4.64E-26
GO:0043229 cellular_component intracellular organelle 1.368 7.09E-23
GO:0043226 cellular_component organelle 1.353 1.60E-22
GO:0044699 biological_process single-organism process 1.234 5.60E-20
GO:1901363 molecular_function heterocyclic compound binding 1.36 7.91E-20
GO:0097159 molecular_function organic cyclic compound binding 1.355 9.87E-20
GO:0044444 cellular_component cytoplasmic part 2.107 2.03E-17
GO:0005737 cellular_component cytoplasm 2.065 6.69E-17
GO:0044424 cellular_component intracellular part 1.039 2.70E-16
GO:0005622 cellular_component intracellular 1.018 8.13E-16
GO:0009987 biological_process cellular process 0.81 3.48E-12
GO:0005623 cellular_component cell 0.707 5.06E-10
GO:0044464 cellular_component cell part 0.707 5.06E-10
GO:0043189 cellular_component H4/H2A histone acetyltransferase complex 5.461 1.08E-08
GO:0000123 cellular_component histone acetyltransferase complex 4.561 1.71E-06
GO:0005488 molecular_function binding 0.437 2.86E-06
GO:0005575 cellular_component cellular_component 0.451 1.10E-05
GO:0016863 molecular_function intramolecular oxidoreductase activity, transposing C=C bonds 5.824 1.40E-03
GO:0006570 biological_process tyrosine metabolic process 5.146 7.47E-03
GO:0008150 biological_process biological_process 0.262 1.56E-02
GO:0009072 biological_process aromatic amino acid family metabolic process 4.561 2.70E-02
GO:0016860 molecular_function intramolecular oxidoreductase activity 4.239 5.06E-02
GO:0005654 cellular_component nucleoplasm 2.324 5.09E-02
GO:0044451 cellular_component nucleoplasm part 2.324 5.09E-02
S1
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 5.262 1.71E-96
GO:0034728 biological_process nucleosome organization 5.262 1.71E-96
GO:0071824 biological_process protein-DNA complex subunit organization 5.262 1.71E-96
GO:0006325 biological_process chromatin organization 4.647 5.79E-82
GO:0044427 cellular_component chromosomal part 4.518 1.31E-77
GO:0005694 cellular_component chromosome 4.348 1.22E-73
GO:0051276 biological_process chromosome organization 4.131 9.95E-70
GO:0043933 biological_process macromolecular complex subunit organization 4.058 3.56E-69
GO:0006996 biological_process organelle organization 2.868 8.67E-45
GO:0043228 cellular_component non-membrane-bounded organelle 2.724 8.98E-44
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 2.724 8.98E-44
GO:0016043 biological_process cellular component organization 2.331 6.40E-34
GO:0003676 molecular_function nucleic acid binding 2.063 2.46E-32
GO:0071840 biological_process cellular component organization or biogenesis 2.252 2.56E-32
GO:0044422 cellular_component organelle part 2.196 1.04E-31
GO:0044446 cellular_component intracellular organelle part 2.23 2.18E-31
GO:1901363 molecular_function heterocyclic compound binding 1.352 5.82E-21
GO:0097159 molecular_function organic cyclic compound binding 1.347 7.27E-21
GO:0044763 biological_process single-organism cellular process 1.304 8.37E-17
GO:0043227 cellular_component membrane-bounded organelle 1.279 3.36E-15
GO:0043231 cellular_component intracellular membrane-bounded organelle 1.293 6.08E-15
GO:0043229 cellular_component intracellular organelle 1.114 1.38E-13
GO:0043226 cellular_component organelle 1.099 2.60E-13
GO:0044444 cellular_component cytoplasmic part 1.886 3.33E-13
GO:0005737 cellular_component cytoplasm 1.845 9.36E-13
GO:0044699 biological_process single-organism process 0.998 7.77E-12
GO:0005622 cellular_component intracellular 0.861 9.25E-11
GO:0044424 cellular_component intracellular part 0.867 1.06E-10
GO:0043189 cellular_component H4/H2A histone acetyltransferase complex 5.361 1.38E-08
GO:0009987 biological_process cellular process 0.656 1.58E-07
GO:0000123 cellular_component histone acetyltransferase complex 4.461 2.10E-06
GO:0005488 molecular_function binding 0.416 4.40E-06
GO:0005623 cellular_component cell 0.539 2.15E-05
GO:0044464 cellular_component cell part 0.539 2.15E-05
GO:0005654 cellular_component nucleoplasm 2.416 1.13E-02
GO:0044451 cellular_component nucleoplasm part 2.416 1.13E-02
S2
GOID Ontology Term Log odds-ratio FDR
S3
GOID Ontology Term Log odds-ratio FDR
GO:0032991 cellular_component macromolecular complex 1.062 5.32E-02
GO:0000166 molecular_function nucleotide binding 0.935 5.37E-02
GO:1901265 molecular_function nucleoside phosphate binding 0.935 5.37E-02
GO:0036094 molecular_function small molecule binding 0.913 5.66E-02
GO:0005622 cellular_component intracellular 0.54 6.52E-02
GO:0030529 cellular_component ribonucleoprotein complex 1.402 6.52E-02
GO:0012505 cellular_component endomembrane system 2.301 6.52E-02
GO enrichment of 200 genes with the most negative Tajima’s D for each population:
A1
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 2.941 4.55E-05
GO:0006996 biological_process organelle organization 1.655 4.55E-05
GO:0051276 biological_process chromosome organization 2.259 7.57E-05
GO:0034728 biological_process nucleosome organization 2.825 1.24E-04
GO:0043933 biological_process macromolecular complex subunit organization 2.167 1.24E-04
GO:0071824 biological_process protein-DNA complex subunit organization 2.825 1.24E-04
GO:0044427 cellular_component chromosomal part 2.403 1.31E-04
GO:0043228 cellular_component non-membrane-bounded organelle 1.456 1.92E-04
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 1.456 1.92E-04
GO:0016043 biological_process cellular component organization 1.322 1.92E-04
GO:0005694 cellular_component chromosome 2.234 3.76E-04
GO:0071840 biological_process cellular component organization or biogenesis 1.243 4.73E-04
GO:0044424 cellular_component intracellular part 0.648 7.97E-04
GO:0005622 cellular_component intracellular 0.627 1.36E-03
GO:0006325 biological_process chromatin organization 2.191 5.00E-03
GO:0032991 cellular_component macromolecular complex 1.053 5.00E-03
GO:0044422 cellular_component organelle part 1.039 1.91E-02
GO:0007049 biological_process cell cycle 1.638 3.15E-02
GO:0044446 cellular_component intracellular organelle part 1.017 4.04E-02
GO:0022402 biological_process cell cycle process 1.735 7.84E-02
A4
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 4.648 1.12E-45
GO:0034728 biological_process nucleosome organization 4.614 1.71E-44
GO:0071824 biological_process protein-DNA complex subunit organization 4.614 1.71E-44
GO:0044427 cellular_component chromosomal part 3.904 5.65E-36
GO:0006325 biological_process chromatin organization 3.979 5.65E-36
GO:0005694 cellular_component chromosome 3.767 3.96E-35
GO:0051276 biological_process chromosome organization 3.562 2.89E-33
GO:0043933 biological_process macromolecular complex subunit organization 3.437 7.78E-31
GO:0006996 biological_process organelle organization 2.344 1.19E-19
GO:0003676 molecular_function nucleic acid binding 1.785 2.38E-18
GO:0071840 biological_process cellular component organization or biogenesis 1.927 1.69E-17
GO:0016043 biological_process cellular component organization 1.955 3.85E-17
GO:0043228 cellular_component non-membrane-bounded organelle 2.115 1.80E-16
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 2.115 1.80E-16
GO:0044446 cellular_component intracellular organelle part 1.873 4.23E-16
GO:0044422 cellular_component organelle part 1.802 3.70E-15
GO:0044424 cellular_component intracellular part 0.885 3.31E-10
GO:0097159 molecular_function organic cyclic compound binding 1.086 3.31E-10
GO:1901363 molecular_function heterocyclic compound binding 1.091 3.31E-10
GO:0043231 cellular_component intracellular membrane-bounded organelle 1.171 3.31E-10
GO:0044444 cellular_component cytoplasmic part 1.794 3.31E-10
GO:0043227 cellular_component membrane-bounded organelle 1.143 4.76E-10
GO:0043229 cellular_component intracellular organelle 1.033 4.97E-10
GO:0005622 cellular_component intracellular 0.865 6.91E-10
GO:0005737 cellular_component cytoplasm 1.753 6.91E-10
GO:0043226 cellular_component organelle 1.018 7.76E-10
GO:0044763 biological_process single-organism cellular process 1.016 1.42E-07
GO:0005623 cellular_component cell 0.578 9.55E-06
GO:0044464 cellular_component cell part 0.578 9.55E-06
GO:0044699 biological_process single-organism process 0.708 3.38E-04
GO:0009987 biological_process cellular process 0.541 4.88E-04
GO:0005575 cellular_component cellular_component 0.318 4.11E-02
GO:0043189 cellular_component H4/H2A histone acetyltransferase complex 4.248 6.01E-02
M1
GOID Ontology Term Log odds-ratio FDR
M2
GOID Ontology Term Log odds-ratio FDR
GO:0005622 cellular_component intracellular 0.58 2.14E-02
GO:0044424 cellular_component intracellular part 0.582 2.14E-02
GO:0032991 cellular_component macromolecular complex 1.026 2.14E-02
GO:0044422 cellular_component organelle part 1.103 2.14E-02
GO:0044446 cellular_component intracellular organelle part 1.132 2.14E-02
GO:0043933 biological_process macromolecular complex subunit organization 1.838 2.23E-02
GO:0006325 biological_process chromatin organization 2 9.43E-02
M4
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 4.178 2.25E-25
GO:0034728 biological_process nucleosome organization 4.178 2.25E-25
GO:0071824 biological_process protein-DNA complex subunit organization 4.178 2.25E-25
GO:0006325 biological_process chromatin organization 3.592 4.63E-21
GO:0044427 cellular_component chromosomal part 3.434 4.57E-19
GO:0043933 biological_process macromolecular complex subunit organization 3.12 4.57E-19
GO:0005694 cellular_component chromosome 3.313 8.04E-19
GO:0051276 biological_process chromosome organization 3.123 6.95E-18
GO:0006996 biological_process organelle organization 1.997 2.44E-10
GO:0043228 cellular_component non-membrane-bounded organelle 1.876 5.12E-10
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 1.876 5.12E-10
GO:0044422 cellular_component organelle part 1.578 5.48E-09
GO:0044446 cellular_component intracellular organelle part 1.582 1.50E-08
GO:0071840 biological_process cellular component organization or biogenesis 1.585 1.50E-08
GO:0003676 molecular_function nucleic acid binding 1.416 3.47E-08
GO:0016043 biological_process cellular component organization 1.594 3.52E-08
GO:0044444 cellular_component cytoplasmic part 1.57 5.86E-06
GO:0005737 cellular_component cytoplasm 1.529 1.09E-05
GO:0043227 cellular_component membrane-bounded organelle 0.945 3.52E-05
GO:0043231 cellular_component intracellular membrane-bounded organelle 0.942 6.56E-05
GO:0043189 cellular_component H4/H2A histone acetyltransferase complex 5.083 6.56E-05
GO:0044424 cellular_component intracellular part 0.689 1.49E-04
GO:0005622 cellular_component intracellular 0.668 2.66E-04
GO:0043226 cellular_component organelle 0.785 3.27E-04
GO:0043229 cellular_component intracellular organelle 0.753 1.16E-03
GO:0000123 cellular_component histone acetyltransferase complex 4.183 1.53E-03
GO:0044699 biological_process single-organism process 0.698 1.55E-03
GO:0044763 biological_process single-organism cellular process 0.797 2.21E-03
GO:0005623 cellular_component cell 0.49 3.07E-03
GO:0044464 cellular_component cell part 0.49 3.07E-03
GO:1901363 molecular_function heterocyclic compound binding 0.741 3.97E-03
GO:0097159 molecular_function organic cyclic compound binding 0.736 4.29E-03
GO:0009987 biological_process cellular process 0.503 6.25E-03
GO:0005654 cellular_component nucleoplasm 2.624 7.65E-03
GO:0044451 cellular_component nucleoplasm part 2.624 7.65E-03
GO:0031981 cellular_component nuclear lumen 2.019 2.95E-02
GO:0043233 cellular_component organelle lumen 1.924 4.72E-02
GO:0070013 cellular_component intracellular organelle lumen 1.924 4.72E-02
GO:0031974 cellular_component membrane-enclosed lumen 1.892 5.48E-02
GO:0005575 cellular_component cellular_component 0.318 7.02E-02
S1
GOID Ontology Term Log odds-ratio FDR
GO:0000785 cellular_component chromatin 4.921 2.37E-63
GO:0034728 biological_process nucleosome organization 4.921 2.37E-63
GO:0071824 biological_process protein-DNA complex subunit organization 4.921 2.37E-63
GO:0006325 biological_process chromatin organization 4.287 4.90E-52
GO:0044427 cellular_component chromosomal part 4.177 3.04E-50
GO:0051276 biological_process chromosome organization 3.875 4.28E-49
GO:0005694 cellular_component chromosome 4.035 5.55E-49
GO:0043933 biological_process macromolecular complex subunit organization 3.677 4.21E-42
GO:0006996 biological_process organelle organization 2.559 1.16E-27
GO:0043228 cellular_component non-membrane-bounded organelle 2.283 6.42E-22
GO:0043232 cellular_component intracellular non-membrane-bounded organelle 2.283 6.42E-22
GO:0016043 biological_process cellular component organization 2.052 9.35E-21
GO:0003676 molecular_function nucleic acid binding 1.822 2.55E-20
GO:0071840 biological_process cellular component organization or biogenesis 1.973 1.34E-19
GO:0044446 cellular_component intracellular organelle part 1.946 8.29E-19
GO:0044422 cellular_component organelle part 1.875 8.93E-18
GO:1901363 molecular_function heterocyclic compound binding 1.138 5.96E-12
GO:0097159 molecular_function organic cyclic compound binding 1.132 6.92E-12
GO:0043231 cellular_component intracellular membrane-bounded organelle 1.143 6.99E-10
GO:0043227 cellular_component membrane-bounded organelle 1.094 3.42E-09
GO:0044763 biological_process single-organism cellular process 1.03 3.90E-08
GO:0043229 cellular_component intracellular organelle 0.929 9.37E-08
GO:0005622 cellular_component intracellular 0.772 1.47E-07
GO:0043226 cellular_component organelle 0.914 1.47E-07
GO:0044424 cellular_component intracellular part 0.76 4.57E-07
GO:0009987 biological_process cellular process 0.622 3.83E-06
GO:0043189 cellular_component H4/H2A histone acetyltransferase complex 4.957 3.74E-05
GO:0044699 biological_process single-organism process 0.741 4.95E-05
GO:0005737 cellular_component cytoplasm 1.309 2.72E-04
GO:0044444 cellular_component cytoplasmic part 1.302 4.40E-04
GO:0000123 cellular_component histone acetyltransferase complex 4.057 9.53E-04
GO:0005623 cellular_component cell 0.46 2.32E-03
GO:0044464 cellular_component cell part 0.46 2.32E-03
GO:0006397 biological_process mRNA processing 2.413 1.07E-02
GO:0005488 molecular_function binding 0.296 2.33E-02
GO:0016071 biological_process mRNA metabolic process 2.184 2.98E-02
GO:0005575 cellular_component cellular_component 0.29 7.16E-02
S2
GOID Ontology Term Log odds-ratio FDR
S3
GOID Ontology Term Log odds-ratio FDR
GO:0035050 biological_process embryonic heart tube development 5.872 6.94E-02
Selective sweeps in genes with experimentally verified functions
In the second column, numbers given in parentheses represent the number of genes from the relevant
group showing sweep signals (Pool-hmm score >4) in each population.
Gene(s) | Populations with Pool-hmm score>4 | Adaptive phenotype | References
Histones | A1(76), A4(77), M1(63), M2(77), M4(76), S1(79), S2(77), S3(34) | Regulation of genome-wide or regional gene expression | See the section on histones later in this article.
Chromatin remodeling factors: | | Regulation of genome-wide or regional gene expression | [1–6]
  Histone acetyltransferase PCAF | A1, A4, M1, S1, S2 | |
  Histone deacetylases* | A1(2), M1(3), M2(4), M4(1), S1(2), S2(1) | |
  Histone methyltransferases* | A1(1), A4(1), M1(2), M2(2), M4(1), S1(2), S2(2) | |
  Histone demethylase* | S3(1) | |
  Chromatin assembly factor 1, p-180 subunit | A1, A4, M1, M2, S1, S2, S3 | |
  ATP-dependent chromatin assembly factor large subunit | A1, M1, M2, S1, S2 | |
  Heterochromatin protein 1 | A1, S2, S3 | |
  Chromatin regulatory protein sir2 | M2(1), M4(1) | |
P450 cytochrome family | A1(1), M1(65), M2(11), M4(7), S1(13), S2(12), S3(18) | Insecticide resistance, Insect hormone biosynthesis (KEGG pathway #cqu00981) | [7–12]
Ribosomal proteins | A1(31), A4(32), M1(57), M2(19), M4(17), S1(70), S2(52), S3(63) | Insecticide resistance, Induction of diapause | [13–18]
Chaperonins and heat shock proteins | A1(14), A4(14), M1(21), M2(4), M4(1), S1(28), S2(26), S3(36) | Adaptation to high and low temperatures e.g. cold-resistance during diapause | [19–23]
Vitellogenins and vitellogenin convertase | A1(2), M1(4), M2(5), M4(2), S1(1), S2(1), S3(1) | Regulation of female reproduction | [24,25]
Cadherins | A1(11), A4(11), M1(13), M2(4), M4(1), S1(21), S2(21), S3(19) | Insecticide resistance, Fertility | [26,27]
Superoxide dismutases | A1(4), A4(2), M1(2), M2(1), S1(5), S2(5), S3(7) | Protection of ovaries and survival during diapause | [28]
Salivary proteins | M1(17), M2(4), M4(1), S1(4), S3(2) | Host immune response modulation | [29]
* Only genes named as histone deacetylase, histone methyltransferase or histone demethylase were counted. General deacetylases, methyltransferases or
methylases that may also act on histones were not included.
References:
1. Pecinka, A. & Mittelsten Scheid, O. 2012 Stress-induced chromatin changes: a critical view on
their heritability. Plant Cell Physiol. 53, 801–8. (doi:10.1093/pcp/pcs044)
2. Hirsch, S., Baumberger, R. & Grossniklaus, U. 2012 Epigenetic variation, inheritance, and
selection in plant populations. Cold Spring Harb. Symp. Quant. Biol. 77, 97–104.
(doi:10.1101/sqb.2013.77.014605)
3. Richards, E. J. 2008 Population epigenetics. Curr. Opin. Genet. Dev. 18, 221–6.
(doi:10.1016/j.gde.2008.01.014)
4. Levine, M. T., Eckert, M. L. & Begun, D. J. 2011 Whole-genome expression plasticity across
tropical and temperate Drosophila melanogaster populations from Eastern Australia. Mol. Biol.
Evol. 28, 249–56. (doi:10.1093/molbev/msq197)
5. Levine, M. T. & Begun, D. J. 2008 Evidence of spatially varying selection acting on four chromatin-
remodeling loci in Drosophila melanogaster. Genetics 179, 475–85.
(doi:10.1534/genetics.107.085423)
6. Kolaczkowski, B., Kern, A. D., Holloway, A. K. & Begun, D. J. 2011 Genomic differentiation
between temperate and tropical Australian populations of Drosophila melanogaster. Genetics
187, 245–60. (doi:10.1534/genetics.110.123059)
7. Sun, Y. et al. 2012 Functional characterization of an arrestin gene on insecticide resistance of
Culex pipiens pallens. Parasit. Vectors 5, 134. (doi:10.1186/1756-3305-5-134)
8. Shen, B., Dong, H.-Q., Tian, H.-S., Ma, L., Li, X.-L., Wu, G.-L. & Zhu, C.-L. 2003 Cytochrome P450
genes expressed in the deltamethrin-susceptible and -resistant strains of Culex pipiens pallens.
Pestic. Biochem. Physiol. 75, 19–26. (doi:10.1016/S0048-3575(03)00014-2)
9. Yan, L., Yang, P., Jiang, F., Cui, N., Ma, E., Qiao, C. & Cui, F. 2012 Transcriptomic and phylogenetic
analysis of Culex pipiens quinquefasciatus for three detoxification gene families. BMC Genomics
13, 609. (doi:10.1186/1471-2164-13-609)
10. Reddy, B. N., Rao, B. P., Prasad, G. & Raghavendra, K. 2012 Identification and classification of
detoxification enzymes from Culex quinquefasciatus (Diptera: Culicidae). Bioinformation 8, 430–
6. (doi:10.6026/97320630008430)
11. Itokawa, K., Komagata, O., Kasai, S., Kawada, H., Mwatele, C., Dida, G. O., Njenga, S. M.,
Mwandawiro, C. & Tomita, T. 2013 Global spread and genetic variants of the two CYP9M10
haplotype forms associated with insecticide resistance in Culex quinquefasciatus Say. Heredity
(Edinb). 111, 216–26. (doi:10.1038/hdy.2013.40)
12. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. 2012 KEGG for integration and
interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–14.
(doi:10.1093/nar/gkr988)
13. Kim, M. & Denlinger, D. L. 2010 A potential role for ribosomal protein S2 in the gene network
regulating reproductive diapause in the mosquito Culex pipiens. J. Comp. Physiol. B. 180, 171–8.
(doi:10.1007/s00360-009-0406-9)
14. Kim, M., Sim, C. & Denlinger, D. L. 2010 RNA interference directed against ribosomal protein S3a
suggests a link between this gene and arrested ovarian development during adult diapause in
Culex pipiens. Insect Mol. Biol. 19, 27–33. (doi:10.1111/j.1365-2583.2009.00926.x.RNA)
15. He, J. et al. 2009 Cloning and characterization of 60S ribosomal protein L22 (RPL22) from Culex
pipiens pallens. Comp. Biochem. Physiol. B. Biochem. Mol. Biol. 153, 216–22.
(doi:10.1016/j.cbpb.2009.03.003)
16. Hu, X., Wang, W., Zhang, D., Jiao, J., Tan, W., Sun, Y., Ma, L. & Zhu, C. 2007 Cloning and
characterization of 40S ribosomal protein S4 gene from Culex pipiens pallens. Comp. Biochem.
Physiol. B. Biochem. Mol. Biol. 146, 265–70. (doi:10.1016/j.cbpb.2006.11.011)
17. Tan, W. et al. 2006 Cloning and overexpression of ribosomal protein L39 gene from deltamethrin-
resistant Culex pipiens pallens. Exp. Parasitol. 115, 369–78.
18. Sun, H. et al. 2011 Cloning and characterization of ribosomal protein S29, a deltamethrin
resistance associated gene from Culex pipiens pallens. Parasitol. Res. 109, 1689–97.
(doi:10.1007/s00436-011-2443-z)
19. Ferrer, M., Chernikova, T. N., Yakimov, M. M., Golyshin, P. N. & Timmis, K. N. 2003 Chaperonins
govern growth of Escherichia coli at low temperatures. Nat. Biotechnol. 21, 1266–1267.
20. Fujiwara, S., Aki, R., Yoshida, M., Higashibata, H., Imanaka, T. & Fukuda, W. 2008 Expression
profiles and physiological roles of two types of molecular chaperonins from the
hyperthermophilic archaeon Thermococcus kodakarensis. Appl. Environ. Microbiol. 74, 7306–12.
(doi:10.1128/AEM.01245-08)
21. Somer, L., Shmulman, O., Dror, T., Hashmueli, S. & Kashi, Y. 2002 The eukaryote chaperonin CCT
is a cold shock protein in Saccharomyces cerevisiae. Cell Stress Chaperones 7, 47–54.
22. Kayukawa, T., Chen, B., Miyazaki, S., Itoyama, K., Shinoda, T. & Ishikawa, Y. 2005 Expression of
mRNA for the t-complex polypeptide-1, a subunit of chaperonin CCT, is upregulated in
association with increased cold hardiness in Delia antiqua. Cell Stress Chaperones 10, 204–10.
23. Kayukawa, T. & Ishikawa, Y. 2009 Chaperonin contributes to cold hardiness of the onion maggot
Delia antiqua through repression of depolymerization of actin at low temperatures. PLoS One 4,
e8277. (doi:10.1371/journal.pone.0008277)
24. Provost-Javier, K. N., Chen, S. & Rasgon, J. L. 2010 Vitellogenin gene expression in autogenous
Culex tarsalis. Insect Mol. Biol. 19, 423–9. (doi:10.1111/j.1365-2583.2010.00999.x)
25. Raikhel, A. S. et al. 2002 Molecular biology of mosquito vitellogenesis: from basic studies to
genetic engineering of antipathogen immunity. Insect Biochem. Mol. Biol. 32, 1275–86.
26. Morin, S. et al. 2003 Three cadherin alleles associated with resistance to Bacillus thuringiensis in
pink bollworm. Proc. Natl. Acad. Sci. U. S. A. 100, 5004–9. (doi:10.1073/pnas.0831036100)
27. Zhao, J., Jin, L., Yang, Y. & Wu, Y. 2010 Diverse cadherin mutations conferring resistance to
Bacillus thuringiensis toxin Cry1Ac in Helicoverpa armigera. Insect Biochem. Mol. Biol. 40, 113–8.
(doi:10.1016/j.ibmb.2010.01.001)
28. Sim, C. & Denlinger, D. L. 2011 Catalase and superoxide dismutase-2 enhance survival and
protect ovaries during overwintering diapause in the mosquito Culex pipiens. J. Insect Physiol. 57,
628–34. (doi:10.1016/j.jinsphys.2011.01.012)
29. Calvo, E., Mans, B. J., Andersen, J. F. & Ribeiro, J. M. C. 2006 Function and evolution of a
mosquito salivary protein family. J. Biol. Chem. 281, 1935–42. (doi:10.1074/jbc.M510359200)
Supplementary 2.5: Polymorphism pattern in histones
The overall polymorphism is lower in the histone block than in the background genome
(only non-sweep genes included). In general, histone H1 has a higher polymorphism level than
the more conserved nucleosomal histones but a lower level than the background. The
Nsyn/Syn ratio, however, is similar for H1 and the other histones, and is markedly higher than
that of the background.
Measure | A1 | A4 | M1 | M2 | M4 | S1 | S2 | S3
Per gene Syn, Hist block | 1.688 | 3.712 | 1.588 | 1.35 | 2.138 | 3.75 | 0.65 | 0.462
Per gene Nsyn, Hist block | 2.425 | 4.4625 | 2.0125 | 1.4625 | 1.9125 | 5.575 | 0.8375 | 0.4
Nsyn/Syn, Hist block | 1.437 | 1.202 | 1.268 | 1.083 | 0.895 | 1.487 | 1.288 | 0.865
Per gene total, Hist block | 4.1125 | 8.175 | 3.6 | 2.8125 | 4.05 | 9.325 | 1.4875 | 0.8625
Per gene Syn, H1 in block | 6.333 | 9.389 | 5.667 | 4.722 | 4.556 | 12.111 | 2.5 | 2
Per gene Nsyn, H1 in block | 8.389 | 13 | 6.833 | 6.056 | 5.222 | 16.5 | 3.389 | 1.611
Nsyn/Syn, H1 in block | 1.324 | 1.3846 | 1.206 | 1.282 | 1.146 | 1.362 | 1.355 | 0.806
Per gene total, H1 in block | 14.722 | 22.389 | 12.5 | 10.778 | 9.778 | 28.611 | 5.889 | 3.611
Per gene Syn in background | 22.18 | 21.84 | 20.4 | 11.41 | 4.56 | 21.56 | 31.59 | 29.99
Per gene Nsyn in background | 10.34 | 9.05 | 13.07 | 8.9 | 3.2 | 8.56 | 15.24 | 16.6
Nsyn/Syn in background | 0.466 | 0.414 | 0.641 | 0.780 | 0.702 | 0.397 | 0.482 | 0.554
Per gene total in background | 32.52 | 30.88 | 33.47 | 20.31 | 7.76 | 30.12 | 46.83 | 46.58
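The Nsyn/Syn rows of the table above are simple quotients of the per-gene counts. A minimal sketch of that arithmetic follows, with the values copied from the table; the dictionary layout and the function name `nsyn_syn_ratio` are illustrative assumptions, not part of the original analysis.

```python
# Per-gene polymorphism counts copied from the table above.
# Population order: A1, A4, M1, M2, M4, S1, S2, S3.
POPULATIONS = ["A1", "A4", "M1", "M2", "M4", "S1", "S2", "S3"]

syn = {
    "Hist block": [1.688, 3.712, 1.588, 1.35, 2.138, 3.75, 0.65, 0.462],
    "H1 in block": [6.333, 9.389, 5.667, 4.722, 4.556, 12.111, 2.5, 2],
    "background": [22.18, 21.84, 20.4, 11.41, 4.56, 21.56, 31.59, 29.99],
}
nsyn = {
    "Hist block": [2.425, 4.4625, 2.0125, 1.4625, 1.9125, 5.575, 0.8375, 0.4],
    "H1 in block": [8.389, 13, 6.833, 6.056, 5.222, 16.5, 3.389, 1.611],
    "background": [10.34, 9.05, 13.07, 8.9, 3.2, 8.56, 15.24, 16.6],
}


def nsyn_syn_ratio(nsyn_counts, syn_counts):
    """Return the per-population ratio of nonsynonymous to synonymous counts."""
    return [n / s for n, s in zip(nsyn_counts, syn_counts)]


ratios = {group: nsyn_syn_ratio(nsyn[group], syn[group]) for group in syn}

for group, values in ratios.items():
    formatted = ", ".join(f"{pop}={v:.3f}" for pop, v in zip(POPULATIONS, values))
    print(f"{group}: {formatted}")
```

In every population the ratio for the histone block and for H1 within the block exceeds the background ratio, which is the pattern described in the text.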
There are four histone H1 genes outside our focal 137 kb block (CPIJ010358, CPIJ010363,
CPIJ14770 and CPIJ018062). The table below presents the polymorphism data for H1 genes
within the 137 kb block and for these four outside genes. (CPIJ010363 has a sweep score>4 in M1,
and CPIJ018062 has sweep scores>4 in M1 and S2. These 3 cases were excluded from the
calculations.) Evidently, the non-sweep H1 loci outside the block have more synonymous than
nonsynonymous polymorphisms, in contrast to their counterparts within the swept block.
Group | Syn | Nsyn
H1 in the block | 851 | 1098
H1 outside the block | 352 | 322
Chi-square = 14.79 (df = 1), P ≈ 0.0001
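The chi-square statistic for this 2x2 table can be checked by hand. The sketch below assumes the reported value is an uncorrected Pearson chi-square (no continuity correction), which reproduces 14.79; the function names are illustrative and only the Python standard library is used.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Rows: H1 in the block, H1 outside the block. Columns: Syn, Nsyn.
chi2 = chi_square_2x2(851, 1098, 352, 322)
p_value = chi_square_p_value_df1(chi2)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}")
```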
Supplementary 2.6: Biochemical vs structural aspects of histone H1 polymorphisms
Description of variables:
Nsyn_YN: =1 if the residue is nonsynonymously polymorphic in at least one of the 8
populations; =0 otherwise.
Domains: C=C-terminal, G=globular, N=N-terminal
Secondary structures: E=extended (beta strand), H=helix, - = not E or H
B25: =B if less than 25% solvent accessibility; = - otherwise.
ST_YN: =1 if the residue allows addition or removal of Ser/Thr in at least one of the 8
populations; =0 otherwise.
Pro_abs: =the total number of populations with a +/- proline mutation at the given position.
Table of Domain by Nsyn_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
Domain | Nsyn_YN = 0 | Nsyn_YN = 1 | Total
C | 1373 / 40.75 / 78.41 / 49.16 | 378 / 11.22 / 21.59 / 65.63 | 1751 / 51.97
G | 995 / 29.53 / 94.76 / 35.62 | 55 / 1.63 / 5.24 / 9.55 | 1050 / 31.17
N | 425 / 12.62 / 74.82 / 15.22 | 143 / 4.24 / 25.18 / 24.83 | 568 / 16.86
Total | 2793 / 82.90 | 576 / 17.10 | 3369 / 100.00
Statistics for Table of Domain by Nsyn_YN
Statistic DF Value Prob
Chi-Square 2 155.2495 <.0001
Likelihood Ratio Chi-Square 2 182.8464 <.0001
Mantel-Haenszel Chi-Square 1 3.9720 0.0463
Phi Coefficient 0.2147
Contingency Coefficient 0.2099
Cramer's V 0.2147
Sample Size = 3369
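The association statistics listed above follow directly from the cell frequencies. As a rough sketch, the Pearson chi-square, contingency coefficient and Cramer's V for the Domain by Nsyn_YN table can be recomputed as below. The function names are illustrative assumptions; phi is returned unsigned here, whereas the 2x2 tables later in this section report a signed phi.

```python
import math


def pearson_chi_square(table):
    """Return (chi-square, sample size) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def association_measures(table):
    """Chi-square based measures of association for an r x c table."""
    chi2, n = pearson_chi_square(table)
    k = min(len(table), len(table[0])) - 1  # min(rows, columns) - 1
    return {
        "chi_square": chi2,
        "phi": math.sqrt(chi2 / n),
        "contingency": math.sqrt(chi2 / (chi2 + n)),
        "cramers_v": math.sqrt(chi2 / (n * k)),
    }


# Rows: domains C, G, N. Columns: Nsyn_YN = 0, Nsyn_YN = 1.
domain_by_nsyn = [[1373, 378], [995, 55], [425, 143]]
measures = association_measures(domain_by_nsyn)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```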
Table of Secondary by Nsyn_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
Secondary | Nsyn_YN = 0 | Nsyn_YN = 1 | Total
- | 2071 / 61.47 / 78.87 / 74.15 | 555 / 16.47 / 21.13 / 96.35 | 2626 / 77.95
E | 119 / 3.53 / 96.75 / 4.26 | 4 / 0.12 / 3.25 / 0.69 | 123 / 3.65
H | 603 / 17.90 / 97.26 / 21.59 | 17 / 0.50 / 2.74 / 2.95 | 620 / 18.40
Total | 2793 / 82.90 | 576 / 17.10 | 3369 / 100.00
Statistics for Table of Secondary by Nsyn_YN
Statistic DF Value Prob
Chi-Square 2 136.9787 <.0001
Likelihood Ratio Chi-Square 2 182.3703 <.0001
Mantel-Haenszel Chi-Square 1 130.7704 <.0001
Phi Coefficient 0.2016
Contingency Coefficient 0.1977
Cramer's V 0.2016
Sample Size = 3369
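Table of B25 by Nsyn_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
B25 | Nsyn_YN = 0 | Nsyn_YN = 1 | Total
- | 2108 / 62.57 / 80.43 / 75.47 | 513 / 15.23 / 19.57 / 89.06 | 2621 / 77.80
B | 685 / 20.33 / 91.58 / 24.53 | 63 / 1.87 / 8.42 / 10.94 | 748 / 22.20
Total | 2793 / 82.90 | 576 / 17.10 | 3369 / 100.00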
Statistics for Table of B25 by Nsyn_YN
Statistic DF Value Prob
Chi-Square 1 51.0438 <.0001
Likelihood Ratio Chi-Square 1 58.0570 <.0001
Continuity Adj. Chi-Square 1 50.2602 <.0001
Mantel-Haenszel Chi-Square 1 51.0287 <.0001
Phi Coefficient -0.1231
Contingency Coefficient 0.1222
Cramer's V -0.1231
Fisher's Exact Test
Cell (1,1) Frequency (F) 2108
Left-sided Pr <= F 2.223E-14
Right-sided Pr >= F 1.0000
Table Probability (P) 1.395E-14
Two-sided Pr <= P 4.063E-14
Sample Size = 3369
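Table of B25 by Charge_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
B25 | Charge_YN = 0 | Charge_YN = 1 | Total
- | 2494 / 74.03 / 95.15 / 77.05 | 127 / 3.77 / 4.85 / 96.21 | 2621 / 77.80
B | 743 / 22.05 / 99.33 / 22.95 | 5 / 0.15 / 0.67 / 3.79 | 748 / 22.20
Total | 3237 / 96.08 | 132 / 3.92 | 3369 / 100.00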
Statistics for Table of B25 by Charge_YN
Statistic DF Value Prob
Chi-Square 1 26.9704 <.0001
Likelihood Ratio Chi-Square 1 37.3255 <.0001
Continuity Adj. Chi-Square 1 25.8723 <.0001
Mantel-Haenszel Chi-Square 1 26.9624 <.0001
Phi Coefficient -0.0895
Contingency Coefficient 0.0891
Cramer's V -0.0895
Fisher's Exact Test
Cell (1,1) Frequency (F) 2494
Left-sided Pr <= F 1.622E-09
Right-sided Pr >= F 1.0000
Table Probability (P) 1.416E-09
Two-sided Pr <= P 2.837E-09
Sample Size = 3369
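The left-sided Fisher probability reported above is a hypergeometric tail: the probability that cell (1,1) is at most its observed value when all margins are held fixed. The following is a minimal sketch under that assumption, using exact integer arithmetic from the standard library; `fisher_left_tail` is an illustrative name and not the routine used to produce the output above.

```python
from fractions import Fraction
from math import comb


def fisher_left_tail(a, b, c, d):
    """P(cell (1,1) <= a) for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    row1 = a + b            # first row total
    col1 = a + c            # first column total
    col2 = b + d            # second column total
    n = a + b + c + d
    total = comb(n, row1)   # ways to choose the first row from all residues
    k_min = max(0, row1 - col2)  # smallest feasible value of cell (1,1)
    tail = sum(comb(col1, k) * comb(col2, row1 - k) for k in range(k_min, a + 1))
    return float(Fraction(tail, total))


# B25 by Charge_YN: rows "-" (exposed) and "B" (buried); columns Charge_YN = 0, 1.
p_left = fisher_left_tail(2494, 127, 743, 5)
print(f"Left-sided Pr <= F: {p_left:.3e}")
```

Because only six values of cell (1,1) lie at or below the observed 2494, the sum has six terms and exact arithmetic is cheap despite the large binomial coefficients.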
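Table of B25 by Pro_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
B25 | Pro_YN = 0 | Pro_YN = 1 | Total
- | 2390 / 70.94 / 91.19 / 76.87 | 231 / 6.86 / 8.81 / 88.85 | 2621 / 77.80
B | 719 / 21.34 / 96.12 / 23.13 | 29 / 0.86 / 3.88 / 11.15 | 748 / 22.20
Total | 3109 / 92.28 | 260 / 7.72 | 3369 / 100.00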
Statistics for Table of B25 by Pro_YN
Statistic DF Value Prob
Chi-Square 1 19.9113 <.0001
Likelihood Ratio Chi-Square 1 22.9453 <.0001
Continuity Adj. Chi-Square 1 19.2242 <.0001
Mantel-Haenszel Chi-Square 1 19.9054 <.0001
Phi Coefficient -0.0769
Contingency Coefficient 0.0767
Cramer's V -0.0769
Fisher's Exact Test
Cell (1,1) Frequency (F) 2390
Left-sided Pr <= F 1.405E-06
Right-sided Pr >= F 1.0000
Table Probability (P) 8.371E-07
Two-sided Pr <= P 2.620E-06
Sample Size = 3369
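The calculations in the table below address the following question:
given that a residue is known to allow nonsynonymous polymorphism, is it less permissive to
charge alteration if it is buried within the protein? The answer is yes. The proportion of
nonsynonymously polymorphic residues that allow a proline change is almost the same for buried
and exposed residues (46.08% vs 45.02%), whereas the proportion that allow a charge alteration
is much lower for buried residues (7.96% vs 24.78%).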
Measure | Buried | Exposed | Source (this file)
N = % allow Nsyn change | 8.42 | 19.57 | Page 4
C = % allow charge alteration | 0.67 | 4.85 | Page 5
P = % allow +/- proline | 3.88 | 8.81 | Page 6
% C/N | 7.96 | 24.78 |
% P/N | 46.08 | 45.02 |
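The last two rows of the table are conditional percentages obtained by dividing the row percentages from the preceding contingency tables. A short sketch of that arithmetic, with illustrative variable names:

```python
# Row percentages taken from the B25 contingency tables (buried = B, exposed = -).
row_pct = {
    "buried": {"nsyn": 8.42, "charge": 0.67, "proline": 3.88},
    "exposed": {"nsyn": 19.57, "charge": 4.85, "proline": 8.81},
}


def conditional_pct(part, whole):
    """Percentage of residues with property `part` among those with property `whole`."""
    return 100.0 * part / whole


summary = {}
for accessibility, pct in row_pct.items():
    summary[accessibility] = {
        "C/N": conditional_pct(pct["charge"], pct["nsyn"]),
        "P/N": conditional_pct(pct["proline"], pct["nsyn"]),
    }
    print(accessibility, {k: round(v, 2) for k, v in summary[accessibility].items()})
```

The P/N values differ by about one percentage point between buried and exposed residues, while C/N differs roughly threefold, which is the contrast the text relies on.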
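Table of B25 by ST_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
B25 | ST_YN = 0 | ST_YN = 1 | Total
- | 2490 / 73.91 / 95.00 / 77.26 | 131 / 3.89 / 5.00 / 89.73 | 2621 / 77.80
B | 733 / 21.76 / 97.99 / 22.74 | 15 / 0.45 / 2.01 / 10.27 | 748 / 22.20
Total | 3223 / 95.67 | 146 / 4.33 | 3369 / 100.00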
Statistics for Table of B25 by ST_YN
Statistic DF Value Prob
Chi-Square 1 12.5718 0.0004
Likelihood Ratio Chi-Square 1 14.7985 0.0001
Continuity Adj. Chi-Square 1 11.8602 0.0006
Mantel-Haenszel Chi-Square 1 12.5680 0.0004
Phi Coefficient -0.0611
Contingency Coefficient 0.0610
Cramer's V -0.0611
Fisher's Exact Test
Cell (1,1) Frequency (F) 2490
Left-sided Pr <= F 1.064E-04
Right-sided Pr >= F 1.0000
Table Probability (P) 6.705E-05
Two-sided Pr <= P 2.146E-04
Sample Size = 3369
Table of Domain by Pro_YN
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
Domain | Pro_YN = 0 | Pro_YN = 1 | Total
C | 1576 / 46.78 / 90.01 / 50.69 | 175 / 5.19 / 9.99 / 67.31 | 1751 / 51.97
G | 1022 / 30.34 / 97.33 / 32.87 | 28 / 0.83 / 2.67 / 10.77 | 1050 / 31.17
N | 511 / 15.17 / 89.96 / 16.44 | 57 / 1.69 / 10.04 / 21.92 | 568 / 16.86
Total | 3109 / 92.28 | 260 / 7.72 | 3369 / 100.00
Statistics for Table of Domain by Pro_YN
Statistic DF Value Prob
Chi-Square 2 54.6410 <.0001
Likelihood Ratio Chi-Square 2 65.0931 <.0001
Mantel-Haenszel Chi-Square 1 5.2580 0.0218
Phi Coefficient 0.1274
Contingency Coefficient 0.1263
Cramer's V 0.1274
Sample Size = 3369
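Table of Domain by Pro_abs
Cell contents: Frequency / Percent / Row Pct / Col Pct (Total cells: Frequency / Percent)
Domain | Pro_abs = 0 | Pro_abs = 1 | Pro_abs = 2 | Pro_abs = 3 | Pro_abs = 4 | Pro_abs = 5 | Pro_abs = 6 | Total
C | 1576 / 46.78 / 90.01 / 50.69 | 116 / 3.44 / 6.62 / 68.64 | 27 / 0.80 / 1.54 / 60.00 | 20 / 0.59 / 1.14 / 68.97 | 10 / 0.30 / 0.57 / 71.43 | 2 / 0.06 / 0.11 / 100.00 | 0 / 0.00 / 0.00 / 0.00 | 1751 / 51.97
G | 1022 / 30.34 / 97.33 / 32.87 | 6 / 0.18 / 0.57 / 3.55 | 11 / 0.33 / 1.05 / 24.44 | 6 / 0.18 / 0.57 / 20.69 | 4 / 0.12 / 0.38 / 28.57 | 0 / 0.00 / 0.00 / 0.00 | 1 / 0.03 / 0.10 / 100.00 | 1050 / 31.17
N | 511 / 15.17 / 89.96 / 16.44 | 47 / 1.40 / 8.27 / 27.81 | 7 / 0.21 / 1.23 / 15.56 | 3 / 0.09 / 0.53 / 10.34 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 0 / 0.00 / 0.00 / 0.00 | 568 / 16.86
Total | 3109 / 92.28 | 169 / 5.02 | 45 / 1.34 | 29 / 0.86 | 14 / 0.42 | 2 / 0.06 | 1 / 0.03 | 3369 / 100.00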
Statistics for Table of Domain by Pro_abs
Statistic DF Value Prob
Chi-Square 12 78.7084 <.0001
Likelihood Ratio Chi-Square 12 105.0427 <.0001
Mantel-Haenszel Chi-Square 1 8.1461 0.0043
Phi Coefficient 0.1528
Contingency Coefficient 0.1511
Cramer's V 0.1081
WARNING: 43% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
Sample Size = 3369
Among the residues allowing Pro mutations, the proportion showing Pro mutations in more
than one population for each of the 3 domains:
C: 59/175=33.71% G: 22/28=78.57% N: 10/57=17.54%
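These proportions can be recovered from the Pro_abs frequency columns of the Domain by Pro_abs table: residues with Pro_abs of 2 or more, divided by residues with Pro_abs of 1 or more. A brief sketch, with illustrative names:

```python
# Frequencies of Pro_abs = 0, 1, 2, 3, 4, 5, 6 for each domain (from the table above).
pro_abs_counts = {
    "C": [1576, 116, 27, 20, 10, 2, 0],
    "G": [1022, 6, 11, 6, 4, 0, 1],
    "N": [511, 47, 7, 3, 0, 0, 0],
}


def shared_proline_fraction(counts):
    """Return (multi-population sites, any-population sites, their ratio)."""
    any_population = sum(counts[1:])    # Pro_abs >= 1
    multi_population = sum(counts[2:])  # Pro_abs >= 2
    return multi_population, any_population, multi_population / any_population


shared = {domain: shared_proline_fraction(c) for domain, c in pro_abs_counts.items()}
for domain, (multi, total, fraction) in shared.items():
    print(f"{domain}: {multi}/{total} = {100 * fraction:.2f}%")
```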
Supplementary 2.7: Independence of polymorphic positions between C. torrentium and C. pipiens
Description of variables:
%sample_Nsyn: =1 if the residue shows nonsynonymous polymorphism in the corresponding
sample; =0 otherwise.
%sample_Pro: =+1 if a mutation converts a non-proline reference residue to proline; = -1 if a
mutation converts the reference proline residue to a different amino acid; =0 otherwise.
P-values associated with Fisher’s exact test for independence of positions of nonsynonymous polymorphisms
Fisher’s exact test, N = 3369. Cell contents: Total probability (P) / Pr <= P
 | A1_Nsyn | A4_Nsyn | M1_Nsyn | M2_Nsyn | M4_Nsyn | S1_Nsyn | S2_Nsyn | S3_Nsyn
A1_Nsyn | 4.19E-220 / 1.07E-225 | 4.751E-23 / 5.060E-24 | 8.184E-43 / 2.012E-44 | 0.1601 / 0.3627 | 0.2310 / 1.0000 | 1.538E-18 / 2.164E-19 | 1.775E-27 / 5.391E-29 | 7.907E-11 / 8.269E-11
A4_Nsyn | | 0.0000 / 0.0000 | 1.166E-13 / 1.329E-13 | 1.311E-05 / 1.691E-05 | 0.0030 / 0.0044 | 6.188E-38 / 6.369E-39 | 0.0026 / 0.0036 | 4.947E-08 / 5.388E-08
M1_Nsyn | | | 1.76E-172 / 6.02E-178 | 0.0032 / 0.0041 | 0.2471 / 0.7247 | 5.695E-11 / 6.737E-11 | 7.868E-21 / 2.772E-22 | 1.665E-15 / 1.592E-15
M2_Nsyn | | | | 5.11E-155 / 1.99E-160 | 7.581E-05 / 8.954E-05 | 4.158E-05 / 5.545E-05 | 0.2464 / 0.6493 | 0.5055 / 1.0000
M4_Nsyn | | | | | 1.92E-161 / 7.14E-167 | 2.099E-05 / 2.788E-05 | 0.2250 / 0.4033 | 0.4879 / 1.0000
S1_Nsyn | | | | | | 0.0000 / 0.0000 | 1.030E-10 / 1.170E-10 | 4.531E-04 / 5.415E-04
S2_Nsyn | | | | | | | 6.27E-125 / 3.16E-130 | 2.474E-16 / 4.092E-18
S3_Nsyn | | | | | | | | 4.871E-67 / 5.029E-72
Note: Since 28 pairwise comparisons are made, the significance threshold should be considered
0.05/28=0.0018 after Bonferroni correction.
Nonsignificant tests indicate independent distribution of nonsynonymous positions in histone
H1 residues between the compared populations. The proportion of non-significant tests
(highlighted on the upper half of the table):
For intraspecific comparisons: 1/16.
For interspecific comparisons: 9/12.
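The Bonferroni threshold quoted in the note comes from the number of unordered pairs among the eight populations. A minimal sketch of that calculation follows; the helper `is_significant` is an illustrative addition.

```python
from itertools import combinations

POPULATIONS = ["A1", "A4", "M1", "M2", "M4", "S1", "S2", "S3"]
ALPHA = 0.05

# Every unordered pair of distinct populations is one test.
pairs = list(combinations(POPULATIONS, 2))
threshold = ALPHA / len(pairs)


def is_significant(p_value, cutoff=threshold):
    """True if a p-value passes the Bonferroni-corrected cutoff."""
    return p_value < cutoff


print(f"{len(pairs)} pairwise tests, corrected threshold = {threshold:.4f}")
```

For example, a p-value of 0.0044 (A4 vs M4 in the table above) does not pass this cutoff, although it would pass an uncorrected 0.05 threshold.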
P-values associated with Fisher’s exact test for independence of positions of proline polymorphisms
Fisher’s exact test, N = 3369. Cell contents: Total probability (P) / Pr <= P
 | A1_Pro | A4_Pro | M1_Pro | M2_Pro | M4_Pro | S1_Pro | S2_Pro | S3_Pro
A1_Pro | 6.66E-115 / 6.66E-115 | 2.371E-27 / 2.221E-26 | 5.173E-37 / 1.830E-36 | 0.2508 / 0.4027 | 0.0506 / 0.0828 | 2.852E-19 / 6.624E-18 | 2.328E-14 / 3.839E-13 | 8.659E-11 / 3.743E-10
A4_Pro | | 4.701E-212 / 4.701E-212 | 2.044E-18 / 1.905E-17 | 0.0014 / 0.0094 | 1.332E-06 / 4.677E-06 | 1.257E-42 / 3.889E-41 | 1.334E-06 / 1.779E-05 | 6.264E-10 / 3.021E-09
M1_Pro | | | 3.932E-108 / 3.932E-108 | 1.425E-04 / 7.671E-04 | 0.2595 / 0.3299 | 2.376E-10 / 4.736E-09 | 2.177E-13 / 2.051E-12 | 6.238E-11 / 2.546E-10
M2_Pro | | | | 1.310E-89 / 1.310E-89 | 0.0018 / 0.0037 | 1.860E-04 / 0.0021 | 0.0011 / 0.0057 | 0.8637 / 1.0000
M4_Pro | | | | | 4.871E-67 / 5.029E-72 | 1.057E-07 / 3.769E-07 | 0.7980 / 1.0000 | 0.8858 / 1.0000
S1_Pro | | | | | | 2.928E-223 / 2.928E-223 | 7.882E-08 / 1.444E-06 | 4.601E-04 / 0.0015
S2_Pro | | | | | | | 1.302E-70 / 1.302E-70 | 4.391E-11 / 2.017E-10
S3_Pro | | | | | | | | 2.636E-40 / 2.636E-40
Note: Since 28 pairwise comparisons are made, the significance threshold should be considered
0.05/28=0.0018 after Bonferroni correction.
Nonsignificant tests indicate independent distribution of nonsynonymous positions adding or
removing proline in histone H1 residues between the compared populations. The proportion of
non-significant tests (highlighted on the upper half of the table):
For intraspecific comparisons: 1/16.
For interspecific comparisons: 9/12.
Correlation analysis of nonsynonymous polymorphic sites across populations
Spearman Correlation Coefficients, N = 3369. Cell contents: coefficient (Prob > |r| under H0: Rho=0)
 | A1_Nsyn | A4_Nsyn | M1_Nsyn | M2_Nsyn | M4_Nsyn | S1_Nsyn | S2_Nsyn | S3_Nsyn
A1_Nsyn | 1.00000 | 0.23978 (<.0001) | 0.41580 (<.0001) | 0.01271 (0.4607) | 0.00057 (0.9737) | 0.20579 (<.0001) | 0.33583 (<.0001) | 0.18922 (<.0001)
A4_Nsyn | 0.23978 (<.0001) | 1.00000 | 0.17131 (<.0001) | 0.09054 (<.0001) | 0.05457 (0.0015) | 0.29738 (<.0001) | 0.05782 (0.0008) | 0.13505 (<.0001)
M1_Nsyn | 0.41580 (<.0001) | 0.17131 (<.0001) | 1.00000 | 0.06079 (0.0004) | -0.01401 (0.4164) | 0.14524 (<.0001) | 0.29968 (<.0001) | 0.26515 (<.0001)
M2_Nsyn | 0.01271 (0.4607) | 0.09054 (<.0001) | 0.06079 (0.0004) | 1.00000 | 0.09096 (<.0001) | 0.08285 (<.0001) | 0.00912 (0.5968) | -0.01435 (0.4052)
M4_Nsyn | 0.00057 (0.9737) | 0.05457 (0.0015) | -0.01401 (0.4164) | 0.09096 (<.0001) | 1.00000 | 0.08636 (<.0001) | -0.02127 (0.2171) | -0.01472 (0.3931)
S1_Nsyn | 0.20579 (<.0001) | 0.29738 (<.0001) | 0.14524 (<.0001) | 0.08285 (<.0001) | 0.08636 (<.0001) | 1.00000 | 0.14950 (<.0001) | 0.07617 (<.0001)
S2_Nsyn | 0.33583 (<.0001) | 0.05782 (0.0008) | 0.29968 (<.0001) | 0.00912 (0.5968) | -0.02127 (0.2171) | 0.14950 (<.0001) | 1.00000 | 0.32757 (<.0001)
S3_Nsyn | 0.18922 (<.0001) | 0.13505 (<.0001) | 0.26515 (<.0001) | -0.01435 (0.4052) | -0.01472 (0.3931) | 0.07617 (<.0001) | 0.32757 (<.0001) | 1.00000
Note: Since 28 pairwise comparisons are made, the significance threshold should be considered
0.05/28=0.0018 after Bonferroni correction.
Nonsignificant correlations indicate independent distribution of nonsynonymous positions in
histone H1 residues between the compared populations. The proportion of non-significant
correlations (highlighted on the upper half of the table):
For intraspecific comparisons: 0/16.
For interspecific comparisons: 7/12.
Notice that correlations are always positive when they are significant.
Correlation analysis of proline polymorphic sites across populations
Spearman Correlation Coefficients, N = 3369. Cell contents: coefficient (Prob > |r| under H0: Rho=0)
 | A1_Pro | A4_Pro | M1_Pro | M2_Pro | M4_Pro | S1_Pro | S2_Pro | S3_Pro
A1_Pro | 1.00000 | 0.33485 (<.0001) | 0.49999 (<.0001) | 0.01659 (0.3357) | 0.04370 (0.0112) | 0.25847 (<.0001) | 0.27686 (<.0001) | 0.26356 (<.0001)
A4_Pro | 0.33485 (<.0001) | 1.00000 | 0.25862 (<.0001) | 0.07098 (<.0001) | 0.13079 (<.0001) | 0.38687 (<.0001) | 0.12594 (<.0001) | 0.20090 (<.0001)
M1_Pro | 0.49999 (<.0001) | 0.25862 (<.0001) | 1.00000 | 0.06716 (<.0001) | 0.01766 (0.3055) | 0.16754 (<.0001) | 0.25617 (<.0001) | 0.27212 (<.0001)
M2_Pro | 0.01659 (0.3357) | 0.07098 (<.0001) | 0.06716 (<.0001) | 1.00000 | 0.08811 (<.0001) | 0.08575 (<.0001) | 0.06233 (0.0003) | -0.00405 (0.8141)
M4_Pro | 0.04370 (0.0112) | 0.13079 (<.0001) | 0.01766 (0.3055) | 0.08811 (<.0001) | 1.00000 | 0.14585 (<.0001) | -0.00506 (0.7692) | -0.00516 (0.7647)
S1_Pro | 0.25847 (<.0001) | 0.38687 (<.0001) | 0.16754 (<.0001) | 0.08575 (<.0001) | 0.14585 (<.0001) | 1.00000 | 0.14201 (<.0001) | 0.09372 (<.0001)
S2_Pro | 0.27686 (<.0001) | 0.12594 (<.0001) | 0.25617 (<.0001) | 0.06233 (0.0003) | -0.00506 (0.7692) | 0.14201 (<.0001) | 1.00000 | 0.31279 (<.0001)
S3_Pro | 0.26356 (<.0001) | 0.20090 (<.0001) | 0.27212 (<.0001) | -0.00405 (0.8141) | -0.00516 (0.7647) | 0.09372 (<.0001) | 0.31279 (<.0001) | 1.00000
Note: Since 28 pairwise comparisons are made, the significance threshold should be considered
0.05/28=0.0018 after Bonferroni correction.
Nonsignificant correlations (indicating independent distribution of positions of proline
polymorphisms in histone H1 residues between the compared populations) according to this
threshold are highlighted on the upper half of the table. The proportion of non-significant
correlations:
For intraspecific comparisons: 0/16.
For interspecific comparisons: 6/12.
Notice that correlations are always positive when they are significant.
Chapter 3. Toward high resolution population genomics of archaeological samples: A review
Summary
The term ‘ancient DNA’ (aDNA) is coming of age, with over 1200 hits in the PubMed database,
beginning in the early 1980s with the studies of ‘molecular paleontology.’ Rooted in cloning and
limited sequencing of DNA from ancient remains during the pre-PCR era, the field has made
incredible progress since the introduction of PCR and next-generation sequencing. Over the last
decade, aDNA analysis ushered in a new era in genomics and became the method of choice for
reconstructing the history of organisms, their biogeography, and migration routes, with
applications in evolutionary biology, population genetics, archaeogenetics, paleo-epidemiology,
and many other areas. This change was brought about by the development of new strategies for coping
with the challenges in studying aDNA due to damage and fragmentation, scarce samples,
significant historical gaps, and limited applicability of population genetics methods. In this
review, we describe the state-of-the-art achievements in aDNA studies, with particular focus on
human evolution and demographic history. We present the current experimental and
theoretical procedures for handling and analysing highly degraded aDNA. We also review the
challenges in the rapidly growing field of ancient epigenomics. Advancement of ancient DNA
tools and methods signifies a new era in population genetics and evolutionary medicine
research.
Keywords: ancient DNA, bioinformatics, epigenetics, population genetics, next generation
sequencing
3.1. Ancient DNA as an indispensable source of information
The passion for unravelling and reconstructing the history of life on Earth has always stimulated
research in evolutionary biology. Although inferences of past events such as the states of
ancestral organisms (e.g. ancestral sequences), evolutionary episodes (e.g. speciation), and the
dynamics governing change (e.g. mutation models) can be obtained through computational
phylogenetic and coalescent approaches using contemporary data, naturalists have always
valued direct observation above all other methods. Ancient DNA (aDNA) is thus expected to
revolutionize evolutionary genetics in the same manner that a systematic approach to the
analysis of fossil records revolutionized palaeontology: it is a direct window into the past ‒ a
‘time capsule’. aDNA has already been invaluable in addressing many key questions in
evolutionary biology,
1-14
frequently providing the only available evidence.
Studies over the past several decades have demonstrated that aDNA can survive and be
extracted from ancient and historical material (e.g. bones, teeth, eggshells; mummified, frozen,
or artificially preserved tissues). The first attempts to extract and analyse aDNA were
performed before the PCR era. In a pioneering study in 1984, Higuchi et al.
15
managed to recover
DNA using bacterial cloning from dried muscle of quagga, an extinct subspecies of plains zebra
(Equus quagga). However, due to extremely poor DNA preservation, analyses of aDNA were
limited until an effective technology for DNA amplification, like PCR, made very small amounts
of DNA accessible for study. In addition, next-generation sequencing (NGS) technologies and
the plummeting cost of DNA sequencing have provided an unprecedented opportunity to
perform millions of sequencing reactions in parallel. These advances enabled the first report on
ancient sequences retrieved by NGS in 2004.
16
Major milestones of development of high
resolution ancient human genomics
1,2,5,11,12,15,17-29
are shown in Fig. 3.1.
Figure 3.1. Major milestones of development of high resolution ancient human genomics.
3.1.1. Human evolution and demographic history
Over the past decade, genomic techniques have been reshaping our fundamental
understanding of human prehistory and origins.
30,31
Until recently, much of what was known
about prehistory came from the study of archaeological sites and anthropological
investigations, piecing together patterns of human migration and admixture from physical
features, pottery, weapons, ornaments, art production, traditional customs, and studies of
modern DNA.
32
Other sources of information included linguistic classifications and ancient
texts. Although undeniably powerful, these approaches often yielded more questions than
answers, and their resolution required incorporation of additional data. Analysis of ancient
human remains can reveal migration patterns,
4,10-13
address questions of kinship and family
structure,
33
and provide insight into physiological or morphological characteristics such as blood
group, skin colour, hair type,
34-37
and climatic adaptation.
2
When combined with other
evidence, sequencing ancient genomes could help settle important debates within archaeology
or linguistics. This approach, although not infallible, is particularly valuable now when ancient
genetics is considered to be a highly robust tool and has significantly impacted many fields such
as forensics and history.
38
Sequencing of genomes from archaic hominids has illuminated earlier events in human
evolution and suggested that early hominids had a richer evolutionary history than was
previously appreciated. Analysis of Neanderthal genomes extracted from the remains found in
Europe and Western and Central Asia and dated 230–30 thousand years ago (kya)
demonstrated that contrary to previous suggestions, Neanderthals and anatomically modern
humans (AMH) may have interbred.
1,8,39-46
The studies showed that Neanderthals share more
genetic material with modern humans across Eurasia than those from sub-Saharan Africa,
indicating that gene flow from Neanderthals to Eurasian AMH likely occurred after the
emergence of humans from Africa but before the divergence of Eurasian groups.
1,8
Additional
gene flow events may have occurred later in Europe
44
and East Asia.
41,47
Mitochondrial DNA
(mtDNA) sequences of morphologically ambiguous Neanderthal bones from Teshik-Tash cave in
Uzbekistan and Okladnikov cave in Southern Siberia provided evidence that Neanderthals had
an extensive range prior to their extinction.
48
Since the first description of the Neanderthal
genome, a number of studies have suggested that various Neanderthal alleles have been
preferentially retained in modern populations due to specific selective pressures.
49
Remarkably,
the proportion of Neanderthal ancestry in Eurasians has decreased substantially since the
Palaeolithic, from 4–6% to 1–2% today, suggesting that negative selection against Neanderthal
alleles is at work.
28
Genetic analysis of mtDNA from a phalanx dated 48–30 kya recovered from the Denisova cave
in Southern Siberia revealed another hominid, named Denisovan, which is genetically distinct
from Neanderthals and modern humans.
50
Since then, only two more samples (molars) of
Denisovans have been discovered
3
and recently sequenced.
51
Comparison of Neanderthal
1,8
and Denisovan
3,51
genomes suggested that for a long time their population histories were
independent of each other. The Denisovan mtDNA represents a deep branch, with the
Neanderthal mtDNA closer to that of modern humans
3
. Comparative analysis of the Denisovan
and modern human genomes revealed that the genetic contribution from Denisovans to
modern humans may have been restricted to Melanesia and Australia with hybridization events
taking place mostly in the Southeast Asian mainland, although they may have permeated to
Oceania
3,4,50,52,53
as recently suggested by the existence of a widespread, low-level signal of
Denisovan ancestry across South and East Asian and Native American populations.
54
However,
the exact scenario is hard to identify.
55,56
NGS analysis of a nearly complete mitochondrial genome of a hominid found in Sima de los
Huesos cave in Atapuerca, Spain and dated to more than 300 kya
57
suggested the existence of
another branch in the human evolutionary tree. Surprisingly, the Sima de los Huesos mtDNA
forms a clade with the mitochondrial genome of Denisovans rather than that of Neanderthals,
demonstrating an unexpected link between Denisovans and Middle Pleistocene European
hominids. Recently, approximately 3 million bases of nuclear sequences were obtained from a
Sima de los Huesos femur fragment, an incisor, and a molar.
27
In contrast to the mtDNA, the
nuclear genomic sequences of Sima de los Huesos are significantly more similar to
Neanderthals than to Denisovans.
27
These results agree with previous morphological
analyses
58,59
but present an archaeological puzzle.
Studies of aDNA have also delineated human migration routes around the world, particularly in
Europe. Analysis of human genomes from Europe and Siberia dated 24–5 kya
7,10,14,34,35
revealed
at least three different sources of the population diversity of modern Europeans, i.e. West
European hunter-gatherers, ancient North Eurasians with high similarity to Upper Palaeolithic
Siberians, and early European farmers originating in the Near East.
7,60
Further aDNA studies allowed mapping migration in Europe in greater detail. A recent study of
69 European individuals who lived 8–3 kya
12
demonstrated that between 8 and 5 kya
populations of Western and Eastern Europe were genetically distinct. Groups of early farmers
of Near Eastern origin
60
arrived in Western Europe and mixed with local hunter-gatherers,
whereas Eastern Europe at that time was inhabited by a distant branch of ancient North
Eurasian hunter-gatherers.
10
However, Eastern Europe did not remain a ‘hunter-gatherer’s
refuge’ for too long. Around 6–5 kya, farming populations of West Anatolian ancestry appeared
in Eastern Europe and mixed with local hunter-gatherers in the Pontic-Caspian region, giving
rise to pastoralist people of the extremely successful Yamnaya archaeological culture. Such
multiethnic melting pots were fertile ground for many innovations, such as horse domestication
and wheeled vehicles from the Yamnaya culture,
61
which probably enabled massive migration
or invasions into Western and Northern Europe approximately 4.5 kya, introducing their
ancestry, languages, and customs. Haak et al. reported that this steppe ancestry persisted in
central Europeans until at least 3 kya and is ubiquitous in present-day Europeans.
12
At
approximately the same time, similar migrations spread Yamnaya-related cultures into South
Siberia and Central Asia, as revealed by another large-scale study of 101 genomes from
Eurasian Bronze Age (5–3 kya) burial sites.
11
Such large-scale aDNA studies
11,12,60
have not only
made technical breakthroughs but also had significant interdisciplinary effects: they influenced
the decades-long debate in archaeology and linguistics about the origin of Indo-European
language speakers and shed light on perennial questions about the prevalence of traits like skin
colour and lactose intolerance in modern Europeans.
With the progress in aDNA sequencing technology, notable studies of remains from all over the
world have begun to emerge,
62-64
shedding light on, for example, the settlement of China and the
Pacific islands. Recently, Liu et al. reported the discovery of an 80,000-yr-old man (the earliest
modern human in southern China) raising questions about canonical paradigms of human
dissemination, since there is no evidence that humans entered Europe before 45 kya.
65
Sequencing of this individual would provide additional insights into the dispersal of humans in
Eurasia. A recent study of a 4,500-yr-old Ethiopian skeleton preserved in relatively cool
mountainous conditions was the first example of successful aDNA analysis in Africa,
26
giving
hope to forthcoming studies of this incredibly interesting region.
Pending future advances in functional genomics, aDNA might prove an unrivalled source of
information on the evolution of traits associated with cognitive phenotypes. For example,
discovering the genetic variants responsible for language acquisition may allow researchers to
pinpoint the origin of complex language in the human lineage, indubitably a cornerstone event
in human evolution. Approximately a decade ago the Neanderthals were found to bear a
modern human version of the FOXP2 gene
22
(likely responsible for the ability to speak
66
). The
authors suggested that the modern variant of FOXP2 was present in the common ancestor of
Neanderthals and modern humans.
22
We can also expect aDNA genomic studies to provide direct evidence about human adaptation
substantiating the genetic basis of selection. For example, a genome-wide scan of 230 West
Eurasians who lived 6.5–1 kya and their comparison with modern human genomes identified
significant signatures of selection in a range of loci related to diet (lactase persistence, fatty acid
metabolism, vitamin D levels, and some diet-associated diseases), pathogen resistance, and
externally visible phenotypes (skin and eye pigmentation, tooth morphology, hair thickness,
and body height).
60
This work demonstrated the utility of aDNA data in human adaptive
evolution studies. The currently available set of published human aDNA NGS data, including
sample IDs, dating, archaeological cultures, site names and locations, references, and links to
data repositories, is provided in Supplementary Table 3.2 and illustrated in Fig. 3.2.
Figure 3.2. Geographic distribution of existing whole genome aDNA sequences
3.1.2. Historic patterns in the spread of infectious diseases
Some devastating pandemics, like the Black Death, remain infamous centuries after they
struck. aDNA enables discovery of the origin and spread of disease-carrying alleles to aid
modern epidemiology. Such analyses are possible when genotypes of ancient humans are
recovered along with the genomes of their pathogens. For example, Rasmussen et al.
sequenced DNA extracted from ancient human teeth and found that Yersinia pestis, the
etiological agent of plague, infected humans in Bronze Age Eurasia as early as 5 kya, three
millennia before the first historical records of plague.
67
The authors concluded that the
bacterium became the highly virulent, flea-borne bubonic plague strain only about 3 kya by
acquiring specific genetic changes.
67
Analysis of Mycobacterium tuberculosis genomes from remains of ancient humans and animals
helped in deciphering the origin and dispersal of tuberculosis in human populations. aDNA
studies provided support for the hypothesis that the appearance of tuberculosis in humans was
not connected to animal domestication, as had been suggested before. On the contrary, the M.
tuberculosis strain in humans is the most ancient one, and the other tuberculosis strains causing
animal diseases evolved from the human strain.
68
Tuberculosis spread with humans and
evolved in local conditions.
69,70
The most ancient human M. tuberculosis strain found so far was
discovered in a 9,000-yr-old Pre-Pottery Neolithic settlement in the Eastern Mediterranean
66
where, in spite of the presence of quantities of bovine bones, no signs of the bovine strain, M.
bovis, were found. Discovery of M. bovis strain in human remains from the Iron Age (as well as
animal-like Mycobacterium strains in pre-Columbian humans) showed that back-infection from
animals took place
6,67
at a later time.
Studying dental plaque of Europeans from different periods (Mesolithic, Neolithic, Bronze Age,
Early Medieval, Late Medieval, and present time) demonstrated important shifts in human oral
microbiota during recent evolution. The first shift took place in the early Neolithic period with
the introduction of farming when more caries- and periodontal disease-associated bacterial
taxa were detected. The oral microbiota composition remained stable between the Neolithic
period and medieval times. More recently, possibly during the Industrial Revolution in the nineteenth
century, cariogenic bacteria became dominant, likely due to consumption of industrially
processed flour and sugar. Consequently, the genetic diversity of the oral microbiotic
ecosystem was reduced, which contributed to the spread of chronic oral and other diseases in
countries with post-industrial lifestyles.
71
One of the most remarkable achievements in the field is the study of historical RNA. In 1997,
Taubenberger et al.
19
extracted and analysed RNA from the virus that caused the ‘Spanish flu’
pandemic that killed at least 20 million people in 1918–19.
19,72-76
Reconstruction of the viral
genome helped to reveal its origin and discover the mechanism of its exceptional virulence. In
contrast to modern influenza viruses, which require an exogenous protease for their
replication, the 1918 pandemic virus could replicate without exogenous trypsin. The ‘Spanish
flu’ viral genome contained a constellation of genes that together
contributed to the strain’s exceptionally high virulence.
75
This knowledge enabled epidemiologists to
develop a vaccination strategy against another potential pandemic virus.
74
3.2. Adaptation of experimental and computational methods to the
specific biochemistry of aDNA
After the death of an organism, all of its biomolecules are degraded either by host enzymes
released from their proper compartments or by saprobic microorganisms. Therefore, compared
with modern DNA, aDNA is present at lower concentration and is fragmented, contaminated, and
chemically modified.
16,77,78
aDNA is also commonly damaged by strand breaks and cross-linking
in addition to oxidative and hydrolytic degradation of bases or sugar residues. Relative
preservation of DNA in old samples depends on environmental circumstances, such as
temperature, humidity, pH, or oxygen, rather than the absolute age of the sample. For instance,
DNA samples extracted from frozen remains dated thousands or even hundreds of thousands
of years can be of better quality than much more recent samples.
5,79-81
Recent studies showed
that the age of ‘readable’ (by current methods) aDNA products is restricted to about 1–1.5
million years.
11,75
At present, the 560–780-thousand-year-old Middle Pleistocene horse is the
most ancient organism from which reliable aDNA data have been procured.
5
Below, we
describe methods for overcoming difficulties caused by each one of the special aDNA features
(summarized in Table 3.1).
3.2.1. Degradation
Early success with aDNA extraction and sequencing raised hopes that museum specimens,
ancient samples, and archaeological finds would provide a plethora of aDNA, but such hopes
faded when it became clear that these old samples did not yield any usable DNA.
79
Unfortunately, it is not uncommon for aDNA projects to be disbanded due to low or
undetectable DNA content
82-84
. In many other projects, the aDNA concentration is so low that it
demands destructive sampling to yield adequate sequencing coverage. Low DNA content, in turn, results in
low genomic coverage (percentage of the length of the reference genome that is covered by
mapped reads from the sample) and less reliable genotype calls. In their analysis of
Neanderthal DNA, Green et al. reported GC content to be positively correlated (r = 0.49) with
Table 3.1. Difficulties of working with ancient DNA and specialized methods developed to counter them. The
solutions aimed at one or more of the problems are not mutually exclusive and are often used in combination
for better results. Also, various bioinformatics ideas for tackling contamination and base damage are sometimes
integrated into a single Maximum Likelihood framework for base and genotype calling.

Problem: Degradation
Experimental solutions:
- Improved extraction protocols
- Using NGS approach (catches short DNA fragments)
Bioinformatics solutions:
- Algorithms based on genotype likelihoods rather than a single best genotype for low coverage genomic positions

Problem: Base damage
Experimental solutions:
- Using DNA polymerases which do not amplify through uracil (remove uracil-containing fragments from the reaction)
- Treatment with Uracil-N-glycosylase plus Endonuclease VIII (removes uracil, then repairs abasic sites)
- Single-primer Extension PCR (analyses separate DNA strands)
Bioinformatics solutions:
- Trimming 5–7 bases from read ends
- Counting and excluding C>T and G>A mutations at ultra-conserved positions
- Apparently accelerated evolution on branches leading to ancient samples
- Comparing frequencies of different classes of transitions in modern-modern and modern-ancient alignments
- Estimation of contamination or divergence based on indels and transversions only, not transitions
- Exclusion of common ancestor-ancient sample branches from calculation of divergence

Problem: Contamination
Experimental solutions:
- Special protocols for sample collection, transport and storage
- Special pre-digestion steps (including mechanical and chemical decontamination, short-time pre-incubation)
- Independent replication in two labs
- PCR-capture with species-specific primers
Bioinformatics solutions:
- Exclusion of extremely long reads or alignments (in case of 454 or Sanger sequencing)
- Phylogenetic correctness (exclusion of reads based on similarity with non-target species; inclusion of reads based on similarity with the target species or a close relative)
- Conformity to species- or ethnicity-specific variants or haplotypes
- Unique individual origin of reads from one specimen: homozygosity of X and Y positions in male specimens, absence of Y reads in female specimens, homozygosity of mtDNA positions
- Absence of haplotypes present mainly in unrelated specimens or research team members
- Distinguishing mtDNA sequences from NUMTs
retrieval success of sequence fragments,
1,23,85,86
likely due to the faster denaturation of AT-rich
regions. They also found G and T overrepresented at the 5’ and 3’ ends of break points and
suggested de-purination as a significant cause of strand breaks.
23
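The genomic coverage mentioned in this paragraph (the percentage of the reference genome length covered by mapped reads) can be illustrated with a minimal Python sketch. The function name, the 0-based end-exclusive interval convention, and the toy coordinates in the usage note are assumptions made for illustration only; they are not taken from any of the cited pipelines.

```python
def breadth_of_coverage(read_intervals, reference_length):
    """Return the fraction of the reference covered by at least one mapped read.

    read_intervals: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumed convention for this sketch).
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(read_intervals):
        # Clip each interval to the bounds of the reference.
        start, end = max(start, 0), min(end, reference_length)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # A gap was found: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent read: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / reference_length
```

For example, three reads mapped at (0, 50), (40, 100) and (200, 260) on a 1,000 bp reference cover 160 positions, so the function returns 0.16 (16% genomic coverage).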
Some of the difficulties in working with aDNA were resolved by technological breakthroughs.
Improvement of extraction protocols can substantially increase the quantity and quality of
aDNA. Thus, modern protocols
4,87
enable extraction and analysis of very short fragments (less
than 50–60 bp, which constitute the vast majority of aDNA). DNA fragmentation posed
difficulties for conventional PCR, which requires amplification of a large number of overlapping
fragments to cover a relatively long fragment of DNA, and it is impossible to sequence very
short fragments (50–70 bp) using Sanger sequencing. However, NGS technologies generate
short reads for any DNA. The average retrieved sequence length in most aDNA projects is 50–
100 bp, which is the same order of magnitude as the length of reads produced by many current
NGS instruments.
Fragmentation and decay of DNA is a natural occurrence not only postmortem but also in vivo.
Spontaneous DNA degradation caused by damaging and mutagenic factors is prevented by DNA
repair mechanisms that are not present after death. However, controlled DNA degradation in
living organisms is implemented during programmed cell death (apoptosis) and differentiation of
certain cell types (i.e. erythroid, lens and hair cortical cells). A large family of DNase enzymes
performs the DNA degradation vital for proper development and functioning of living tissues.
Apoptotic processes leading to these changes and to DNA degradation explain why the average
length of DNA fragments extracted from ancient mammoth hairs is 140–160 bp or less.
88,89
Many processes leading to DNA degradation, including those that accompany cell and tissue
senescence (telomere shortening, error accumulation during DNA synthesis), occur naturally in
vivo. Apoptosis finds its continuation in postmortem tissues, leading to further fragmentation of
DNA even in favourable conditions for specimen preservation. The detailed biochemistry of
processes occurring after death still requires further evaluation, and elucidation of their
contribution to aDNA quality might be a promising area for research.
3.2.2. Contamination
Even after successful DNA extraction, results must always be checked for authenticity. aDNA is
often contaminated with some level of exogenous DNA, e.g. DNA from ancient or modern
saprotrophic bacteria or fungi, DNA from organisms juxtaposed with the remains postmortem, or modern human DNA
from the researchers themselves. Naturally, low amounts of aDNA (or its complete absence) in
the sample might facilitate the domination of PCR products by exogenous DNA, resulting in the
recovery of irrelevant sequences. Indeed, in the 1990s, a large number of papers were
published reporting DNA sequences from extremely ancient remains such as Miocene plant
fossils,
90,91
amber-entombed organisms,
92,93
250-million-yr-old bacteria in salt crystal,
94
and
dinosaur bones and eggs.
95-98
In one such case, researchers reported successful extraction and
amplification of mtDNA cytochrome b fragment from a Cretaceous Period dinosaur.
95
The
sequences differed from all modern cytochrome b sequences. This led the authors to believe
that they had sequenced authentic DNA from 80-million-yr-old bones. It was later discovered
that those mtDNA sequences were not close to avian and reptilian mtDNAs, as would be
expected from their phylogenetic history, but rather to mammalian (including human) mtDNAs.
It was thereby suggested that the alleged ‘dinosaur’ DNA was contaminated, presumably by
modern human DNA.
99-102
A similar course of events occurred in the study of ancient bacterial
DNA supposedly preserved in 250-million-yr-old salt crystals, which turned out to be modern
bacterial DNA.
103
In addition to these examples, several other aDNA projects have been
impeded by contamination of ancient samples.
98,103-106
To prevent contamination, the experiment must be properly managed, including special
requirements for sample collection, sterilization of the working area, DNA authentication, and
independent reproducibility.
97,107
These protocols are constantly being refined and improved.
For example, in addition to mechanical removal of the upper layer and UV and/or bleach
treatment of the sample, a brief pre-digestion step was recently suggested,
108
consisting of
short-term sample incubation in an extraction buffer and its subsequent removal. According to
the authors, this step alone increases the fraction of endogenous DNA several fold. In general,
the sequencing preparation step plays an important role in minimizing contamination. When there
is sufficient material, the sequencing library may be prepared entirely without PCR, greatly
reducing the potential for sample contamination.
109
Recent work actually takes advantage of
postmortem modifications to enrich for endogenous versus contaminating sequences.
110
In shotgun sequencing of vertebrate samples, a substantial fraction of the reads comes from
contamination with environmental DNA from bacteria and fungi.
2,20,111
Microbial sequences are
often remarkably different from target species sequences and thus should be easily flagged by a
standard BLAST search against the NCBI non-redundant nucleotide database. This strategy,
however, cannot identify reads from the many microbial taxa that have yet to be sequenced.
Therefore, it is not surprising that a large fraction of reads in many aDNA libraries is labelled as
‘unknown’ or ‘unclassified,’ mainly due to the unidentified microbial content.
86
Frequently,
mapping the shotgun sequencing reads onto the reference genome of the target species (or the
closest genome at hand) and discarding all reads below a certain level of similarity is
preferred
112
alongside choosing tissues with less microbial DNA. For instance, it has been
suggested that hair shafts or avian eggshells contain less microbial DNA than bone,
88,113
but
these tissues are not available for most ancient samples. Alternatively, recovery of bacterial or
fungal sequences is not very likely for PCR-based capture methods, as primers are designed
based on known sequences from the sample’s own species or its close relatives.
The difficulty and the method of detecting and removing modern human contamination depend
on the distance of the target species from humans. As expected, it is much easier to handle
distantly related species such as mammoths, penguins, or cave bears than archaic hominids,
like the Denisovans and Neanderthals, and particularly ancient modern humans. Moreover, the
archaeological material in Europe is usually excavated and later handled, extracted, and
sequenced by Europeans, sometimes from the same region. The same is generally true for
other territories around the world. When a limited number of loci are sequenced from PCR or
cloning products, it is possible to examine alignments visually and to inspect individual
polymorphic positions to determine which differences are genuine and which are likely
artefacts or contamination;
114,115
however, with reads from shotgun sequencing technologies,
automated methods are typically required.
Analysing sequence reads in a phylogenetic framework along with sequences from ancient and
extant relatives and outgroups is one of the initial steps to ensure that ancient sequences fit
within the acceptable phylogeny and flag probable contamination. For instance, sequences
from the mammoth were compared with those of the elephant, its closest kin, and to
outgroups, such as humans and dogs, to ascertain phylogenetic correctness.
20
Filtering reads
that were mapped onto the elephant genome with a high score and matched the elephant
genome better than that of human, dog, or other species helped to remove human and
microbial contamination. Neanderthal samples were phylogenetically examined to see if they
fall outside the range of modern human variation.
48
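The phylogenetic-correctness filter described above can be sketched as follows. This is a simplified illustration, not the procedure of the cited mammoth study: it assumes that each read has already been aligned, without gaps, to an equal-length segment of every candidate reference, and the function names, the `target` label, and the 0.9 identity threshold are hypothetical choices.

```python
def identity(read, ref_segment):
    """Fraction of matching positions between two equal-length, gap-free sequences."""
    if len(read) != len(ref_segment) or not read:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(read, ref_segment))
    return matches / len(read)


def keep_read(read, aligned_segments, target="elephant", min_identity=0.9):
    """Keep a read only if the target reference is its strictly best match.

    aligned_segments maps a species name to the reference segment aligned
    to the read (assumed to be precomputed by a read mapper).
    """
    scores = {sp: identity(read, seg) for sp, seg in aligned_segments.items()}
    # Best score among all non-target references (0.0 if there are none).
    best_other = max((s for sp, s in scores.items() if sp != target), default=0.0)
    return scores[target] >= min_identity and scores[target] > best_other
```

A read that matches the elephant segment at 90% identity and the human segment at 70% would be retained, whereas a read that matches the human segment perfectly would be discarded as probable contamination.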
Initially, a number of human and non-human studies filtered out samples with long sequence
fragments considered evidence of contamination since authentic aDNA is supposed to be
fragmented.
82,115
However, it has become clear that the average aDNA fragment length can
vary substantially between samples and can overlap with contaminant fragment lengths;
therefore, more elegant approaches are needed to develop authentication criteria based on
length. In a study of Neanderthal DNA, estimates of human-Neanderthal sequence divergence
and the percentage of C→T and G→A (equivalent events) misincorporations did not vary
significantly with alignment length.
116
Existence of substantial modern DNA contamination
would have produced two types of fragments: authentic ancient ones which were short and
had high numbers of mismatches (showed high divergence versus modern human reference),
and modern contaminant ones which were long and showed few mismatches (showed low
divergence versus modern human reference). Noonan et al.
116
remarked that the absence of an
inverse relationship between alignment length and divergence from the human reference
meant that the level of contamination with modern DNA was negligible in their dataset;
however, they did not provide a quantitative estimate. The problem with this approach is that
even among authentic ancient fragments, short fragments presumably carry higher rates
of base modification and consequently may produce upward-biased divergence estimates.
85
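The reasoning in this paragraph, that substantial modern contamination would produce an inverse relationship between alignment length and divergence from the human reference, can be checked with a simple correlation. The sketch below is an assumption-laden illustration (Pearson correlation on per-alignment mismatch fractions, hypothetical function names, toy data in the usage note) and is not the analysis performed by Noonan et al.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def length_divergence_correlation(alignments):
    """Correlate alignment length with divergence from the reference.

    alignments: list of (alignment_length, mismatch_count) pairs.
    A strongly negative value is the pattern expected from a mixture of
    short, divergent ancient fragments and long, near-identical contaminants.
    """
    lengths = [length for length, _ in alignments]
    divergences = [mismatches / length for length, mismatches in alignments]
    return pearson_r(lengths, divergences)
```

With toy alignments such as (40, 4), (50, 5), (200, 2) and (220, 2), the correlation is strongly negative, which would flag possible contamination; a value near zero is consistent with the absence of such a mixture, subject to the caveat about short fragments raised above.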
Through the accumulation of ancient sequences over time, positions at which the target
sequence (e.g. Neanderthal or Denisovan) has been invariably different from the likely
contaminant (e.g. modern humans) can be used to estimate modern DNA
contamination.
23,117,118
Here, mtDNA is the marker of choice because of its high copy number,
leading to greater sequencing depth; however, the validity of extrapolating mtDNA
contamination estimates to nuclear sequences has been questioned based on possible
differences in the conservation properties of mtDNA and nuclear DNA.
85
As base modification
and misincorporations in aDNA often involve C to U (T) and A to G transitions, contamination
with external DNA can be more reliably estimated using transversion or indel counts.
23
Even
when sufficient prior data on sequence variation in the archaic hominid population is not available,
the fraction of reads that deviate from consensus base calls at haploid loci, e.g. those on
mtDNA or the Y-chromosome, can provide an estimate of exogenous DNA—assuming that
authentic aDNA is more abundant than contamination and that correct sequence reads are
more likely than errors. This method is especially applicable to positions at which the modern
human population is fixed for the derived base while the archaic consensus base is ancestral.
3,53
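A minimal sketch of the diagnostic-position approach is given below. It assumes that, for each informative site, the archaic base, the likely contaminant base, and the bases observed in the mapped reads are already known; the tuple layout and function name are hypothetical. Consistent with the preference for transversions stated above, the toy sites in the usage note are transversion differences.

```python
def contamination_estimate(pileups):
    """Estimate contamination as the share of reads carrying the contaminant allele.

    pileups: list of (archaic_base, contaminant_base, observed_bases) tuples,
    one per diagnostic position where the two populations are fixed for
    different bases. Reads showing any third base are ignored, since they
    are more likely sequencing errors or damage than either source.
    """
    contaminant_reads = 0
    informative_reads = 0
    for archaic_base, contaminant_base, observed_bases in pileups:
        for base in observed_bases:
            if base == archaic_base:
                informative_reads += 1
            elif base == contaminant_base:
                informative_reads += 1
                contaminant_reads += 1
    if informative_reads == 0:
        raise ValueError("no informative reads at the diagnostic positions")
    return contaminant_reads / informative_reads
```

For instance, with one site (archaic A, contaminant C) covered by nine A reads and one C read, and a second site (archaic G, contaminant T) covered by eight G reads, one T read, and one uncalled base, the estimate is 2/19, or roughly 10.5%.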
Ancient modern humans are not expected to carry informative (fixed) substitutions compared
with contemporary humans or necessarily to fall outside the range of modern human
phylogeny, although they might do so. A first step in the QC of sequences from ancient modern
human samples is to ascertain that all sequence reads come from a single individual. This can
be done by estimating X-linked heterozygosity in male samples, Y-linked heterozygosity in male
samples, Y-linked presence in female samples, or mtDNA heterozygosity for either
sex.
2,53,85,119
Next, it is necessary to show that each specimen in the dataset carries unique
sequences (e.g. mtDNA or Y-chromosome haplotypes) that are different from sequences of
other specimens and from the researchers.
83,120,121
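The single-individual check can be illustrated by counting haploid sites (mtDNA, or X and Y positions in males) at which a second allele appears at non-negligible frequency. The sketch below is only one possible formulation; the depth cut-off of 10 reads and the 10% minor-allele threshold are arbitrary assumptions, as are the function names.

```python
from collections import Counter


def minor_allele_fraction(observed_bases):
    """Fraction of called bases that differ from the most common base at a site."""
    counts = Counter(b for b in observed_bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts.most_common(1)[0][1] / total


def fraction_mixed_sites(pileups, min_depth=10, max_minor=0.1):
    """Share of well-covered haploid sites that look like a mixture of sources.

    pileups: list of strings, each holding the bases observed at one site.
    Sites with fewer than min_depth called bases are skipped.
    """
    usable = [p for p in pileups if sum(b in "ACGT" for b in p) >= min_depth]
    if not usable:
        raise ValueError("no site reaches the minimum depth")
    mixed = sum(minor_allele_fraction(p) > max_minor for p in usable)
    return mixed / len(usable)
```

A value close to zero is what a single haploid source would produce; a large share of mixed sites suggests that reads from more than one individual are present.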
As most of the ancient sequences during the pre-NGS era or shortly thereafter were limited to
mitochondrial markers,
23,122,123
it was crucial to distinguish them from nuclear inserts of mtDNA
(NUMTs). Generally, a higher alignment score to the mitochondrial sequence than to the
nuclear sequences is the authentication criterion. For extinct species without a reference,
where sequence reads must be mapped to the genome of another species, this becomes more
complicated because the divergence of orthologous sequences must be considered in addition
to differences between NUMTs and their mitochondrial counterparts.
23
Considering the low
likelihood of heteroplasmy, observing more than one allele at non-negligible frequency at
a given position would indicate either external contamination or sequencing of NUMTs.
3.2.3. Postmortem base modification
Postmortem DNA modifications through hydrolysis and oxidation pose another substantial
difficulty for studying aDNA. The most significant alteration is nucleotide deamination, which
leads to false transitions during PCR: cytosine to uracil, 5-methyl-cytosine to thymine (both
causing incorporation of T instead of C), and, more rarely, adenine to hypoxanthine (causing
incorporation of G instead of A).
82,124-127
Chemical modification of nucleotides can lead to
reduced sequencing coverage because it prevents mapping of many authentic reads due to
an inflated number of mismatches compared with the reference. It can also result in
erroneous base and genotype calls and false estimates of genomic parameters such as
heterozygosity, nucleotide diversity, GC content, or divergence times. Base modifications are
often observed in the 5–7 final bases of DNA fragments and are thought to occur more readily
in the terminal, single-stranded overhangs.
128
These terminal misincorporations are even more
problematic because local sequence alignment methods used for mapping the NGS reads onto
the reference genome rely heavily on matching initial bases to the reference.
86
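The concentration of misincorporations near fragment ends can be visualised by tabulating the C→T mismatch frequency as a function of distance from the 5′ end of each read. The following sketch assumes gap-free reference/read pairs oriented 5′ to 3′ along the read; the function name and the `max_pos` parameter are hypothetical, and dedicated damage-profiling tools do considerably more than this.

```python
def ct_frequency_by_position(aligned_pairs, max_pos=10):
    """C->T mismatch frequency at each of the first max_pos read positions.

    aligned_pairs: list of (reference_segment, read) strings of equal length,
    both written 5'->3' in the orientation of the read.
    Returns a list in which entry i is the fraction of reference C bases at
    position i that are read as T, or None if no reference C was seen there.
    """
    c_total = [0] * max_pos
    c_to_t = [0] * max_pos
    for reference, read in aligned_pairs:
        for i, (ref_base, read_base) in enumerate(zip(reference[:max_pos], read[:max_pos])):
            if ref_base == "C":
                c_total[i] += 1
                if read_base == "T":
                    c_to_t[i] += 1
    return [t / n if n else None for t, n in zip(c_to_t, c_total)]
```

In damaged libraries the first few entries are expected to be markedly higher than the rest, which is the pattern exploited by the end-trimming strategy discussed later in this section.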
To overcome problems with chemical modification, several approaches have been developed.
Treatment with uracil-N-glycosylase (UNG) removes uracil residues, thus preventing replication
of fragments with deaminated cytosine;
124,129
however, the resulting abasic sites prevent
replication by DNA polymerases, which excludes all the fragments with uracil from the reaction.
This loss can be critical for valuable ancient samples that already have low DNA concentrations. A
simple modification was suggested recently to overcome this problem:
130
follow-up treatment
with endonuclease VIII after UNG repairs most of the abasic sites and enables subsequent
analysis of these fragments. This procedure, however, does not resolve the problem of false
A→G transitions. Using DNA polymerases such as Phusion (Pfu), which do not amplify through uracil,
also avoids false C→T (but not A→G) transitions but excludes all uracil-containing fragments
from amplification, which further decreases the DNA template in the reaction. Also, since these
enzymes can work with methylated, deaminated cytosine (i.e. 5-methyluracil, thymine), the
problem remains for methylated aDNA. Single primer extension PCR (SP-PCR) enables analysis
of separate DNA strands, which makes it possible both to distinguish real mutations from
postmortem modifications and to evaluate the level of these modifications.
124-127,131
SP-PCR is
performed in two steps: first, PCR with only one primer is carried out to accumulate only one
DNA strand, and then the second primer is added to the reaction and PCR continues with a
normal protocol. The resulting PCR product derives mainly from one of the DNA strands.
Analysis of these products can identify in which DNA strand postmortem modification occurred.
This method requires very thorough selection of PCR primers and annealing temperatures,
otherwise non-specific annealing or formation of primer dimers is highly possible.
One estimation strategy for base modification compares the percentage of T and A calls at
ultra-conserved C and G positions, respectively. These genomic positions are expected to have
retained their ancestral state in the ancient sample, so transitions exclusively observed in the
ancient sample can be attributed to base misincorporations.
4,89
Another method compares the
frequencies of different types of transitions and transversions in ancient-modern and modern-
modern sequence alignments of closely related species (e.g. Neanderthal-human, Neanderthal-
chimp, and human-chimp). An excess of C→T (and G→A, respectively) transitions in modern-
ancient alignments provides an estimate of base modification.
86
The third strategy takes
advantage of the direction of transition induced by base modification. In a 2006 study, the C→U
modification in mammoth DNA caused the apparent rate of (mammoth T) → (elephant C)
transitions to be 1.9-fold larger than that of (mammoth C) → (elephant T) transitions.
20
Recently
developed experimental protocols, such as pre-treatment of aDNA with UNG, have reduced the
magnitude of this problem.
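The third strategy, based on the direction of transitions, amounts to comparing two site counts in a pairwise alignment. The sketch below assumes a gap-free alignment of an ancient and a modern sequence; the function name is hypothetical and the toy sequences in the usage note are invented, so the resulting ratio has no relation to the 1.9-fold value reported for mammoth.

```python
def directional_transition_ratio(ancient, modern):
    """Ratio of (ancient T, modern C) sites to (ancient C, modern T) sites.

    Without postmortem damage the two classes are expected to be roughly
    balanced, so a ratio well above 1 points to C->U deamination in the
    ancient sample. Both sequences must come from a gap-free alignment.
    """
    if len(ancient) != len(modern):
        raise ValueError("aligned sequences must have equal length")
    ancient_t_modern_c = sum(a == "T" and m == "C" for a, m in zip(ancient, modern))
    ancient_c_modern_t = sum(a == "C" and m == "T" for a, m in zip(ancient, modern))
    if ancient_c_modern_t == 0:
        raise ValueError("no (ancient C, modern T) sites; ratio is undefined")
    return ancient_t_modern_c / ancient_c_modern_t
```

For the toy alignment of TTTCAGTC (ancient) against CCTTAGCC (modern), there are three sites of the first class and one of the second, giving a ratio of 3.0.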
If the level of base modification is non-negligible, steps must be taken to eliminate or lessen its
effect on the output of downstream population genetic analyses. Sometimes, C→T / G→A or all
transitions are simply left out of the analyses, and only transversions and indels are included in
the calculation of divergence or reconstruction of phylogeny.
1
Alternatively, it is possible to
polarize polymorphisms into ancestral and derived states using an outgroup (e.g. chimp for
human-Neanderthal comparisons) to place the mutation events on the corresponding branches
of the phylogenetic tree using a parsimony approach (of which the branch leading to the
ancient sample will probably contain disproportionately high numbers) and to calculate
divergence times using information from branches leading to modern samples only.
86
Another
strategy takes advantage of the observation that most of the base modifications occur at the 5′
and 3′ ends of fragments and trims a few (5–7) bases off either end of each sequence read to
lower the chance of including a misincorporated base.
2,119
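Two of the mitigation steps described in this paragraph, trimming read ends and restricting divergence calculations to transversions, are simple enough to sketch directly. The code below is illustrative only: the default of five trimmed bases follows the 5–7 range given in the text, while the function names and the treatment of gaps and ambiguous bases are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def trim_read_ends(read, n=5):
    """Remove n bases from both ends of a read; return '' if the read is too short."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return read[n:len(read) - n] if len(read) > 2 * n else ""


def is_transversion(a, b):
    """True if a and b are a purine/pyrimidine pair (in either order)."""
    return (a in PURINES and b in PYRIMIDINES) or (a in PYRIMIDINES and b in PURINES)


def transversion_divergence(seq1, seq2):
    """Transversions per compared site in a pairwise alignment.

    Transitions are ignored because they are confounded with postmortem
    deamination; columns with gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            transversions += is_transversion(a, b)
    if compared == 0:
        raise ValueError("no comparable sites")
    return transversions / compared
```

Trimming discards data from every read, including undamaged ones, so the number of bases removed is a trade-off between retained coverage and residual misincorporation.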
Although the problems associated with aDNA anomalies are not insurmountable using available
experimental and bioinformatics technologies, drastic variations in the type and magnitude of
damage among ancient remains make it impossible to develop a universally successful protocol
for aDNA extraction and sequencing. For instance, the fraction of authentic Neanderthal
mtDNA among six examined ancient samples varied from ~1% to ~99%,
86
and the level of
contamination in five well-preserved human bone specimens dated 800–1600 CE varied from
0% to 100%.
120
The Neanderthal mitochondrial genome and partial nuclear genome were
retrieved using data from several sequencing attempts.
48,86,116
This compendium was crucial in
determining design parameters for assembling the full Neanderthal nuclear genome.
8
The
contaminating sequences in an ancient maize microsatellite genotyping project were found to
be of different natures across samples: some exhibited mainly microbial contamination,
whereas others contained copies of transposable elements.
82
Therefore, an initial round of
extraction and sequencing is recommended to estimate quality parameters for each sample
(e.g. yield, chemical modification, % contamination, % uniquely mapped reads, and % genome
covered) to inform appropriate experimental and data preparation strategies. It is also
important to remember that both experimental and computational methods for overcoming
aDNA problems have advantages and limitations. Therefore, to achieve the most reliable
results, researchers should use both approaches to examine aDNA
(Table 3.1). A good example of combining novel experimental and
computational approaches, and of matching the methods to the goal, is
the paper of Haak et al.,
12
which employed powerful experimental protocols, stringent
quality control procedures, and bioinformatics and population genetics approaches to test
hypotheses about the steppe origin of the speakers of Indo-European languages.
3.3. Analysis of aDNA data
3.3.1. Software tools for pre-processing of aDNA NGS data
An important consideration for the analysis of aDNA, which typically undergoes many rounds of
amplification, is the presence of PCR duplicates, which must be identified, and ideally
removed.
132
Once the sources of contamination or base misincorporation are detected and
removed from the aDNA sequences, it is possible to infer genotypes for further analysis. In
addition to regular NGS data quality control and pre-processing steps, application of specialized
tools is required to address the special features of aDNA. Genotypes can be inferred more
accurately by combining observed read bases with estimators of contamination, base
modification, sequencing error, and read alignment quality in a single maximum
likelihood (ML) calculation.
133
The ML model can be designed in the haploid mode for mtDNA,
or X- and Y-chromosomes in males, or diploid mode for autosomal markers or X-linked markers
in females.
2
Depending on the specific design, ML models can use these estimators to output
the genotype or co-estimate all of these parameters simultaneously.
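To make the ML idea concrete, the following minimal sketch computes genotype log-likelihoods at a single site in either haploid or diploid mode. It is a simplification under stated assumptions: sequencing errors are spread evenly over the three alternative bases, post-mortem damage is a single rate applied to C→T and G→A regardless of position in the read, and contamination is not modelled. The function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
DAMAGE = {"C": "T", "G": "A"}  # deamination as seen on either strand


def base_prob(obs, allele, err, damage):
    """P(observed base | true allele): damage first, then sequencing error."""
    def seq_err(b, a):
        return 1.0 - err if b == a else err / 3.0

    damaged = DAMAGE.get(allele)
    if damaged is None:
        return seq_err(obs, allele)
    return (1.0 - damage) * seq_err(obs, allele) + damage * seq_err(obs, damaged)


def genotype_log_likelihoods(observations, damage=0.0, ploidy=2):
    """observations: list of (base, error_probability) pairs at one site."""
    if ploidy == 1:
        genotypes = [(b,) for b in BASES]
    else:
        genotypes = list(combinations_with_replacement(BASES, 2))
    result = {}
    for g in genotypes:
        ll = 0.0
        for obs, err in observations:
            # Each read is drawn from one of the alleles with equal probability.
            p = sum(base_prob(obs, a, err, damage) for a in g) / len(g)
            ll += math.log(max(p, 1e-300))  # guard against log(0)
        result["".join(g)] = ll
    return result


def call_genotype(observations, damage=0.0, ploidy=2):
    """Return the genotype with the highest log-likelihood."""
    lls = genotype_log_likelihoods(observations, damage, ploidy)
    return max(lls, key=lls.get)
```

With five C reads and one T read, this toy model calls a CT heterozygote when damage is ignored but a CC homozygote once a substantial deamination rate is allowed for, which is the behaviour the damage-aware ML models are designed to capture.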
Current NGS analyses of aDNA are performed with well-established but non-specialized
computational tools as novel customized tools for aDNA analysis have not yet been widely
accepted, and custom scripts have to be written to adjust for aDNA specifics. Base calling is
frequently performed with Illumina's standard base-caller Bustard, BayesCall
134
(flexible model-
based tool) and freeIbis
135
(utilizing a multiclass Support Vector Machine algorithm). FastQC
136
is typically used for preliminary quality control of reads. AdapterRemoval,
137
CutAdapt,
138
and
SeqPrep
139
are currently the most common tools in the aDNA world for de-multiplexing,
adapter trimming, low-quality call trimming, and paired-end merging. Burrows-Wheeler Aligner
(BWA) and Bowtie are commonly used for mapping of aDNA reads.
140
Since BWA and Bowtie
were developed for high-quality modern DNA reads, their parameters must be adjusted to reflect
the properties of aDNA. For example, it may be advisable to trim likely-damaged
positions, disable the seed, adjust gap openings and penalties, and permit indels at read
ends.
140
Genome Analysis Toolkit (GATK)
141,142
or SAMtools
143
are then used for variant calling.
Methods should be optimized for shorter (17–35 nucleotide) reads with possible adapters on
both ends of the read and a large overlap between paired reads, and the call corresponding to
the highest quality score should be selected at each position. In addition, due to low quantities
of endogenous DNA, the high number of short reads, and high levels of contamination, masking
repeat regions may improve read mapping. Several pipelines (e.g. aLib
144
and PALEOMIX
145
)
incorporating all these changes were designed for aDNA analysis. Sometimes selected elements
of such pipelines are combined to achieve optimal performance. For instance, the leeHom
module of aLib is used to pre-process reads
146
while Anfo,
1
MIA,
117
and BWA-PSSM
147
are used
for subsequent read mapping.
Figure 3.3. Flowchart of a typical bioinformatics pipeline for aDNA analysis using NGS data
After read pre-treatment and alignment, the text file containing the sequence alignment data
(termed the SAM file or BAM for binary files) can be used to estimate contamination and
degradation levels using tools such as mapDamage
148
and mapDamage2.0.
149
PMDtools
150
(identification of those DNA fragments that are unlikely to come from modern sources) and
Schmutzi
151
(maximum a posteriori estimator of mitochondrial contamination for ancient
samples) are utilized to select reads for re-analysis that have higher chances of coming from
aDNA. Depending on the situation, analysis can be done in a fully automated cycle for the
entire genome or only for mtDNA. Such algorithms report the probabilities of different types of
postmortem DNA degradation, which allows for better statistical modelling at the variant calling
stage, employing SNPest
152
or custom scripts. A typical NGS pipeline for aDNA analysis is shown
in Fig. 3.3.
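The kind of damage summary that such tools report can be illustrated with a small sketch that tabulates the C→T mismatch frequency as a function of distance from the 5′ end of the read. The input format (ungapped read/reference string pairs, both oriented 5′→3′) and the function name are assumptions made for this example; it does not reproduce the statistical model of any particular program.

```python
def c_to_t_profile(alignments, n_positions=10):
    """Fraction of reference C's read as T at each of the first n_positions.

    alignments: iterable of (read, reference) strings of aligned, ungapped
    bases, both written 5'->3' relative to the read.
    Positions with no reference C are reported as None.
    """
    c_total = [0] * n_positions
    c_to_t = [0] * n_positions
    for read, ref in alignments:
        for i in range(min(n_positions, len(read), len(ref))):
            if ref[i].upper() == "C":
                c_total[i] += 1
                if read[i].upper() == "T":
                    c_to_t[i] += 1
    return [ct / tot if tot else None for ct, tot in zip(c_to_t, c_total)]
```

In authentic aDNA the resulting profile is expected to be highest at the first position and to decay towards the read interior, whereas a flat profile is more typical of modern contaminants.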
The amount of extracted endogenous DNA may allow satisfactory coverage of aDNA sequences
(as high as 10–20x coverage at a subset of regions for a few samples), comparable to modern
DNA studies. Nevertheless, it is very common for ancient samples to have ~1x average
coverage. In such cases, population genetics analysis can still be done using ADMIXTURE and
other standard tools by choosing the variant with the highest number of supporting reads (or
with the highest quality) instead of trying to make heterozygous/homozygous calls at each
autosomal position (which is tricky when there are <5 reads covering a given position resulting
in 3–4 conflicting variants). This method can be used when the amount of contaminating
modern human DNA is much lower than the amount of endogenous DNA.
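A minimal sketch of this single-allele calling rule is given below. The pileup representation (a list of base and Phred-quality pairs), the quality cut-off of 20, and the alphabetical tie-break are assumptions chosen for illustration.

```python
from collections import Counter


def call_pseudo_haploid(pileup, min_qual=20):
    """Pick one allele per site from a low-coverage pileup.

    pileup: list of (base, phred_quality) pairs covering the site.
    The allele with the most supporting reads wins; ties are broken by
    the summed base quality, then alphabetically. Returns None when no
    read passes the quality filter.
    """
    support = Counter()
    qual_sum = Counter()
    for base, qual in pileup:
        base = base.upper()
        if qual >= min_qual and base in "ACGT":
            support[base] += 1
            qual_sum[base] += qual
    if not support:
        return None
    return max(sorted(support), key=lambda b: (support[b], qual_sum[b]))
```

Because each site is reduced to a single allele, downstream tools must be told to treat the sample as haploid (or homozygous) rather than interpreting the absence of heterozygotes as a biological signal.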
Special care needs to be exercised when combining SNP data from ancient and modern
samples. A recently published analysis of the first ancient African genome
26
presented an
erroneous conclusion that genomes of individuals throughout Africa contain DNA inherited
from Eurasian immigrants.
153
The error was noticed, and the authors published an erratum
stating that it had been necessary to convert the input produced by SAMtools to be compatible
with PLINK, but this step was omitted causing the removal of many positions homozygous to
the human reference genome.
153
This example illustrates the importance of using validated
pipelines for aDNA analysis.
3.3.2. Interpretation of aDNA data
Below, we discuss the analytical methods for biologically relevant interpretation of aDNA data.
Various population genetics analytical methods have been applied to infer past demographic
events based on data obtained from aDNA studies. One of the basic methods for identifying
ancient haplotypes is scanning present-day populations for variants identified in the aDNA. This
simple approach provides an estimate of populations/regions that harbour such ancient genetic
signatures and has been successfully applied to identify modern European populations with
mtDNA mutations that were found in aDNA samples.
154,155
Analysis of single-nucleotide
polymorphisms (SNPs) in prehistoric samples can shed light on ancestral phenotypes, including
pigmentation of skin, hair, and eyes,
37
and the sex of the individual can be inferred from the ratio
of reads mapping to the Y- and X-chromosomes.
156,157
In the case of uni-parental markers such
as mtDNA variants and Y-chromosome markers, the mutational distance between the ancient
and modern haplotypes is visualized using phylogenetic network analysis programmes.
155,158
Network analysis of haplotype data reveals genetic distance, mutation rate, and regions of
haplotype spread. Recently, a novel method for dating ancient human samples was
developed.
159
The method is based on a recombination clock and shared history of Neanderthal
gene flow into non-Africans.
159
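The read-ratio approach to sex identification mentioned above can be sketched as follows. Here the ratio is expressed as the fraction of Y-mapped reads among all reads mapped to the sex chromosomes, with a normal-approximation confidence interval. The two thresholds are placeholder values included only to make the example runnable; they are assumptions that would have to be calibrated against samples of known sex for the reference genome and mapping filters in use.

```python
import math


def y_read_fraction(n_x, n_y):
    """Fraction of sex-chromosome reads mapping to Y, with its standard error."""
    total = n_x + n_y
    if total == 0:
        raise ValueError("no reads mapped to X or Y")
    r_y = n_y / total
    se = math.sqrt(r_y * (1.0 - r_y) / total)
    return r_y, se


def assign_sex(n_x, n_y, female_max=0.016, male_min=0.075):
    """Assign XX/XY only when the 95% interval clears a threshold.

    female_max and male_min are illustrative placeholders, not validated
    cut-offs.
    """
    r_y, se = y_read_fraction(n_x, n_y)
    lower, upper = r_y - 1.96 * se, r_y + 1.96 * se
    if upper < female_max:
        return "XX"
    if lower > male_min:
        return "XY"
    return "undetermined"
```

Returning "undetermined" when the interval straddles a threshold is a deliberate choice: with very few mapped reads, or with contamination from an individual of the opposite sex, an intermediate value is more honest than a forced call.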
With increased numbers of recovered ancient and historic DNA samples and steady
improvements in aDNA sequencing technology, scientists can study the distribution of ancient
human genetic variation and compare it to that of modern populations
160
or gain a deeper-level
understanding of the distribution of genetic variation within populations by applying admixture-
based tools for joint analysis of modern and ancient samples at a population level. Tools and
approaches (such as PCA,
161
STRUCTURE,
162
ADMIXTURE,
163
SPAMIX,
164
SPA,
165
ADMIXTOOLS,
166
GPS,
167
LAMP,
168
HAPMIX,
169
reAdmix,
170
MULTIMIX,
171
mSpectrum,
172
SABER,
173
and others)
which were initially developed for population analysis of contemporary individuals, can be
applied in combination with anthropological data and historical records to reconstruct
migration patterns, provenances, and local and global ancestries of extinct populations.
ADMIXTURE is a computational tool for ML estimation of individual ancestries from multi-locus
SNP genotype datasets. Recently, Allentoft et al.
11
inferred the ancestral components from
modern samples and then projected the ancient samples onto the inferred components using
the ancestral allele frequencies inferred by ADMIXTURE. Comparison of admixture profiles of
ancient and modern populations within a given region informs the generation of hypotheses
about population migrations that can be validated with independent sources and methods of
analysis. NGSADMIX uses genotype likelihoods instead of called genotypes to resolve ancestry,
which is particularly useful considering the myriad sources of uncertainty in aDNA NGS data.
174
The GPS algorithm determines the provenance of an individual, that is, the point on the globe where people
with similar genotypes live. Tools like SPAMIX and reAdmix model an individual as a weighted sum
of reference populations.
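The idea of projecting a sample onto fixed ancestral components can be illustrated with a short expectation-maximization sketch that estimates ancestry proportions for one individual when the allele frequencies of the reference populations are held constant. This is a simplified stand-in written for illustration, not the implementation used by ADMIXTURE or any other tool named above; the function name and input layout are assumptions.

```python
def estimate_ancestry(genotypes, ref_freqs, n_iter=200, eps=1e-6):
    """EM estimate of ancestry proportions with fixed reference frequencies.

    genotypes: list with 0, 1 or 2 copies of the counted allele per SNP
               (None for missing data).
    ref_freqs: one list per reference population giving the frequency of
               the counted allele at each SNP.
    Returns a list of proportions that sums to 1.
    """
    k = len(ref_freqs)
    # Clamp frequencies away from 0 and 1 to avoid division by zero.
    freqs = [[min(max(p, eps), 1.0 - eps) for p in pop] for pop in ref_freqs]
    sites = [j for j, g in enumerate(genotypes) if g is not None]
    if not sites:
        raise ValueError("no genotyped sites")
    q = [1.0 / k] * k
    for _ in range(n_iter):
        new_q = [0.0] * k
        for j in sites:
            g = genotypes[j]
            f_alt = sum(q[m] * freqs[m][j] for m in range(k))
            f_ref = sum(q[m] * (1.0 - freqs[m][j]) for m in range(k))
            for m in range(k):
                # Expected share of each allele copy assigned to population m.
                new_q[m] += g * q[m] * freqs[m][j] / f_alt
                new_q[m] += (2 - g) * q[m] * (1.0 - freqs[m][j]) / f_ref
        q = [v / (2.0 * len(sites)) for v in new_q]
    return q
```

For low-coverage ancient samples the hard genotype counts used here would normally be replaced by genotype likelihoods, which is the motivation for likelihood-based tools such as NGSADMIX.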
To infer the geographical origin of a specific haplotype, it is essential to partition the genome
into haplotypes with distinct ancestries that may have been inherited from multiple
populations. Such haplotypes can be obtained using ‘local ancestry’ tools (HAPMIX, SABER,
LAMP, MULTIMIX, and others), which infer ‘local ancestry’ rather than the
‘global ancestry’ that can be inferred with PCA, SPAMIX, GPS, ADMIXTURE, STRUCTURE, and
reAdmix. The choice between them depends on the complexity of the dataset, the expected mixture
levels, and the available phenotypic data. For instance, if the phenotype is associated with a
particular trait, a ‘local ancestry’ tool is preferred, whereas a ‘global ancestry’ tool should be
used when the phenotype is a complex trait involving multiple unknown loci. The list of the
described software tools is shown in the Supplementary Table 3.1.
When several individuals of an ancient population are available, certain population genetic
parameters can be estimated. For example, by examining a number of microsatellite loci in
160–200-yr-old Daphnia samples,
175
researchers were able to calculate the heterozygosity,
gene diversity, deviation from Hardy–Weinberg equilibrium, and linkage disequilibrium
between pairs of markers. An analysis of 21 samples from a graveyard in Germany dated ~6 kya
allowed analysis of the mtDNA and Y-chromosomal haplotype diversity as well as the selection
forces inferred from Tajima’s D.
83
Jaenicke-Despres et al. discovered allelic variants of three
genes that differentiate modern maize and teosinte from 11 maize cobs dating to 660–4,400
years ago, opening a window to the genetic chronology of maize domestication.
176
It should be
noted that quantitative data, such as allele frequencies from multiple poorly preserved samples,
should be treated with caution, as post-mortem DNA degradation can bias allele frequency
estimates.
109
However, population genetics also offers a range of neutrality tests, such as
Hardy–Weinberg equilibrium, which can be used to check for the presence of artefactual sites
when compared with present-day data.
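As a small worked example of such a check, the sketch below computes observed and expected heterozygosity and the chi-square statistic for deviation from Hardy–Weinberg proportions at a biallelic site. The function name and the dictionary it returns are assumptions for illustration.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Heterozygosity and Hardy-Weinberg chi-square for one biallelic site.

    n_aa, n_ab, n_bb: counts of the three genotypes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {"p": p, "h_obs": n_ab / n, "h_exp": 2 * p * q, "chi2": chi2}
```

The statistic has one degree of freedom, so values above about 3.84 indicate deviation at the 5% level. With the handful of individuals typical of ancient populations the chi-square approximation is poor and an exact test would be preferable. A site that deviates strongly in the ancient sample but not in present-day data is a candidate artefact rather than evidence of selection.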
Even when only a single member of an ancient population can be recovered, a number of
genomic and evolutionary inferences can be made about its taxon or population. For example,
based on remarkably low levels of dN/dS (non-synonymous to synonymous substitution ratio),
it was concluded that mitochondrial proteins were under strong purifying selection in
Denisovans.
50
Conversely, the higher dN/dS ratio calculated from Neanderthal genomes was
attributed to smaller effective population size and inefficient purifying selection.
23,117
The
heterozygosity of the TAS2R38 locus in a single Neanderthal individual was used to infer that he
was a bitter-taster and, further, that this trait varied among Neanderthals.
177
correlation along
the genome, it was suggested that the Denisovans experienced a 30-fold decrease in effective
population size compared with African humans.
178
Sporadic calculations from museum data can
likewise be extremely useful in inferring population history, for example for determining
genetic continuity of populations.
179
3.4. Beyond DNA sequence: ancient epigenomics
Many human phenotypes, including physical and psychological characteristics and
predispositions to chronic diseases, arise from complex patterns of gene expression, which are,
in turn, influenced by poorly understood interactions of so-called ‘genetic determinants’ and
external environmental signals. These intricate interactions are commonly explained by (equally
poorly understood) epigenetic mechanisms including DNA methylation, histone modifications,
and a spectrum of non-coding RNAs that modify the structure of chromatin and modulate gene
expression. Until recently, reconstructing the gene expression profile of specific postmortem
samples using only DNA was deemed impossible. Several research groups analysing the
methylation maps (methylome) of Neanderthals proposed that patterns of CpG methylation
could be preserved in the DNA.
130,180
In 2012, Llamas et al. applied bisulfite allelic sequencing of
loci to late Pleistocene Bison priscus remains and demonstrated preservation of methylation
patterns,
181
although postmortem deamination of methylated cytosine to thymine prevented
accurate quantification of methylated cytosine levels.
In 2014, another approach for genome-wide methylation studies of ancient samples was
suggested.
25,180,182
In bisulfite sequencing, unmethylated cytosines are chemically converted
into uracils, which are then amplified by some polymerases, such as Taq, as thymines (T), while
mC is unaffected and is amplified as C. In postmortem samples C→U and mC→T spontaneous
conversions occur naturally. To discriminate between C→U and mC→T, the same fragments
must be amplified by two different polymerases (Fig. 3.4). Taq DNA polymerases can replicate
through uracils, while high-fidelity DNA polymerases, like Phusion (Pfu), cannot. Thus, it is
possible to detect methylated cytosines in aDNA by their elevated C/T mismatch rates as
compared with unmethylated cytosines. However, analysis of aDNA methylation is limited to
approximately 10 nucleotides from the fragment's ends. There are two reasons for this. First,
the probability of deamination drops exponentially with the distance of a C nucleotide from the
fragment’s end.
130
Second, the further the methylated C is located from the fragment's end, the
higher the probability that Pfu will encounter (and fail to bypass) a uracil originating from the
conversion of an unmethylated cytosine and will not reach the methylated C.
Figure 3.4. Epigenetic analysis of aDNA. As a result of cytosine and methyl-cytosine deamination in postmortem
samples, we observe C→U and mC→T conversions. When Taq polymerase is used for DNA amplification, both
C→U and mC→T will be recorded as T (this is the major difference between ancient and bisulfite-treated
samples, in which only unmethylated cytosine is converted to U while mC remains unchanged). When Pfu
polymerase is used, U will not be amplified, while those T that appeared as a result of mC→T conversion will be
read as T. The pie charts demonstrate the ratio of sequenced C (red) to T (blue). This C/T ratio with Taq and Pfu,
along with comparison with the reference genome, allows detection of methylated cytosines: in the case of
postmortem deamination C→U and PCR by Pfu, the frequency of T will be decreased.
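The mismatch-based logic can be sketched as a windowed summary of C→T mismatches at reference cytosines. The sketch assumes a library amplified with a polymerase that does not read through uracil, so that T calls at reference C positions mainly reflect deaminated methyl-cytosines. The input layout, the function name, and the window size are illustrative assumptions, and the returned value is a relative proxy rather than an absolute methylation level.

```python
def regional_methylation_proxy(sites, window=1000):
    """Aggregate C->T mismatch fractions over fixed genomic windows.

    sites: iterable of (position, c_reads, t_reads) for reference cytosines
           in CpG context on one chromosome.
    Returns {window_start: T / (C + T)} for windows with coverage.
    """
    totals = {}
    for pos, c_reads, t_reads in sites:
        w = pos // window
        c_sum, t_sum = totals.get(w, (0, 0))
        totals[w] = (c_sum + c_reads, t_sum + t_reads)
    return {
        w * window: t / (c + t)
        for w, (c, t) in sorted(totals.items())
        if c + t > 0
    }
```

Pooling over windows trades resolution for signal: at the coverage typical of ancient genomes a single CpG rarely carries enough reads to be informative, whereas a window containing many CpGs can distinguish hypo- from hypermethylated regions.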
The described strategy was applied to analyse aDNA from Neanderthal (50 kya), Denisovan (40
kya), and a relatively recent Palaeo-Eskimo individual (4 kya). Overall, DNA methylation
patterns in ancient human bones or hairs were almost indistinguishable from those in modern
humans. However, by examining differentially methylated regions, Gokhman et al.
180
found that
some key regulators of limb development, like HOXD9 and HOXD10, had methylated promoters
(in Neanderthal) and gene bodies (in Denisovan), whereas these regions are hypo-methylated
in bones of present-day humans. Deregulation of the HOXD cluster genes results in
morphological changes in mice.
183
Since this deregulation corresponds to Neanderthal-modern
human differences, it can be inferred that epigenetic changes in the HOXD clusters might have
played a key role in the recent evolution of human limbs. Differentially methylated regions
were also found within the MEIS1 gene, which encodes a protein that controls the activity of
the HOXD cluster.
183
Interpretation of the ancient methylome from C/T mismatch information aggregated over large
genomic regions makes it possible to determine whether extended regions with altered DNA methylation
were present in ancient samples. These include not only hypermethylated CpG islands but also
(i) large (from 10⁵ to 10⁶ bp long) partially methylated gene-poor domains that co-localize with
lamina-associated domains;
184,185
(ii) DNA methylation valleys extending over several kb of DNA,
which are strongly hypomethylated in most tissues, enriched in transcription factors and
developmental genes;
186,187
(iii) undermethylated canyons (up to dozens of kb) that were
recently identified in hematopoietic stem cells;
188
and (iv) epigenetic programmes associated
with intestinal inflammation and characterized by hypermethylation of DNA methylation valleys
with low CpG density and active chromatin marks.
189
Methylation analysis typically focuses on genomic regions that span several kb or even Mb.
187-189
The C→T mismatch aggregation strategy, applied in a recent aDNA epigenomic study,
25
could yield new perspectives on adaptation signals and disease markers if soft tissues are
found. Brain, intestine, muscle, and blood are not normally preserved in anthropological
samples, and extreme conditions (such as permafrost soil) are required for preservation.
Analysis of epigenetic patterns also allows estimation of the individual’s age at death, building on a
recent forensic study that found a correlation between the methylation state of specific CpGs
and the age of an individual.
190
Estimating the age of ancient humans from modern databases in this way rests on the assumption that
environmental signals 6 kya produced the same genomic methylation response as observed
today. Using this approach, Pedersen
et al. calculated that the Saqqaq individual was probably in his late thirties when he died.
25
Methylated CpGs are almost exclusively found in vertebrate somatic cells; bacterial genomes
feature methylated cytosines and adenines but rarely in a CpG context. Hence, CpG
methylation levels can be used to enrich the endogenous content of a human aDNA sample and
separate it from bacterial contaminants.
191
Methyl DNA binding domain (MBD) affinity
chromatography, allowing separation of methylated DNA probes containing a single methylated
CpG, has become a routine method for establishing methylomes of genomes of different
origins.
192
Application of this method to aDNA can facilitate characterization of ancient
methylomes and separate vertebrate and microbial fractions of aDNA extracts. Using the
remains of the Saqqaq Palaeo-Eskimo individual, woolly mammoths, polar bears, and two
equine species, methylation marks were shown to survive in a variety of tissues and
environmental contexts and over a large temporal span (>45–4 kya). Additionally, MBD
enrichment allows microbiome characterization for ancient samples and potentially
reconstruction of genomes of ancient pathogens.
Although DNA methylation may serve as an indicator of gene silencing, epigenetic analysis
alone is insufficient to determine whether the gene was destined for transcription or silencing.
Additional data, such as histone modification marks, chromatin structure, and transcription
factor binding information, are essential for gene activity prediction. Even though research on
ancient proteins is at the nascent stage, shotgun sequencing of aDNA provides a surprisingly
rich source of epigenetic information. Pedersen et al. observed unexpected periodicity in the
density of covered nucleotides along the Saqqaq genome
25
and hypothesized that these
periodic patterns could stem from the protection of DNA by nucleosome binding with
preferential degradation of linker regions between nucleosomes. Under this scenario, the
observed read depth would reflect the nucleosome occupancy. Analysis of the spectral density
(periodogram) in transcription start site (TSS) regions showed that the frequency spectrum has
a peak in the relative signal at 193 bp corresponding to the expected inter-nucleosome
distance.
25
Moreover, a phasogram from Fourier transform revealed a short-range (10 bp)
periodicity, reflecting preferential shifts in nucleosome positioning every 10 bp and/or
preferential cleavage of the DNA backbone facing away from nucleosome protection.
193
Strongly positioned nucleosomes in an ancient sample were also found within the vicinity (4 kb)
of the transcriptional repressor CCCTC-binding factor (CTCF) binding sites, and their order was
negatively correlated with uncovered DNA methylation.
25
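The spectral analysis of read depth described above can be illustrated with a plain discrete Fourier sum evaluated at a set of candidate periods. The sketch is self-contained and deliberately simple; the function names are assumptions, and a real analysis would average spectra over many TSS regions and use an FFT library.

```python
import cmath
import math


def periodogram(depth, periods):
    """Spectral power of a read-depth track at each candidate period (in bp)."""
    n = len(depth)
    mean = sum(depth) / n
    centred = [d - mean for d in depth]  # remove the zero-frequency component
    power = {}
    for period in periods:
        acc = sum(
            x * cmath.exp(-2j * math.pi * t / period)
            for t, x in enumerate(centred)
        )
        power[period] = abs(acc) ** 2 / n
    return power


def dominant_period(depth, periods):
    """Candidate period with the highest spectral power."""
    power = periodogram(depth, periods)
    return max(power, key=power.get)
```

Applied to a synthetic depth track that oscillates every 193 bp, `dominant_period` recovers that spacing. On real data the peak is broader, because nucleosome positions vary between cells and between loci.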
Since DNase I-hypersensitive sites
(DHSs) near the TSS are reliable predictive markers for gene transcription,
194
regions within
open chromatin structures may be more susceptible to postmortem or apoptosis-induced
DNase cleavage, in which case the density of NGS reads near the TSS of active genes would be
lower than at silent genes. Based on read density at known TSSs and DHSs from the ENCODE
project and using de novo methods of TSS prediction (e.g. NPEST
195
or TSSer
196
), it is possible to
sort TSSs according to transcriptional activity of corresponding genes. In the near future, it may
be feasible to quantitatively reconstruct gene expression patterns of ancient samples by
combining nucleosome positioning, the presence of DHSs at TSSs, and DNA methylation.
Therefore, analysis of preserved brains, such as those from bog bodies,
197
will be of particular
interest. Recently-found remains of a woolly mammoth that retained brain structures of a very
high quality
198
raised hopes that exciting discoveries are on the horizon that would allow us to
test whether the higher nervous system activity in modern humans differs from that of ancient
humans at the epigenetic level.
199,200
3.5. Conclusions
aDNA research has revolutionized a multitude of scientific disciplines. Representing the most
direct route to address a large number of questions in evolution, medicine, anthropology, and
history, aDNA became an indispensable tool in population genetics, paleo-epidemiology, and
related fields. Analysis of aDNA has made tremendous progress since its humble beginning in
the early 1990s, when contamination with modern DNA sources was commonplace, and only
limited analysis was possible due to DNA fragmentation and sparse sampling. In this review, we
attempted to provide a detailed overview of recent innovations aimed at coping with these
limitations, both through experimental procedures and bioinformatics algorithms. We also
considered challenges regarding aDNA biochemistry and degradation, particular bioinformatics
tools compensating for short reads and gaps in sequencing coverage, and advances in
population genetics to handle sparse sampling. Finally, we described the particularities of aDNA
epigenetics and functional interpretation of deduced activities of genes and pathways.
In envisioning future progress in aDNA studies, we would like to note that not every advance in
genomics or experimental biology may affect the field. Recent breakthroughs in genomic
technologies drastically increased the amount of information obtained from aDNA, and new
inventions, e.g. progress in targeted enrichment methods and single-molecule sequencing,
would likely allow investigation of previously intractable samples from hot climates and more
distant eras. However, experimental approaches will always be limited by the quantity and
quality of aDNA in ancient remains. Thus, development of computational methods to cope with
aDNA-specific biases and extract meaningful information from low-coverage aDNA data is
critical. Studies of aDNA will hugely benefit from further improvement of sophisticated
bioinformatics tools coupled with the rapid accumulation of content (reference genomes and
variant databases) from both ancient samples and freshly sequenced modern human
populations. Regarding the latter, one can hardly overestimate the effect of international
projects on systematic genotyping and sequencing of small and/or remote human populations
(see, for example, recent sequencing of 236 individuals from 125 distinct human populations by
Sudmant et al.,
201
a study of 456 geographically diverse high-coverage Y chromosome
sequences to infer a second strong bottleneck in the Y-chromosome lineage by Karmin et al.,
202
genotyping and comprehensive analysis of 2,039 samples from rural areas within the UK by
Leslie et al.,
203
as well as many other projects
13,158,167,204-208
). Parallel improvement of
experimental and computational methods will enable studies of ancient populations instead of
just a few individuals, and new studies of this kind are emerging now.
11,12,28
The utility of aDNA
data will increase with further progress in the genotype-to-phenotype mapping of humans. For
the first time, we can anticipate the direct study of evolution for traits that are not associated
with the fossil record, such as metabolic and behavioural details. aDNA will provide an
important source of information on the origins of cells that harboured DNA thousands of years
ago, the age of samples at the time of death, and the environmental influences. Altogether,
analysis of aDNA will help us to better understand our world and our role in it.
Acknowledgements
IM was supported by a Swiss Mäxi Foundation grant. HA was supported by NIH grants GM098741
and MH091561. EP was supported by Russian Science Foundation grant 14-14-01202. PF was
supported by the Moravian-Silesian region projects MSK2013-DT1, MSK2013-DT2, and
MSK2014-DT1 and by the Institution Development Program of the University of Ostrava. TVT,
EE, and PP were supported by NSF Division of Environmental Biology award 1456634. EE was
supported by The Royal Society International Exchanges Award (IE140020) and MRC Confidence
in Concept Scheme award 2014-University of Sheffield (Ref: MC_PC_14115). ER was supported
by Russian Science Foundation grant 14-50-00029. We are grateful to Drs. Lana Grinberg, David
E. Cobrinik, Roger Jelliffe, and Steven D. Aird for helpful comments. The idea of the manuscript
was conceptualized during the Ancient DNA Symposium sponsored by the Okinawa Institute of
Science and Technology.
References
1. Green, R. E., Krause, J., Briggs, A. W., et al. 2010, A draft sequence of the Neandertal
genome. Science, 328, 710-722.
2. Rasmussen, M., Li, Y., Lindgreen, S., et al. 2010, Ancient human genome sequence of an
extinct Palaeo-Eskimo. Nature, 463, 757-762.
3. Reich, D., Green, R. E., Kircher, M., et al. 2010, Genetic history of an archaic hominin
group from Denisova Cave in Siberia. Nature, 468, 1053-1060.
4. Meyer, M., Kircher, M., Gansauge, M. T., et al. 2012, A high-coverage genome sequence
from an archaic Denisovan individual. Science, 338, 222-226.
5. Orlando, L., Ginolhac, A., Zhang, G., et al. 2013, Recalibrating Equus evolution using the
genome sequence of an early Middle Pleistocene horse. Nature, 499, 74-78.
6. Bos, K. I., Harkins, K. M., Herbig, A., et al. 2014, Pre-Columbian mycobacterial genomes
reveal seals as a source of New World human tuberculosis. Nature, 514, 494-497.
7. Lazaridis, I., Patterson, N., Mittnik, A., et al. 2014, Ancient human genomes suggest
three ancestral populations for present-day Europeans. Nature, 513, 409-413.
8. Prufer, K., Racimo, F., Patterson, N., et al. 2014, The complete genome sequence of a
Neanderthal from the Altai Mountains. Nature, 505, 43-49.
9. Raghavan, M., DeGiorgio, M., Albrechtsen, A., et al. 2014, The genetic prehistory of the
New World Arctic. Science, 345, 1255832.
10. Raghavan, M., Skoglund, P., Graf, K. E., et al. 2014, Upper Palaeolithic Siberian genome
reveals dual ancestry of Native Americans. Nature, 505, 87-91.
11. Allentoft, M. E., Sikora, M., Sjogren, K. G., et al. 2015, Population genomics of Bronze
Age Eurasia. Nature, 522, 167-172.
12. Haak, W., Lazaridis, I., Patterson, N., et al. 2015, Massive migration from the steppe was
a source for Indo-European languages in Europe. Nature, 522, 207-211.
13. Raghavan, M., Steinrucken, M., Harris, K., et al. 2015, Genomic
evidence for the Pleistocene and recent population history of Native Americans. Science, 349,
aab3884.
14. Skoglund, P., Malmstrom, H., Raghavan, M., et al. 2012, Origins and genetic legacy of
Neolithic farmers and hunter-gatherers in Europe. Science, 336, 466-469.
15. Higuchi, R., Bowman, B., Freiberger, M., Ryder, O. A. and Wilson, A. C. 1984, DNA
sequences from the quagga, an extinct member of the horse family. Nature, 312, 282-284.
16. Pääbo, S., Poinar, H., Serre, D., et al. 2004, Genetic analyses from ancient DNA. Annual
review of genetics, 38, 645-679.
17. Hagelberg, E. and Clegg, J. B. 1991, Isolation and characterization of DNA from
archaeological bone. Proc Biol Sci, 244, 45-50.
18. Stone, A. C., Milner, G. R., Paabo, S. and Stoneking, M. 1996, Sex determination of
ancient human skeletons using DNA. American journal of physical anthropology, 99, 231-238.
19. Taubenberger, J. K., Reid, A. H., Krafft, A. E., Bijwaard, K. E. and Fanning, T. G. 1997,
Initial genetic characterization of the 1918 "Spanish" influenza virus. Science, 275, 1793-1796.
20. Poinar, H. N., Schwarz, C., Qi, J., et al. 2006, Metagenomics to Paleogenomics: Large-
Scale Sequencing of Mammoth DNA. Science, 311, 392-394.
21. Lalueza-Fox, C., Rompler, H., Caramelli, D., et al. 2007, A melanocortin 1 receptor allele
suggests varying pigmentation among Neanderthals. Science, 318, 1453-1455.
22. Krause, J., Lalueza-Fox, C., Orlando, L., et al. 2007, The derived FOXP2 variant of modern
humans was shared with Neandertals. Current biology : CB, 17, 1908-1912.
23. Green, R. E., Malaspinas, A. S., Krause, J., et al. 2008, A Complete Neandertal
Mitochondrial Genome Sequence Determined by High-Throughput Sequencing. Cell, 134, 416-
426.
24. Gilbert, M. T., Kivisild, T., Gronnow, B., et al. 2008, Paleo-Eskimo mtDNA genome reveals
matrilineal discontinuity in Greenland. Science, 320, 1787-1789.
25. Pedersen, J. S., Valen, E., Velazquez, A. M., et al. 2014, Genome-wide nucleosome map
and cytosine methylation levels of an ancient human genome. Genome Res, 24, 454-466.
26. Llorente, M. G., Jones, E. R., Eriksson, A., et al. 2015, Ancient Ethiopian genome reveals
extensive Eurasian admixture throughout the African continent. Science.
27. Meyer, M., Arsuaga, J. L., de Filippo, C., et al. 2016, Nuclear DNA sequences from the
Middle Pleistocene Sima de los Huesos hominins. Nature, 531, 504-507.
28. Fu, Q., Posth, C., Hajdinjak, M., et al. 2016, The genetic history of Ice Age Europe.
Nature, 534, 200-205.
29. Pääbo, S., Gifford, J. A. and Wilson, A. C. 1988, Mitochondrial DNA sequences from a
7000-year old brain. Nucleic Acids Res, 16, 9775-9787.
30. Pickrell, J. K. and Reich, D. 2014, Toward a new history and geography of human genes
informed by ancient DNA. Trends in genetics : TIG, 30, 377-389.
31. Veeramah, K. R. and Hammer, M. F. 2014, The impact of whole-genome sequencing on
the reconstruction of human population history. Nature reviews. Genetics, 15, 149-162.
32. Marean, C. W., Anderson, R. J., Bar-Matthews, M., et al. 2015, A new research strategy
for integrating studies of paleoclimate, paleoenvironment, and paleoanthropology. Evol
Anthropol, 24, 62-72.
33. Jones, M. 2002, The molecule hunt: archaeology and the search for ancient DNA.
Penguin: London.
34. Keller, A., Graefen, A., Ball, M., et al. 2012, New insights into the Tyrolean Iceman's
origin and phenotype as inferred by whole-genome sequencing. Nature communications, 3,
698.
35. Olalde, I., Allentoft, M. E., Sanchez-Quinto, F., et al. 2014, Derived immune and ancestral
pigmentation alleles in a 7,000-year-old Mesolithic European. Nature, 507, 225-228.
36. Olalde, I., Sanchez-Quinto, F., Datta, D., et al. 2014, Genomic analysis of the blood
attributed to Louis XVI (1754-1793), king of France. Scientific reports, 4, 4666.
37. Cerqueira, C. C., Paixao-Cortes, V. R., Zambra, F. M., Salzano, F. M., Hunemeier, T. and
Bortolini, M. C. 2012, Predicting Homo pigmentation phenotype through genomic data: from
Neanderthal to James Watson. American journal of human biology : the official journal of the
Human Biology Council, 24, 705-709.
38. Cale, C. M. 2015, Forensic DNA evidence is not infallible. Nature, 526, 611.
39. Lohse, K. and Frantz, L. A. 2014, Neandertal admixture in Eurasia confirmed by
maximum-likelihood analysis of three genomes. Genetics, 196, 1241-1251.
40. Sankararaman, S., Mallick, S., Dannemann, M., et al. 2014, The genomic landscape of
Neanderthal ancestry in present-day humans. Nature, 507, 354-357.
41. Vernot, B. and Akey, J. M. 2015, Complex history of admixture between modern humans
and Neandertals. American journal of human genetics, 96, 448-453.
42. Vernot, B. and Akey, J. M. 2014, Human evolution: genomic gifts from archaic hominins.
Current biology : CB, 24, R845-848.
43. Vernot, B. and Akey, J. M. 2014, Resurrecting surviving Neandertal lineages from
modern human genomes. Science, 343, 1017-1021.
44. Fu, Q., Hajdinjak, M., Moldovan, O. T., et al. 2015, An early modern human from
Romania with a recent Neanderthal ancestor. Nature, 524, 216-219.
45. Sanchez-Quinto, F. and Lalueza-Fox, C. 2015, Almost 20 years of Neanderthal
palaeogenetics: adaptation, admixture, diversity, demography and extinction. Philosophical
transactions of the Royal Society of London. Series B, Biological sciences, 370, 20130374.
46. Kuhlwilm, M., Gronau, I., Hubisz, M. J., et al. 2016, Ancient gene flow from early modern
humans into Eastern Neanderthals. Nature, 530, 429-433.
47. Wall, J. D., Yang, M. A., Jay, F., et al. 2013, Higher levels of Neanderthal ancestry in East
Asians than in Europeans. Genetics, 194, 199-209.
48. Krause, J., Orlando, L., Serre, D., et al. 2007, Neanderthals in central Asia and Siberia.
Nature, 449, 902-904.
49. Racimo, F., Sankararaman, S., Nielsen, R. and Huerta-Sanchez, E. 2015, Evidence for
archaic adaptive introgression in humans. Nature reviews. Genetics, 16, 359-371.
50. Krause, J., Fu, Q., Good, J. M., et al. 2010, The complete mitochondrial DNA genome of
an unknown hominin from southern Siberia. Nature, 464, 894-897.
51. Sawyer, S., Renaud, G., Viola, B., et al. 2015, Nuclear and mitochondrial DNA sequences
from two Denisovan individuals. Proceedings of the National Academy of Sciences of the United
States of America, 112, 15696-15700.
52. Reich, D., Patterson, N., Kircher, M., et al. 2011, Denisova admixture and the first
modern human dispersals into Southeast Asia and Oceania. American journal of human
genetics, 89, 516-528.
53. Krause, J., Briggs, A. W., Kircher, M., et al. 2010, A Complete mtDNA Genome of an Early
Modern Human from Kostenki, Russia. Current Biology, 20, 231-236.
54. Qin, P. and Stoneking, M. 2015, Denisovan Ancestry in East Eurasian and Native
American Populations. Molecular biology and evolution, 32, 2665-2674.
55. Eriksson, A. and Manica, A. 2012, Effect of ancient population structure on the degree of
polymorphism shared between modern human populations and ancient hominins. Proceedings
of the National Academy of Sciences of the United States of America, 109, 13956-13960.
56. Sankararaman, S., Mallick, S., Patterson, N. and Reich, D. 2016, The Combined
Landscape of Denisovan and Neanderthal Ancestry in Present-Day Humans. Current biology :
CB, 26, 1241-1247.
57. Meyer, M., Fu, Q., Aximu-Petri, A., et al. 2014, A mitochondrial genome sequence of a
hominin from Sima de los Huesos. Nature, 505, 403-406.
58. Arsuaga, J. L., Martinez, I., Arnold, L. J., et al. 2014, Neandertal roots: Cranial and
chronological evidence from Sima de los Huesos. Science, 344, 1358-1363.
59. Arsuaga, J. L., Carretero, J.-M., Lorenzo, C., et al. 2015, Postcranial morphology of the
middle Pleistocene humans from Sima de los Huesos, Spain. PNAS, 112, 11524-11529.
60. Mathieson, I., Lazaridis, I., Rohland, N., et al. 2015, Genome-wide patterns of selection
in 230 ancient Eurasians. Nature, 528, 499-503.
61. Anthony, D. W. 2007, The horse, the wheel, and language : how Bronze-Age riders from
the Eurasian steppes shaped the modern world. Princeton University Press: Princeton, N.J.;
Woodstock.
62. Matisoo-Smith, E. 2015, Ancient DNA and the human settlement of the Pacific: a review.
Journal of human evolution, 79, 93-104.
63. Gao, S. Z., Zhang, Y., Wei, D., et al. 2015, Ancient DNA reveals a migration of the ancient
Di-qiang populations into Xinjiang as early as the early Bronze Age. American journal of physical
anthropology, 157, 71-80.
64. Zhao, Y. B., Li, H. J., Cai, D. W., et al. 2010, Ancient DNA from nomads in 2500-year-old
archeological sites of Pengyang, China. J Hum Genet, 55, 215-218.
65. Liu, W., Martinon-Torres, M., Cai, Y. J., et al. 2015, The earliest unequivocally modern
humans in southern China. Nature, 526, 696-699.
66. Lai, C. S., Fisher, S. E., Hurst, J. A., Vargha-Khadem, F. and Monaco, A. P. 2001, A
forkhead-domain gene is mutated in a severe speech and language disorder. Nature, 413, 519-
523.
67. Rasmussen, S., Allentoft, M. E., Nielsen, K., et al. 2015, Early Divergent Strains of Yersinia
pestis in Eurasia 5,000 Years Ago. Cell, 163, 571-582.
68. Brosch, R., Gordon, S. V., Marmiesse, M., et al. 2002, A new evolutionary scenario for
the Mycobacterium tuberculosis complex. Proceedings of the National Academy of Sciences of
the United States of America, 99, 3684-3689.
69. Nerlich, A. G. and Losch, S. 2009, Paleopathology of human tuberculosis and the
potential role of climate. Interdiscip Perspect Infect Dis, 2009, 437187.
70. Darling, M. I. and Donoghue, H. D. 2014, Insights from paleomicrobiology into the
indigenous peoples of pre-colonial America - a review. Mem Inst Oswaldo Cruz, 109, 131-139.
71. Adler, C. J., Dobney, K., Weyrich, L. S., et al. 2013, Sequencing ancient calcified dental
plaque shows changes in oral microbiota with dietary shifts of the Neolithic and Industrial
revolutions. Nat Genet, 45, 450-455, 455e451.
72. Reid, A. H., Fanning, T. G., Hultin, J. V. and Taubenberger, J. K. 1999, Origin and
evolution of the 1918 "Spanish" influenza virus hemagglutinin gene. Proceedings of the
National Academy of Sciences of the United States of America, 96, 1651-1656.
73. Taubenberger, J. K., Hultin, J. V. and Morens, D. M. 2007, Discovery and characterization
of the 1918 pandemic influenza virus in historical context. Antivir Ther, 12, 581-591.
74. Taubenberger, J. K., Morens, D. M. and Fauci, A. S. 2007, The next influenza pandemic:
can it be predicted? JAMA, 297, 2025-2027.
75. Tumpey, T. M., Basler, C. F., Aguilar, P. V., et al. 2005, Characterization of the
reconstructed 1918 Spanish influenza pandemic virus. Science, 310, 77-80.
76. Tumpey, T. M., Garcia-Sastre, A., Taubenberger, J. K., Palese, P., Swayne, D. E. and
Basler, C. F. 2004, Pathogenicity and immunogenicity of influenza viruses with genes from the
1918 pandemic virus. Proceedings of the National Academy of Sciences of the United States of
America, 101, 3166-3171.
77. Millar, C. D., Huynen, L., Subramanian, S., Mohandesan, E. and Lambert, D. M. 2008,
New developments in ancient genomics. Trends in ecology & evolution, 23, 386-393.
78. Parks, M., Subramanian, S., Baroni, C., et al. 2015, Ancient population genomics and the
study of evolution. Philosophical transactions of the Royal Society of London. Series B, Biological
sciences, 370, 20130381.
79. Haile, J., Froese, D. G., Macphee, R. D., et al. 2009, Ancient DNA reveals late survival of
mammoth and horse in interior Alaska. Proceedings of the National Academy of Sciences of the
United States of America, 106, 22352-22357.
80. Handt, O., Hoss, M., Krings, M. and Pääbo, S. 1994, Ancient DNA: methodological
challenges. Experientia, 50, 524-529.
81. Handt, O., Richards, M., Trommsdorff, M., et al. 1994, Molecular genetic analyses of the
Tyrolean Ice Man. Science, 264, 1775-1778.
82. Lia, V. V., Confalonieri, V. A., Ratto, N., et al. 2007, Microsatellite typing of ancient
maize: insights into the history of agriculture in southern South America. Proceedings.
Biological sciences / The Royal Society, 274, 545-554.
83. Haak, W., Balanovsky, O., Sanchez, J. J., et al. 2010, Ancient DNA from European early
neolithic farmers reveals their near eastern affinities. PLoS Biol, 8, e1000536.
84. Malaspinas, A. S., Lao, O., Schroeder, H., et al. 2014, Two ancient human genomes
reveal Polynesian ancestry among the indigenous Botocudos of Brazil. Current biology : CB, 24,
R1035-1037.
85. Green, R. E., Briggs, A. W., Krause, J., et al. 2009, The Neandertal genome and ancient
DNA authenticity. The EMBO journal, 28, 2494-2502.
86. Green, R. E., Krause, J., Ptak, S. E., et al. 2006, Analysis of one million base pairs of
Neanderthal DNA. Nature, 444, 330-336.
87. Dabney, J., Knapp, M., Glocke, I., et al. 2013, Complete mitochondrial genome sequence
of a Middle Pleistocene cave bear reconstructed from ultrashort DNA fragments. Proceedings of
the National Academy of Sciences of the United States of America, 110, 15758-15763.
88. Gilbert, M. T. P., Tomsho, L. P., Rendulic, S., et al. 2007, Whole-Genome Shotgun
Sequencing of Mitochondria from Ancient Hair Shafts. Science, 317, 1927-1930.
89. Miller, W., Drautz, D. I., Ratan, A., et al. 2008, Sequencing the nuclear genome of the
extinct woolly mammoth. Nature, 456, 387-390.
90. Golenberg, E. M. 1991, Amplification and analysis of Miocene plant fossil DNA.
Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 333, 419-
426; discussion 426-417.
91. Golenberg, E. M., Giannasi, D. E., Clegg, M. T., et al. 1990, Chloroplast DNA sequence
from a Miocene Magnolia species. Nature, 344, 656-658.
92. Cano, R. J. and Poinar, H. N. 1993, Rapid isolation of DNA from fossil and museum
specimens suitable for PCR. BioTechniques, 15, 432-434, 436.
93. Cano, R. J., Poinar, H. N., Pieniazek, N. J., Acra, A. and Poinar, G. O., Jr. 1993,
Amplification and sequencing of DNA from a 120-135-million-year-old weevil. Nature, 363, 536-
538.
94. Vreeland, R. H., Rosenzweig, W. D. and Powers, D. W. 2000, Isolation of a 250 million-
year-old halotolerant bacterium from a primary salt crystal. Nature, 407, 897-900.
95. Woodward, S. R., Weyand, N. J. and Bunnell, M. 1994, DNA sequence from Cretaceous
period bone fragments. Science, 266, 1229-1232.
96. Li, Y., An, C.-C. and Zhu, Y.-X. 1995, DNA isolation and sequence analysis of dinosaur
DNA from Cretaceous dinosaur egg in Xixia Henan, China. Acta Sci. Nat. Univ. Pekinensis, 31,
148-152.
97. Cooper, A. and Poinar, H. N. 2000, Ancient DNA: do it right or not at all. Science, 289,
1139.
98. Austin, J. J., Ross, A. J., Smith, A. B., Fortey, R. A. and Thomas, R. H. 1997, Problems of
reproducibility--does geologically ancient DNA survive in amber-preserved insects? Proc Biol Sci,
264, 467-474.
99. Allard, M. W., Young, D. and Huyen, Y. 1995, Detecting dinosaur DNA. Science, 268,
1192; author reply 1194.
100. Hedges, S. B. and Schweitzer, M. H. 1995, Detecting dinosaur DNA. Science, 268, 1191-
1192; author reply 1194.
101. Henikoff, S. 1995, Detecting dinosaur DNA. Science, 268, 1192; author reply 1194.
102. Zischler, H., Hoss, M., Handt, O., von Haeseler, A., van der Kuyl, A. C. and Goudsmit, J.
1995, Detecting dinosaur DNA. Science, 268, 1192-1193; author reply 1194.
103. Austin, J. J., Smith, A. B. and Thomas, R. H. 1997, Palaeontology in a molecular world:
the search for authentic ancient DNA. Trends in ecology & evolution, 12, 303-306.
104. Graur, D. and Pupko, T. 2001, The Permian bacterium that isn't. Molecular biology and
evolution, 18, 1143-1146.
105. Gutierrez, G. and Marin, A. 1998, The most ancient DNA recovered from an amber-
preserved specimen may not be as ancient as it seems. Molecular biology and evolution, 15,
926-929.
106. Nicholls, H. 2005, Ancient DNA comes of age. PLoS Biol, 3, e56.
107. Knapp, M., Lalueza-Fox, C. and Hofreiter, M. 2015, Re-inventing ancient human DNA.
Investigative genetics, 6, 4.
108. Damgaard, P. B., Margaryan, A., Schroeder, H., Orlando, L., Willerslev, E. and Allentoft,
M. E. 2015, Improving access to endogenous DNA in ancient bones and teeth. Scientific reports,
5, 11184.
109. Mikheyev, A. S., Tin, M. M., Arora, J. and Seeley, T. D. 2015, Museum samples reveal
rapid evolution by wild honey bees exposed to a novel parasite. Nature communications, 6,
7991.
110. Gansauge, M. T. and Meyer, M. 2014, Selective enrichment of damaged DNA molecules
for ancient genome sequencing. Genome Res, 24, 1543-1549.
111. Burbano, H. A., Hodges, E., Green, R. E., et al. 2010, Targeted Investigation of the
Neanderthal Genome by Array-Based Sequence Capture. Science, 328, 723-725.
112. Prüfer, K., Stenzel, U., Hofreiter, M., Pääbo, S., Kelso, J. and Green, R. E. 2010,
Computational challenges in the analysis of ancient DNA. Genome biology, 11, R47.
113. Oskam, C. L., Haile, J., McLay, E., et al. 2010, Fossil avian eggshell preserves ancient DNA.
Proceedings of the Royal Society B: Biological Sciences, 277, 1991-2000.
114. Poinar, H., Kuch, M., McDonald, G., Martin, P. and Pääbo, S. 2003, Nuclear gene
sequences from a late Pleistocene sloth coprolite. Current biology, 13, 1150-1152.
115. Adcock, G. J., Dennis, E. S., Easteal, S., et al. 2001, Mitochondrial DNA sequences in
ancient Australians: Implications for modern human origins. Proceedings of the National
Academy of Sciences of the United States of America, 98, 537-542.
116. Noonan, J. P., Coop, G., Kudaravalli, S., et al. 2006, Sequencing and Analysis of
Neanderthal Genomic DNA. Science, 314, 1113-1118.
117. Briggs, A. W., Good, J. M., Green, R. E., et al. 2009, Targeted retrieval and analysis of five
Neandertal mtDNA genomes. Science, 325, 318-321.
118. Lalueza-Fox, C., Gigli, E., de la Rasilla, M., et al. 2008, Genetic characterization of the
ABO blood group in Neandertals. BMC evolutionary biology, 8, 342.
119. Rasmussen, M., Guo, X., Wang, Y., et al. 2011, An Aboriginal Australian genome reveals
separate human dispersals into Asia. Science, 334, 94-98.
120. Kolman, C. J. and Tuross, N. 2000, Ancient DNA analysis of human populations.
American journal of physical anthropology, 111, 5-23.
121. Lacan, M., Keyser, C., Ricaut, F.-X., et al. 2011, Ancient DNA reveals male diffusion
through the Neolithic Mediterranean route. Proceedings of the National Academy of Sciences of
the United States of America, 108, 9788-9791.
122. Shapiro, B., Drummond, A. J., Rambaut, A., et al. 2004, Rise and Fall of the Beringian
Steppe Bison. Science, 306, 1561-1565.
123. Barnes, I., Matheus, P., Shapiro, B., Jensen, D. and Cooper, A. 2002, Dynamics of
Pleistocene Population Extinctions in Beringian Brown Bears. Science, 295, 2267-2270.
124. Hofreiter, M., Jaenicke, V., Serre, D., von Haeseler, A. and Pääbo, S. 2001, DNA
sequences from multiple amplifications reveal artifacts induced by cytosine deamination in
ancient DNA. Nucleic Acids Res, 29, 4793-4799.
125. Gilbert, M. T., Hansen, A. J., Willerslev, E., et al. 2003, Characterization of genetic
miscoding lesions caused by postmortem damage. American journal of human genetics, 72, 48-
61.
126. Gilbert, M. T., Willerslev, E., Hansen, A. J., et al. 2003, Distribution patterns of
postmortem damage in human mitochondrial DNA. American journal of human genetics, 72,
32-47.
127. Hansen, A., Willerslev, E., Wiuf, C., Mourier, T. and Arctander, P. 2001, Statistical
evidence for miscoding lesions in ancient DNA templates. Molecular biology and evolution, 18,
262-265.
128. Briggs, A. W., Stenzel, U., Johnson, P. L. F., et al. 2007, Patterns of damage in genomic
DNA sequences from a Neandertal. Proceedings of the National Academy of Sciences of the
United States of America, 104, 14616-14621.
129. Lindahl, T., Ljungquist, S., Siegert, W., Nyberg, B. and Sperens, B. 1977, DNA N-
glycosidases: properties of uracil-DNA glycosidase from Escherichia coli. The Journal of
biological chemistry, 252, 3286-3294.
130. Briggs, A. W., Stenzel, U., Meyer, M., Krause, J., Kircher, M. and Pääbo, S. 2010, Removal
of deaminated cytosines and detection of in vivo methylation in ancient DNA. Nucleic Acids Res,
38, e87.
131. Willerslev, E., Davison, J., Moora, M., et al. 2014, Fifty thousand years of Arctic
vegetation and megafaunal diet. Nature, 506, 47-51.
132. Tin, M. M., Rheindt, F. E., Cros, E. and Mikheyev, A. S. 2015, Degenerate adaptor
sequences for detecting PCR duplicates in reduced representation sequencing data improve
genotype calling accuracy. Mol Ecol Resour, 15, 329-336.
133. Burbano, H. A., Green, R. E., Maricic, T., et al. 2012, Analysis of human accelerated DNA
regions using archaic hominin genomes. PloS one, 7, 1-8.
134. Kao, W. C., Stevens, K. and Song, Y. S. 2009, BayesCall: A model-based base-calling
algorithm for high-throughput short-read sequencing. Genome Res, 19, 1884-1895.
135. Renaud, G., Kircher, M., Stenzel, U. and Kelso, J. 2013, freeIbis: an efficient basecaller
with calibrated quality scores for Illumina sequencers. Bioinformatics, 29, 1208-1209.
136. Andrews, S. 2015, Babraham Bioinformatics. Babraham Institute.
137. Lindgreen, S. 2012, AdapterRemoval: easy cleaning of next-generation sequencing
reads. BMC research notes, 5, 337.
138. Martin, M. 2015, Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.journal, 17.
139. St. John, J. 2015, SeqPrep.
140. Schubert, M., Ginolhac, A., Lindgreen, S., et al. 2012, Improving ancient DNA read
mapping against modern reference genomes. BMC Genomics, 13, 178.
141. DePristo, M. A., Banks, E., Poplin, R., et al. 2011, A framework for variation discovery
and genotyping using next-generation DNA sequencing data. Nat Genet, 43, 491-498.
142. McKenna, A., Hanna, M., Banks, E., et al. 2010, The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing data. Genome research,
20, 1297-1303.
143. Li, H., Handsaker, B., Wysoker, A., et al. 2009, The Sequence Alignment/Map format and
SAMtools. Bioinformatics, 25, 2078-2079.
144. Renaud, G. 2015, aLib: a sequencing pipeline for ancient and modern DNA.
145. Schubert, M., Ermini, L., Der Sarkissian, C., et al. 2014, Characterization of ancient and
modern genomes by SNP detection and phylogenomic and metagenomic analysis using
PALEOMIX. Nature protocols, 9, 1056-1082.
146. Renaud, G., Stenzel, U. and Kelso, J. 2014, leeHom: adaptor trimming and merging for
Illumina sequencing reads. Nucleic Acids Res, 42, e141.
147. Kerpedjiev, P., Frellsen, J., Lindgreen, S. and Krogh, A. 2014, Adaptable probabilistic
mapping of short reads using position specific scoring matrices. BMC bioinformatics, 15, 100.
148. Ginolhac, A., Rasmussen, M., Gilbert, M. T., Willerslev, E. and Orlando, L. 2011,
mapDamage: testing for damage patterns in ancient DNA sequences. Bioinformatics, 27, 2153-
2155.
149. Jonsson, H., Ginolhac, A., Schubert, M., Johnson, P. L. and Orlando, L. 2013,
mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters.
Bioinformatics, 29, 1682-1684.
150. Jakobsson, M. 2015, PMDtools. http://www.ieg.uu.se/jakobsson/software/pmdtools/.
151. Renaud, G., Slon, V., Duggan, A. T. and Kelso, J. 2015, Schmutzi: estimation of
contamination and endogenous mitochondrial consensus calling for ancient DNA. Genome Biol,
16, 224.
152. Lindgreen, S., Krogh, A. and Pedersen, J. S. 2014, SNPest: a probabilistic graphical model
for estimating genotypes. BMC research notes, 7, 698.
153. Callaway, E. 2016, Error found in study of first ancient African genome. Nature.
154. Rivollat, M., Mendisco, F., Pemonge, M. H., et al. 2015, When the waves of European
neolithization met: first paleogenetic evidence from early farmers in the southern Paris Basin.
PloS one, 10, e0125521.
155. Der Sarkissian, C., Balanovsky, O., Brandt, G., et al. 2013, Ancient DNA reveals
prehistoric gene-flow from Siberia in the complex human population history of North East
Europe. PLoS Genet, 9, e1003296.
156. Skoglund, P., Malmstrom, H., Omrak, A., et al. 2014, Genomic diversity and admixture
differs for Stone-Age Scandinavian foragers and farmers. Science, 344, 747-750.
157. Skoglund, P., Stora, J., Gotherstrom, A. and Jakobsson, M. 2013, Accurate sex
identification of ancient human remains using DNA shotgun sequencing. Journal of
Archaeological Science, 40, 4477-4482.
158. Brotherton, P., Haak, W., Templeton, J., et al. 2013, Neolithic mitochondrial haplogroup
H genomes and the genetic origins of Europeans. Nature communications, 4, 1764.
159. Moorjani, P., Sankararaman, S., Fu, Q., Przeworski, M., Patterson, N. and Reich, D. 2016,
A genetic method for dating ancient genomes provides a direct estimate of human generation
interval in the last 45,000 years. Proceedings of the National Academy of Sciences of the United
States of America, 113, 5652-5657.
160. Elhaik, E. 2012, Empirical Distributions of FST from Large-Scale Human Polymorphism
Data. PloS one, 7, e49837.
161. Novembre, J., Johnson, T., Bryc, K., et al. 2008, Genes mirror geography within Europe.
Nature, 456, 98-101.
162. Falush, D., Stephens, M. and Pritchard, J. K. 2003, Inference of population structure
using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164,
1567-1587.
163. Alexander, D. H., Novembre, J. and Lange, K. 2009, Fast model-based estimation of
ancestry in unrelated individuals. Genome Res, 19, 1655-1664.
164. Yang, W. Y., Platt, A., Chiang, C. W., Eskin, E., Novembre, J. and Pasaniuc, B. 2014, Spatial
localization of recent ancestors for admixed individuals. G3 (Bethesda), 4, 2505-2518.
165. Yang, W. Y., Novembre, J., Eskin, E. and Halperin, E. 2012, A model-based approach for
analysis of spatial structure in genetic data. Nature Genetics, 44, 725-731.
166. Patterson, N., Moorjani, P., Luo, Y., et al. 2012, Ancient admixture in human history.
Genetics, 192, 1065-1093.
167. Elhaik, E., Tatarinova, T., Chebotarev, D., et al. 2014, Geographic population structure
analysis of worldwide human populations infers their biogeographical origins. Nature
communications, 5, 3513.
168. Sankararaman, S., Sridhar, S., Kimmel, G. and Halperin, E. 2008, Estimating local ancestry
in admixed populations. Am J Hum Genet, 82, 290-303.
169. Price, A. L., Tandon, A., Patterson, N., et al. 2009, Sensitive detection of chromosomal
segments of distinct ancestry in admixed populations. PLoS Genet, 5, e1000519.
170. Kozlov, K., Chebotarov, D., Hassan, M., et al. 2015, Differential Evolution Approach to
Detect Recent Admixture. BMC Genomics, 16, S9.
171. Churchhouse, C. and Marchini, J. 2013, Multiway admixture deconvolution using phased
or unphased ancestral panels. Genet Epidemiol, 37, 1-12.
172. Sohn, K. A., Ghahramani, Z. and Xing, E. P. 2012, Robust estimation of local genetic
ancestry in admixed populations using a nonparametric Bayesian approach. Genetics, 191,
1295-1308.
173. Tang, H., Coram, M., Wang, P., Zhu, X. and Risch, N. 2006, Reconstructing genetic
ancestry blocks in admixed individuals. American journal of human genetics, 79, 1-12.
174. Skotte, L., Korneliussen, T. S. and Albrechtsen, A. 2013, Estimating individual admixture
proportions from next generation sequencing data. Genetics, 195, 693-702.
175. Limburg, P. A. and Weider, L. J. 2002, 'Ancient' DNA in the resting egg bank of a
microcrustacean can serve as a palaeolimnological database. Proc Biol Sci, 269, 281-287.
176. Jaenicke-Despres, V., Buckler, E. S., Smith, B. D., et al. 2003, Early allelic selection in
maize as revealed by ancient DNA. Science, 302, 1206-1208.
177. Lalueza-Fox, C., Gigli, E., de la Rasilla, M., Fortea, J. and Rosas, A. 2009, Bitter taste
perception in Neanderthals through the analysis of the TAS2R38 gene. Biol Lett, 5, 809-811.
178. Lynch, M., Xu, S., Maruki, T., Jiang, X., Pfaffelhuber, P. and Haubold, B. 2014, Genome-
wide linkage-disequilibrium profiles from single individuals. Genetics, 198, 269-281.
179. Mikheyev, A. S., Bresson, S. and Conant, P. 2009, Single-queen introductions
characterize regional and local invasions by the facultatively clonal little fire ant Wasmannia
auropunctata. Mol Ecol, 18, 2937-2944.
180. Gokhman, D., Lavi, E., Prufer, K., et al. 2014, Reconstructing the DNA methylation maps
of the Neandertal and the Denisovan. Science, 344, 523-527.
181. Llamas, B., Holland, M. L., Chen, K., Cropley, J. E., Cooper, A. and Suter, C. M. 2012, High-
resolution analysis of cytosine methylation in ancient DNA. PloS one, 7, e30226.
182. Smith, O., Clapham, A. J., Rose, P., Liu, Y., Wang, J. and Allaby, R. G. 2014, Genomic
methylation patterns in archaeological barley show de-methylation as a time-dependent
diagenetic process. Scientific reports, 4, 5559.
183. Zakany, J. and Duboule, D. 2007, The role of Hox genes during vertebrate limb
development. Current opinion in genetics & development, 17, 359-366.
184. Hansen, K. D., Timp, W., Bravo, H. C., et al. 2011, Increased methylation variation in
epigenetic domains across cancer types. Nat Genet, 43, 768-775.
185. Berman, B. P., Weisenberger, D. J., Aman, J. F., et al. 2012, Regions of focal DNA
hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear
lamina-associated domains. Nat Genet, 44, 40-46.
186. Xie, W., Schultz, M. D., Lister, R., et al. 2013, Epigenomic analysis of multilineage
differentiation of human embryonic stem cells. Cell, 153, 1134-1148.
187. Nakamura, R., Tsukahara, T., Qu, W., et al. 2014, Large hypomethylated domains serve
as strong repressive machinery for key developmental genes in vertebrates. Development, 141,
2568-2580.
188. Jeong, M., Sun, D., Luo, M., et al. 2014, Large conserved domains of low DNA
methylation maintained by Dnmt3a. Nat Genet, 46, 17-23.
189. Abu-Remaileh, M., Bender, S., Raddatz, G., et al. 2015, Chronic inflammation induces a
novel epigenetic program that is conserved in intestinal adenomas and in colorectal cancer.
Cancer research, 75, 2120-2130.
190. Kader, F. and Ghai, M. 2015, DNA methylation and application in forensic sciences.
Forensic science international, 249, 255-265.
191. Seguin-Orlando, A., Korneliussen, T. S., Sikora, M., et al. 2014, Paleogenomics. Genomic
structure in Europeans dating back at least 36,200 years. Science, 346, 1113-1118.
192. Hendrich, B. and Bird, A. 1998, Identification and characterization of a family of
mammalian methyl-CpG binding proteins. Molecular and cellular biology, 18, 6538-6547.
193. Brogaard, K., Xi, L., Wang, J. P. and Widom, J. 2012, A map of nucleosome positions in
yeast at base-pair resolution. Nature, 486, 496-501.
194. Crawford, G. E., Holt, I. E., Whittle, J., et al. 2006, Genome-wide mapping of DNase
hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res, 16,
123-131.
195. Tatarinova, T., Kryshchenko, A., Triska, M., et al. 2013, NPEST: a nonparametric method
and a database for transcription start site prediction. Quant Biol, 1, 261-271.
196. Troukhan, M., Tatarinova, T., Bouck, J., Flavell, R. B. and Alexandrov, N. N. 2009,
Genome-wide discovery of cis-elements in promoter sequences using gene expression. OMICS,
13, 139-151.
197. Menotti, F. and O'Sullivan, A. 2013, The Oxford handbook of wetland archaeology.
Oxford University Press: Oxford, United Kingdom.
198. Kharlamova, A., Saveliev, S., Kurtova, A., et al. 2016, Preserved brain of the Woolly
mammoth (Mammuthus primigenius (Blumenbach 1799)) from the Yakutian permafrost.
Quaternary International, 406, 86-93.
199. Zeng, J., Konopka, G., Hunt, B. G., Preuss, T. M., Geschwind, D. and Yi, S. V. 2012,
Divergent whole-genome methylation maps of human and chimpanzee brains reveal epigenetic
basis of human regulatory evolution. American journal of human genetics, 91, 455-465.
200. Chopra, P., Papale, L. A., White, A. T., et al. 2014, Array-based assay detects genome-
wide 5-mC and 5-hmC in the brains of humans, non-human primates, and mice. BMC Genomics,
15, 131.
201. Sudmant, P. H., Mallick, S., Nelson, B. J., et al. 2015, Global diversity, population
stratification, and selection of human copy-number variation. Science, 349, aab3761.
202. Karmin, M., Saag, L., Vicente, M., et al. 2015, A recent bottleneck of Y chromosome
diversity coincides with a global change in culture. Genome Res, 25, 459-466.
203. Leslie, S., Winney, B., Hellenthal, G., et al. 2015, The fine-scale genetic structure of the
British population. Nature, 519, 309-314.
204. ArunKumar, G., Tatarinova, T. V., Duty, J., et al. 2015, Genome-wide signatures of male-
mediated migration shaping the Indian gene pool. J Hum Genet, 60, 493-499.
205. Kushniarevich, A., Utevska, O., Chuhryaeva, M., et al. 2015, Genetic Heritage of the
Balto-Slavic Speaking Populations: A Synthesis of Autosomal, Mitochondrial and Y-
Chromosomal Data. PloS one, 10, e0135820.
206. Yunusbayev, B., Metspalu, M., Metspalu, E., et al. 2015, The genetic legacy of the
expansion of Turkic-speaking nomads across Eurasia. PLoS Genet, 11, e1005068.
207. Flegontov, P., Changmai, P., Zidkova, A., et al. 2016, Genomic study of the Ket: a Paleo-
Eskimo-related ethnic group with significant ancient North Eurasian ancestry. Scientific reports,
6.
208. Elhaik, E., Greenspan, E., Staats, S., et al. 2013, The GenoChip: a new tool for genetic
anthropology. Genome Biol Evol, 5, 1021-1031.
Supplementary Table 3.1. Software packages, pipelines and tools suitable for aDNA analysis. The following
program codes are used: T – read trimming, Q – quality control, M – read mapping, C – variant calling, A –
admixture analysis, P – population analysis, S – transcription start site prediction, L – multiple sequence
alignment.
Program name | Code | Short description | Available at
AdapterRemoval | T | Searches and removes remnant adapter sequences from NGS data | https://github.com/MikkelSchubert/adapterremoval
ADMIXTOOLS | A | A software package that supports formal tests of whether admixture occurred, and makes it possible to infer admixture proportions and dates | https://github.com/DReichLab/AdmixTools
ADMIXTURE | A | Software tool for maximum likelihood estimation of individual ancestries from multi-locus SNP genotype datasets | https://www.genetics.ucla.edu/software/admixture/
aLib | TQ | Sequencing pipeline for ancient and modern DNA | https://github.com/grenaud/aLib
Anfo | M | Reads mapper similar to Soap/Maq/Bowtie, useful when sample is significantly divergent from reference | https://bioinf.eva.mpg.de/anfo/
Arlequin | P | A tool for population genetic data analysis | http://cmpg.unibe.ch/software/arlequin35/
BayesCall | C | Base calling algorithm for NGS | http://bayescall.sourceforge.net/
BayeSSC | P | Tool for Bayesian inference based on coalescent simulations | http://web.stanford.edu/group/hadlylab/ssc/index.html
BEAGLE | C | Software package that performs genotype calling, genotype phasing, imputation of un-genotyped markers, and identity-by-descent segment detection | https://faculty.washington.edu/browning/beagle/beagle.html
BWA-PSSM | M | Probabilistic short read aligner | http://bwa-pssm.binf.ku.dk/
CutAdapt | T | Tool to find and remove adapter sequences | https://cutadapt.readthedocs.org
DnaSP | P | Tools for DNA sequence polymorphism data within and between populations plus tests of neutrality | http://www.ub.edu/dnasp/
FastQC | Q | Quality control tool for high throughput sequence data | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FreeIbis | C | An efficient base caller for Illumina sequencers with calibrated quality scores | https://github.com/grenaud/freeIbis
GenePop | P | A tool for population genetic data analysis | http://genepop.curtin.edu.au/
GPS | P | Sample Provenance Predictor | http://chcb.saban-chla.usc.edu/gps/
HAPMIX | A | Software for identifying ancestry segments in admixed individuals | http://www.stats.ox.ac.uk/~myers/software.html
leeHom | T | Bayesian reconstruction of ancient DNA fragments | https://github.com/grenaud/leehom
MEGA | P | A tool for molecular evolutionary genetic analyses (phylogenetic trees, distances, substitution models, etc.) | http://www.megasoftware.net/
MIA | MC | Consensus calling (or "reference assisted assembly"), chiefly of ancient mitochondria | https://github.com/mpieva/mapping-iterative-assembler
MRBAYES | P | Bayesian inference of phylogeny | http://mrbayes.sourceforge.net/
MUSCLE | L | Multiple sequence alignment | http://www.ebi.ac.uk/Tools/msa/muscle/
NETWORK | P | Parsimony method to construct phylogenetic trees and networks | http://www.fluxus-engineering.com/sharenet.htm
NGSADMIX | A | Tool for finding admixture proportions from NGS data | http://www.popgen.dk/software/index.php/NgsAdmix
NPEST S de-novo TSS
prediction
http://link.springer.com/article/10.1007%2Fs40484-013-
0022-2
http://chcb.saban-chla.usc.edu/NPEST/NPEST.gz
PALEOMIX TQMC
User-friendly package
for largely automates
the analyses related to
whole genome re-
sequencing.
https://github.com/MikkelSchubert/paleomix
PAML P A tool for maximum
likelihood
phylogenetic tree
reconstruction
http://abacus.gene.ucl.ac.uk/software/paml.html
PAUP P Tools for phylogenetic
reconstruction
including distance
methods, parsimony
and maximum
Likelihood
http://paup.csit.fsu.edu/
PUZZLE P A maximum likelihood
method for inferring
phylogenetic trees
http://www.tree-puzzle.de/
RAXML P A method for
maximum likelihood
phylogenetic tree
reconstruction
http://sco.h-
its.org/exelixis/web/software/raxml/index.html
reAdmix P Prediction of
provenance for
individuals of recently
admixed origin
http://chcb.saban-chla.usc.edu/reAdmix/
155
SABER A Reconstructing genetic
ancestry blocks in
admixed individuals
http://med.stanford.edu/tanglab/software/saber.html
Schmuzi QC Estimation of
contamination and
consensus calling for
ancient mitochondrial
DNA
https://bioinf.eva.mpg.de/schmutzi/
SeqPrep T Stripping adaptors
and/or merging paired
reads with overlap
into single reads
https://github.com/jstjohn/SeqPrep
TFPGA P Tools for population
genetic analysis of
allozymes and other
molecular markers
http://www.marksgeneticsoftware.net/tfpga.htm
TSSer S de-novo TSS
prediction
http://online.liebertpub.com/doi/abs/10.1089/omi.2008.
0034
http://chcb.saban-chla.usc.edu/TSSer/TSSer.gz
Supplementary Table S3.2 is available in .xlsx format in the electronic version of this dissertation.
Chapter 4. Inferring the effective number of stem cells, somatic mutation rate and cellular dynamics of regeneration from multi-generational genomic data in asexual planaria
Summary
Planarian flatworms have recently regained much attention due to their extraordinary
regeneration capacity. Their clonal mode of reproduction along with high genetic heterogeneity
among the cells within an individual worm’s body make these organisms very appealing models
to study parameters such as somatic mutation rate and intraorganismal cell-lineage selection,
too. Due to lack of transgenes, heterogeneity of stem cell populations with respect to molecular
markers and the inability to distinguish stem cells from their early committed progeny by
molecular markers or dyeing experiments, several important aspects of body regeneration in
these model systems remain unclear, including the number of stem cells actively contributing
to regeneration at each reproductive cycle. In this study we sampled a pedigree of planarian
worms of the species Dugesia gonocephala for 16 generations; and generated genomic
sequences from every other generation (a total of 8 samples). After each fission, the tail piece
was used for sequencing and the head piece was allowed to grow and further the line. In the
analysis, we modeled each cell as an individual and the body of a worm as a population that
goes into cycles of splitting into two nearly equal subpopulations each of which would grow to
restore the original size. From temporal variance in allele frequencies (magnitude of genetic
drift), we observed that the effective number of stem cells can vary substantially across
reproductive cycles, ranging from 10 to 900 with the harmonic mean around 20. Average
genomic nucleotide diversity was $\pi = 1.75 \times 10^{-4}$. Assuming neutral evolution, the
somatic mutation rate was estimated as $\mu = \pi / (2N_e) = 4.55 \times 10^{-6}$. Treating the
whole genome as a single locus (asexual species, no recombination), Tajima’s D was estimated to
be $D = 2.38$. This is consistent with strong balancing selection, structured inheritance of cell
lineages or two genetically diverged sets of homologous chromosomes.
Key words: Planaria, Dugesia gonocephala, stem cells, regeneration, effective population size,
somatic mutation rate
4.1. Introduction
In addition to being the name of the genus Planaria, “planaria” or “planarians” is also a
general term that refers to the free-living flatworms of the order Tricladida (phylum
Platyhelminthes, class Rhabditophora) consisting of hundreds of terrestrial and aquatic species
(Álvarez-Presas, Baguñà, & Riutort, 2008). Their unusually high potential to reconstruct full
bodies after injury and even from small body pieces has been known for more than a century,
and many aspects of its mechanism have been researched and discovered. Planarians can grow
or degrow depending on the nutritional conditions, which is a consequence of change in cell
number rather than cell size (Newmark & Alvarado, 2002). Both addition of food and infliction
of an injury can elicit a response in the form of increased M-phase (mitotic) cells in a matter of
6-8 hours, which regresses to the baseline after a few days (Rink, 2013). Growth and
regeneration rely mainly on a very large number of stem cells (Neoblasts) which, by
morphological observations, comprise 25-30% of the cells in an animal’s body. Neoblasts are
the only dividing cells in Planaria (Reddien & Alvarado, 2004). They can undergo asymmetric
division to produce one stem cell and one progenitor cell committed to differentiation, or
symmetric division to produce either two stem cells or two committed cells. The rates and
ratios of these division types are regulated to maintain the desired number of stem cells in the
body during natural clonal expansion. Also, the process of recovery after partial irradiation,
which kills most of the stem cells, restores the original number of stem cells, which suggests a
cell population-level mechanism to control the proportion of stem cells in the body (Wagner,
Ho, & Reddien, 2012); too many stem cells might cause cancer and too few would result in
reduced tissue turn-over and premature ageing (Rink, 2013). Planarians not only regulate stem
cell divisions, but also program destruction of aged somatic cells to achieve their high rate of
tissue turn-over and, when necessary, degrowth (Rink, 2013). Neoblasts
are scattered through most of the worm’s body parts. The pharynx and the area anterior to the
photoreceptors lack Neoblasts and must therefore rely on stem cell migration for tissue
turnover or regeneration. BrdU labeling experiments (to capture mitotically active cells) show
that the migration of stem cells or early progenitors during wound repair is an active process
(Newmark & Alvarado, 2002; Rink, 2013). They also show that a small percentage of stem cells
are active at any time; in an intact animal, around 6% of Neoblasts are labeled with BrdU soon
after a single injection. The state-of-the-art knowledge and the main challenges in planarian
research have been summarized in a number of excellent reviews (Reddien & Alvarado, 2004;
Rink, 2013). Among the unknowns is the exact number of active stem cells during regeneration.
Due to the heterogeneity of stem cells in terms of morphological and cellular features, as well
as the overlap between them and those of early post-mitotic committed progeny cells,
methods based on molecular markers such as PCNA or H3P (Histone 3 phosphorylated at Ser10)
or BrdU positive cells are not guaranteed to visualize stem cells exclusively (Newmark &
Alvarado, 2002; Rink, 2013). Another pitfall of most of the current methods is that they involve
BrdU injection, experimental wounding, or irradiation which may alter the natural physiology of
the animal significantly. For example, it was demonstrated that when the anterior half of the
body is irradiated and its tip is amputated, stem cells migrate long distances to replenish the
pool of Neoblasts in the irradiated half and reconstruct the amputated part. Without
amputation, however, the irradiated half would become necrotic and die. A wound seemed
necessary to signal the extensive migration of stem cells (Reddien & Alvarado, 2004). So the
question is: how comparable is the response to natural fission with the response to an
experimental wound? Another central question in planarian research has been whether there is
only one generalist stem cell type that produces all tissue types, or whether there are subtypes
each specialized to produce certain tissues and organs (Reddien, 2013). By transplanting a single
stem cell into a completely irradiated animal, Wagner, Wang, & Reddien (2011) showed
evidence in favor of the single generalist type hypothesis. However, this does not definitively
clarify the cellular inheritance and activity patterns of stem cells under natural conditions.
Apart from being very interesting models to understand the molecular mechanisms of
regeneration, Planarians are also a curious case for evolutionary genetic studies. Ploidy number
and mode of reproduction (sexual as cross-fertilizing hermaphrodites or fissiparous asexual) can
vary greatly even among members of a single genus (Lázaro et al., 2009). In nature, asexual
reproduction is accomplished by a transverse fission of the worm’s body producing a head
piece and a tail piece, after which each piece grows to a new complete worm through mitotic
cell divisions. This process can continue indefinitely. Because planarians do not undergo the
single cell bottleneck like sexually reproducing species, a significant level of genetic
heterogeneity is expected to exist within a body (Nishimura et al., 2015). In addition to possible
physiological effects of this mosaicism, accumulation of somatic mutations and selection among
cell lineages bring back one of the fundamental questions of evolution: why doesn’t
competition among biological units at a lower level of organization disrupt their cooperation at
a higher level? (Szathmáry & Maynard Smith, 1995). The essence of the answer lies in genetic
similarity of units within a compartment (e.g. cells within one individual’s body) and their
divergence from units in other compartments (cells of other individuals). In sexual reproducers
only germ cells transmit their progeny to future generations, but somatic cells “do not revolt”
because they have the same hereditary material as the germline (equivalent to the fertile
queen and male bees vs. sterile worker bees). But in the case of Planarians (and many other
clonal species), the high level of genetic heterogeneity among cells within a body renders the
argument of relatedness weak, and hints at potentially consequential somatic cell competition.
Somatic cells and the mutations they carry will be transmitted to the next generation and the
rate of mitotic divisions will be directly proportional to contribution to the next generation.
Also, mutator alleles will accompany their induced mutations because there is no
recombination. So, several interesting questions can be asked: What prevents all worm cell
lines from becoming cancerous? How does this dynamic affect the optimal somatic mutation
rate? Are the asexually reproducing worms in a genetic steady-state in terms of variation, or do
they go through cycles of variation depletion due to asexual reproduction and restoration of
variation by occasional sexual reproduction? How does this dynamic affect the genomic targets
of natural selection and the rate of DNA sequence evolution?
In this study, we addressed some of the above questions using temporally spaced (multi-
generational) genomic sequence data. We followed a pedigree of lab-reared planarian worms
of species Dugesia gonocephala for 16 generations and sequenced genomic libraries created
from half bodies every other generation (a total of 8 samples). In our analysis, we treated each
cell as an individual and each worm body as an asexually reproducing population. We took
advantage of the high level of genetic heterogeneity within worm bodies and used temporal
variance of allele frequencies across generations to estimate the effective number of stem cells
(the only dividing cells). From the inferred level of nucleotide diversity and the famous
$\pi = 2N_e\mu$ equation, we calculated a rough estimate of somatic mutation rate. Finally, we calculated
a genomic average value of Tajima’s D based on which we generated interesting hypotheses
regarding genetic divergence of homologous chromosomes, body structure in these worms and
the dynamics of the regeneration process. The biggest advantage of our approach is that it
provides information without any intervention on living worms, so there is no question of
altering the natural physiology of worms because of the experimental wounding, irradiation or
injection. Furthermore, the genetic method of estimating effective population size provides a
measure of number of stem cells inherently normalized for their relative activity and
contribution to body growth, in contrast with molecular markers, some of which label stem
cells regardless of activity and others label committed progenitor cells as well as stem cells.
4.2. Methods
4.2.1. Worm samples and sequencing
Dugesia was collected from a stream in Almese, Italy on September 23, 2009. The planaria
were separated into individual vials and kept in standard rearing conditions and fed beef liver
once a week followed by water exchanges. At every other generation (eight times in total), tail pieces were
frozen for sequencing and head pieces were allowed to grow and divide to further the worm
lineage (Figure 4.1). Genomic libraries were then produced from frozen tail pieces according to
the protocol in (Dunham & Friesen, 2013) and sequenced on Illumina HiSeq.
Figure 4.1. Schematic of Dugesia lineage progression and sampling regime
4.2.2. Pre-processing and SNP calling
Illumina reads were trimmed and adaptors were removed using Trimmomatic (Bolger, Lohse, &
Usadel, 2014) version 0.32 with default options. Since there is no reference genome for our
model species, we used a reference-free SNP calling method called discoSNP (Uricaru et al.,
2015) version 2.2.1. discoSNP was run using the options -b 0 -D 100 -P 1 -k 31 -c auto -C 2^31-1
-d 1. The FASTA file generated by discoSNP was processed using a custom Python script to
generate the allele frequencies that were used for the downstream analysis.
4.2.3. Theoretical background and population genetic analyses
4.2.3.1. Basic theory
Population genetic theory teaches us that the magnitude of genetic drift (temporal variance in
allele frequency) is inversely proportional to population size. The existence of significant drift
can be ascertained by a simple heterogeneity $\chi^2$ test of allele counts across generations
(Waples, 1989b). The theory for estimating effective population size from allele frequency
changes over time was derived decades ago (Pollak, 1983; Waples, 1989a). The main concern at
that time was scarcity of empirical data and the unappealingly large confidence intervals of
estimates based on small sample size or a small number of sequenced loci. Today, however,
genome-scale data provides information on thousands of loci and pooled sequencing offers a
very cost-effective option to estimate collective population level parameters such as allele
frequencies (Gautier et al., 2013). Despite development of more complicated likelihood-based
methods, the classical moment-based approach of Pollak and Waples still works comparatively
well provided that a sufficient number of loci (>20) is sequenced from a large enough sample (>30)
and minor alleles are not particularly rare (Berthier et al., 2002). In the moments method, an F
statistic is calculated which reflects the change in allele frequency across generations. Effective
population size is then estimated based on this F, sample size at the two sampled generations
and the time (in terms of generations) elapsed between them:
$$\hat{F}_k = \frac{(p_i - p_j)^2}{\frac{p_i + p_j}{2} - \left(\frac{p_i + p_j}{2}\right)^2}$$ (equation 1)
$$E(\hat{F}_k) = \frac{1}{S_i} + 1 - \left(1 - \frac{1}{N_e}\right)^t \left(1 - \frac{1}{S_j}\right)$$ (equation 2)
Notation: $p_i$, $p_j$: allele frequency at generations i and j; $S_i$, $S_j$: sample size at generations i and j;
$t$: number of generations between sampling points i and j; $N_e$: effective population size.
Notice that equation 1 is insensitive to the choice of allele at a biallelic locus.
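As an illustration of equations 1 and 2, the following minimal Python sketch computes the single-locus F statistic and inverts equation 2 to obtain the effective population size from a locus-averaged F. The function names (f_statistic, expected_f, estimate_ne) and the treatment of boundary cases are illustrative assumptions; this is not the custom script used in this study.

```python
def f_statistic(p_i: float, p_j: float) -> float:
    """Single-locus F (equation 1): squared frequency change over z(1 - z).

    z is the mean of the two allele frequencies. Loci fixed in both
    samples (z equal to 0 or 1) must be excluded before calling this.
    """
    z = (p_i + p_j) / 2.0
    return (p_i - p_j) ** 2 / (z - z ** 2)


def expected_f(n_e: float, s_i: float, s_j: float, t: float) -> float:
    """Expected F (equation 2) for haploid sample sizes s_i and s_j."""
    return 1.0 / s_i + 1.0 - (1.0 - 1.0 / n_e) ** t * (1.0 - 1.0 / s_j)


def estimate_ne(f_mean: float, s_i: float, s_j: float, t: float) -> float:
    """Solve equation 2 for N_e given the locus-averaged F."""
    # From equation 2: (1 - 1/N_e)^t = (1/s_i + 1 - F) / (1 - 1/s_j)
    ratio = (1.0 / s_i + 1.0 - f_mean) / (1.0 - 1.0 / s_j)
    if ratio >= 1.0:
        # F does not exceed the sampling-noise floor: no detectable drift
        return float("inf")
    if ratio <= 0.0:
        raise ValueError("F is too large to be explained by equation 2")
    return 1.0 / (1.0 - ratio ** (1.0 / t))


# Swapping the two alleles (p -> 1 - p) leaves F unchanged
print(f_statistic(0.3, 0.5), f_statistic(0.7, 0.5))
```

Returning infinity when F is at or below the sampling-noise floor is one possible convention; it marks pairs of samples in which drift cannot be separated from sampling error.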
4.2.3.2. Application to our system
We treat each cell like an individual and the body of a worm like a population. At each sampled
generation, we produce genomic sequences (thousands of polymorphic loci) with variable
coverage across loci. Sequencing coverage in this setting equals the number of (haploid)
chromosomes represented at each position, and is equivalent to sample size for that locus in
the familiar individual sampling scheme. In this sense, our dataset is a hybrid between
individual sequencing data and pooled-seq data. Because PCR duplicates are removed, each
chromosome in the body, if represented at a locus, contributes exactly one read. So there is no
issue of randomly unequal contribution of different individuals as in regular pooled-seq data.
On the other hand, reads at different positions come from different cells and chromosomes in
the worm’s body; a situation that is not observed when a fixed set of individual organisms are
sampled and genotyped at all loci. This is a very appealing feature in one particular sense:
unlike the case of sequencing fixed individuals from a population, the assumption of
independence of loci is factually correct here because reads containing different SNPs originate
independently from different chromosomes.
Since differentiated somatic cells do not divide in planarians, effective population size is
restricted to the fraction of active stem cells in the body. So unlike the classical concept of
effective size, in this case, there is a real biological delineation between effective members and
the rest of the population. Only head pieces are allowed to grow to further the worm line, so
$N_e$ here can be interpreted as the number of active stem cells in the anterior half of the worm.
Conversely, only tail pieces are sequenced. This means that there is no overlap (covariance)
between the sequenced sample and the effective population members which contribute to the
next generation i.e. our experimental setting corresponds to sampling plan 2 as described in
(Waples, 1989a).
For our dataset, we calculated F between all pairs of generations. F was calculated at each locus
and then averaged over loci. The harmonic means of sequencing coverages across loci were
used as corresponding sample sizes. We remove PCR duplicates, so we know that each read
comes from a unique chromosome in the worm’s body. Although t in equation 2 should be
precisely equal to cell generations (remember: cells are individuals in our model), we
performed the calculations as if it were equal to organismal generations. In the Discussion section,
we explain the scenarios in which this assumption is completely accurate, and suggest a simple
correction factor when it is not.
4.3. Results
An average of ~380 million reads were recovered from each of the eight samples. Using the
described reference-free method, 362,871 biallelic SNPs were discovered. Median sequencing
coverage at the polymorphic positions was 22-25X per sample.
4.3.1. Evidence for existence of drift in the sequence data
To make an initial assessment of whether the dataset contains sufficient informative sites
experiencing drift to enable estimation of $N_e$ from temporal variance, 100 SNPs were randomly
chosen. For each SNP, a 2x8 contingency table was constructed tabulating the counts of
reference and alternative alleles across the 8 samples. A significant result in the corresponding
$\chi^2$ heterogeneity test means that sampling error alone is not enough to explain the fluctuations
in the allele counts and a biological process (drift) must be invoked to account for it (Waples,
1989b). Of the 100 positions examined, 54 positions showed significant test results at p<0.05
and 45 showed significant results at FDR<0.05 (Supplementary figure 4.1). This outcome
supports the feasibility of the temporal variance method for estimating the effective population
size of cells.
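The heterogeneity test described above can be sketched in Python as follows. This is a minimal illustration that uses only the standard library: the Pearson statistic is computed for a 2 x k table of allele counts, and the p-value comes from the closed-form chi-square tail for integer degrees of freedom. The function names and the allele counts in the usage lines are made up for illustration and are not data from this study.

```python
import math


def chi2_heterogeneity(ref_counts, alt_counts):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    k = len(ref_counts)
    if k != len(alt_counts) or k < 2:
        raise ValueError("need two rows of equal length with at least 2 columns")
    col_totals = [r + a for r, a in zip(ref_counts, alt_counts)]
    row_totals = (sum(ref_counts), sum(alt_counts))
    grand = sum(col_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs a non-zero total")
    stat = 0.0
    for row_total, row in zip(row_totals, (ref_counts, alt_counts)):
        for col_total, observed in zip(col_totals, row):
            expected = row_total * col_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat, k - 1


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    x = stat / 2.0
    if df % 2 == 0:
        # Even df: exp(-x) * sum_{i < df/2} x^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= x / i
            total += term
        return math.exp(-x) * total
    # Odd df: erfc(sqrt(x)) + exp(-x) * sum_{i < (df-1)/2} x^(i+1/2) / Gamma(i+3/2)
    total = math.erfc(math.sqrt(x))
    for i in range(df // 2):
        total += math.exp(-x) * x ** (i + 0.5) / math.gamma(i + 1.5)
    return total


# Hypothetical allele counts for one SNP across eight samples
ref = [12, 15, 9, 20, 7, 14, 18, 5]
alt = [10, 8, 14, 3, 16, 9, 4, 19]
stat, df = chi2_heterogeneity(ref, alt)
print(stat, df, chi2_sf(stat, df))
```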
4.3.2. Effective population size of stem cells
Both likelihood methods and moment-based methods yield reasonably accurate estimates
based on information from tens of loci. To ensure the accuracy of our estimate, we subsampled
10000 polymorphic loci from the dataset. We removed positions with sequencing coverage
outside the range 10-40X in any of the eight sampled generations. Simulations demonstrated
that extremely rare alleles yield biased estimates (Berthier et al., 2002; Waples, 1989a, 1989b).
Thus, we removed loci with initial allele frequencies <0.1 or >0.9 as well. After applying these
two filters, 3569 loci were retained. Effective population size could be estimated using any pair
of generations. We performed the estimation using all 28 possible pairs (C(8,2)=28). For each
pair of generations, we computed the average F-statistic across all 3569 loci according to
equation 1. $S_i$ and $S_j$ were set to the harmonic means of sequencing coverages across loci at the
earlier and later generations, respectively. Time ($t$) was set to the number of organismal
generations between the two samples in question (the first and second samples were two generations apart,
the first and third samples four generations apart, and so on). Figure 4.2 and supplementary
table 4.1 present the results. Supplementary table 4.1 demonstrates that the estimate of $N_e$
grows almost linearly with the ratio of cell generations to organismal generation; if there are $g$
cell generations in each organismal generation, each estimate of $N_e$ presented here should be
multiplied by $g$. The $N_e$ estimates from generation pairs (1,2), (3,4) and (7,8) are much larger
than all others. Significantly stronger correlation of allele frequencies between these particular
pairs of consecutive generations (data not shown) confirms that this is not a computational
artifact.
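The filtering and pairing steps described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, argument layout and default thresholds are hypothetical and simply mirror the values quoted in the text (coverage 10-40X in every sample, initial allele frequency between 0.1 and 0.9, eight samples taken two organismal generations apart).

```python
from itertools import combinations
from statistics import harmonic_mean


def keep_locus(coverages, initial_freq, cov_range=(10, 40), freq_range=(0.1, 0.9)):
    """Return True if a locus passes both filters described in the text.

    coverages: sequencing coverage of the locus in each sampled generation.
    initial_freq: allele frequency in the first sampled generation.
    """
    cov_ok = all(cov_range[0] <= c <= cov_range[1] for c in coverages)
    freq_ok = freq_range[0] <= initial_freq <= freq_range[1]
    return cov_ok and freq_ok


def sample_size(coverages_across_loci):
    """Harmonic mean of coverage across loci, used as the sample size S."""
    return harmonic_mean(coverages_across_loci)


def generation_pairs(n_samples=8, step=2):
    """Map each pair of samples (1-based) to the generations between them."""
    return {
        (i, j): (j - i) * step
        for i, j in combinations(range(1, n_samples + 1), 2)
    }


pairs = generation_pairs()
print(len(pairs), pairs[(1, 2)], pairs[(1, 8)])
```

The value stored for each pair is the t that enters equation 2 when time is measured in organismal generations.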
Figure 4.2. Estimates of the effective size of the population of stem cells based on pairs of
generations. Numbers on the Y axis correspond to the pairs of samples used for estimation.
Remember that consecutive samples (e.g. 2,3) are two generations apart and the first and last
samples (1,8) are 14 generations apart.
4.3.3. Nucleotide diversity and somatic mutation rate
The reference-free SNP calling method that we used does not provide an estimate of the
proportion of polymorphic sites. Therefore, we first calculated a proxy, $\pi^*$, as the average
nucleotide diversity per polymorphic position. For the purpose of this calculation, we started
from the same 10000 loci subsampled before, but only filtered them for coverage 10-40X
(filtering out extreme allele frequencies would inflate the estimate of nucleotide diversity).
Figure 4.3 depicts the change of this parameter across generations. Remarkably, the biggest
drop in the value of $\pi^*$ (from generation 6 to 7) coincided with the smallest estimate of $N_e$;
however, such agreement was not observed consistently across all generations. Linear
regression analysis revealed a significant but very slight decrease in $\pi^*$ over time
($\beta = -0.00173$, $p = 0.0115$).
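A per-position diversity proxy of this kind could be computed along the following lines. The text does not state which estimator was used, so this Python sketch assumes the common unbiased heterozygosity estimator, n/(n-1) times 2p(1-p), with read coverage n playing the role of sample size; the function names and input layout are hypothetical.

```python
def site_diversity(alt_count, coverage):
    """Unbiased heterozygosity at one biallelic site: n/(n-1) * 2p(1-p).

    Assumption: coverage (number of reads) is used as the sample size n.
    """
    if coverage < 2:
        raise ValueError("coverage must be at least 2")
    p = alt_count / coverage
    return coverage / (coverage - 1) * 2.0 * p * (1.0 - p)


def pi_star(sites):
    """Mean per-site diversity over the given positions.

    sites: iterable of (alt_count, coverage) tuples, one per position.
    """
    values = [site_diversity(alt, cov) for alt, cov in sites]
    return sum(values) / len(values)


# Hypothetical read counts at three positions
print(pi_star([(10, 20), (5, 25), (12, 30)]))
```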
Figure 4.3. Trend of $\pi^*$ across the eight sampled generations
The actual value of $\pi$ could be computed by multiplying $\pi^*$ by the proportion of polymorphic
sites in the genome. We tried to estimate this proportion by dividing the discovered number of
SNPs by the genome size. There is no genome data for D. gonocephala; but the genome size
was estimated to be ~0.9Gb for a congeneric species Dugesia japonica (Nishimura et al., 2015)
and ~0.8Gb for the asexual biotype of the better studied planarian Schmidtea mediterranea
(http://www.ncbi.nlm.nih.gov/assembly/GCA_000572305.1). Using the D. japonica genome size
and the mean of $\pi^*$ across the 8 samples:
$$\pi = \pi^* \times \frac{362871}{0.9 \times 10^{9}} = 0.433 \times 4.032 \times 10^{-4} = 1.746 \times 10^{-4}$$ (equation 3)
The next step is estimating somatic mutation rate. To do this, we first calculate the harmonic
mean of the $N_e$ values obtained from consecutive pairs of generations, and then substitute it into the
$\pi = 2N_e\mu$ equation:
$$N_e = \frac{1}{\frac{1}{7} \sum_{i=1}^{7} \frac{1}{N_{e_{i,i+1}}}} = 19.180$$ (equation 4)
$$\mu = \frac{\pi}{2N_e} = \frac{1.746 \times 10^{-4}}{2 \times 19.18} = 4.552 \times 10^{-6}$$ (equation 5)
If instead of the harmonic mean, we directly use the $N_e$ obtained from sampled generations 1
and 8 (68.3), we will get $\mu = 1.278 \times 10^{-6}$. It is still of the same order of magnitude; however,
the harmonic mean is calculated based on data from eight generations, not two, and is therefore a
more representative steady-state measure in terms of the effect of drift on losing or preserving
genetic variants and a more reasonable choice for use in equation 5.
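The arithmetic of equations 3 to 5 can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the list of seven per-pair estimates is a made-up placeholder, since the individual values are reported in supplementary table 4.1 and are not repeated here.

```python
from statistics import harmonic_mean


def per_site_pi(pi_star, n_snps, genome_size):
    """Equation 3: scale pi* by the proportion of polymorphic sites."""
    return pi_star * n_snps / genome_size


def mutation_rate(pi, n_e):
    """Equation 5: mu = pi / (2 * N_e), from pi = 2 * N_e * mu."""
    return pi / (2.0 * n_e)


# Values quoted in the text
pi = per_site_pi(0.433, 362871, 0.9e9)
print(pi)
print(mutation_rate(1.746e-4, 19.18))  # harmonic-mean N_e
print(mutation_rate(1.746e-4, 68.3))   # N_e from samples 1 and 8

# Hypothetical per-pair N_e values (placeholders, NOT the study's estimates)
example_ne = [12.0, 900.0, 15.0, 300.0, 10.0, 25.0, 600.0]
print(harmonic_mean(example_ne))
```

The harmonic mean is dominated by the smallest values, which is why a few cycles with very few active stem cells can pull the long-term effective size far below the arithmetic average.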
4.3.4. Tajima’s D
Asexual planarians lack recombination, so the whole genome can be treated as a single giant
locus. We calculated Tajima’s D using the basic formula (Tajima, 1989):
$$D = \frac{d}{\sqrt{V(d)}} = \frac{\theta_\pi - \theta_W}{\sqrt{e_1 S + e_2 S(S-1)}} = \frac{\theta_\pi - S/a_1}{\sqrt{e_1 S + e_2 S(S-1)}}$$ (equation 6)
In equation 6, $\theta_\pi$ is nucleotide diversity. Since this is a genome-wide value, we must multiply
the per-position value obtained in equation 3 by genome size:
$$\pi_G = 1.746 \times 10^{-4} \times 0.9 \times 10^{9} = 1.5714 \times 10^{5}$$ (equation 7)
The parameter $\theta_W$ is Watterson’s estimator and depends on the number of polymorphic loci.
As D is supposed to be computed from data observed in one generation, we checked what
proportion of all discovered SNPs are polymorphic at each single generation. This proportion
was 97.4-98.4% (average ~98%); so we corrected the actual number of SNPs by multiplying it by
0.98:
$$S = 362871 \times 0.98 = 355613.58$$ (equation 8)
The $a_1$, $e_1$ and $e_2$ parameters are derived from sample size. Sample size in the dataset from
which $\pi$ was calculated was ~20 (the harmonic mean of coverages).
$$a_1 = 3.54774; \quad e_1 = 0.024396; \quad e_2 = 0.0045084$$ (equation 9)
$$D_G = \frac{1.5714 \times 10^{5} - 355613.58/3.54774}{\sqrt{0.024396(355613.58) + 0.0045084(355613.58)(355612.58)}} = 2.383$$ (equation 10)
This highly positive value of Tajima’s D is consistent with the observed allele frequency
spectrum being highly enriched in intermediate-frequency alleles (Figure 4.4).
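For reference, the constants in equation 9 and the value in equation 10 can be recomputed with the short Python sketch below. It implements the standard definitions of a1, a2, b1, b2, c1, c2, e1 and e2 from Tajima (1989) for a sample of size n; the function names are illustrative and the inputs are the values quoted in equations 7 and 8.

```python
import math


def tajima_constants(n):
    """Return (a1, e1, e2) for sample size n, following Tajima (1989)."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return a1, e1, e2


def tajimas_d(theta_pi, s, n):
    """Equation 6: (theta_pi - S/a1) / sqrt(e1*S + e2*S*(S-1))."""
    a1, e1, e2 = tajima_constants(n)
    return (theta_pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Genome-wide diversity (equation 7), corrected SNP count (equation 8), n ~ 20
print(tajima_constants(20))
print(tajimas_d(1.5714e5, 355613.58, 20))
```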
Figure 4.4. Allele frequency spectrum in three out of the eight sequenced samples. Notice the unusual abundance of
positions with frequencies close to 0.5. The assignment of reference vs. alternative alleles is arbitrary; therefore, although
depicted in the range of 0-1, this is technically a folded spectrum. The five other samples had very similar shapes.
4.4. Discussion
We interpreted the significance of the $\chi^2$ heterogeneity test as evidence of genetic drift
(Supplementary figure 4.1). However, the extra variance in addition to the effect of
sampling error might have been due to other biological factors such as selection. In fact, this test
had been used to infer selection by some authors before Waples pointed out that drift is a
more basic default process in all finite populations that needs to be considered as a source of
this variation (Waples, 1989b). We have two reasons to believe in the effect of drift: first, visual
inspection of allele frequency change patterns in all the 100 loci did not hint at any strong
increasing or decreasing trends; second, our subsequent analyses did confirm small effective
population sizes of cells.
Our model is based on discrete generations, but we know cells do not divide or die
synchronously. There are methods for estimation of effective population size from overlapping
generation models (Jorde & Ryman, 1995; Waples & Yokota, 2007). However, application of
these methods often requires knowledge of age-specific birth and death rates which are not
available for our system. Given long intervals between two points of sampling, discrete
generation models are shown to be adequate for overlapping generation populations but this
does not apply to our sampling scheme of every other generation. The bias in the application of
discrete generation models to overlapping generation populations stems from different allele
frequency profiles of cohorts sampled at specific ages. Our sequencing dataset comes from
whole half-bodies of worms and is presumably not biased towards any specific cell ages.
Another mitigating factor is the high rate of tissue turnover and lack of long-living cells in these
worms, which means that there will not be many residual cells from the previous generations in
the sampled tissues (Rink, 2013).
We have used organismal generation instead of cell generations in our model. In the discrete
generation model, each generation is marked by selection of a number of individuals to be
parents of the next generation. In our system, this translates into activation of stem cells to
contribute to body growth. If the decision as to which stem cells are being activated is taken
once during each reproductive cycle, and is reset before the next cycle, the correspondence of
cell and organismal generations is accurate. If, however, stem cells become activated and
deactivated more or less frequently than that, our estimates will be systematically biased.
Fortunately, the correction is simple. The estimate of $N_e$ can be simply multiplied by $g$ (the
number of cell generations per organismal generation).
The analytical model used to estimate $N_e$ assumes random selection of effective members of
the population to produce the next generation (i.e. no body structure). In our system, this
means random activation of stem cells at each cycle regardless of their position in the body.
However, there might be a mechanism to ensure activation of stem cells uniformly across the
body, or with a probability inversely proportional to their distance from the fission site (Figure
4.5a). This might lead to lower or higher relatedness among the active stem cells compared to
what is expected based on the random model. There are two observations indicating that this
random activation (no body structure) model is reasonably plausible, although they do not
prove that it happens in reality. First, after experimental amputation or natural
fission, and also during the processes of growth and degrowth, the worm body often undergoes
extensive reshaping (known as morphallaxis) (Reddien & Alvarado, 2004). This means that the
body structure may not be preserved from generation to generation, and therefore a random
activation model may be correctly used for the purpose of estimating effective population size.
Second, irradiation-amputation experiments show that stem cells or their
progeny cells can migrate long distances from their original position to the wound site to
contribute to regeneration (Reddien & Alvarado, 2004; Rink, 2013). Thus, the reconstruction of
the body after fission need not be restricted to stem cells adjacent to the fission site. Also, stem
cells can produce the same number of progeny differentiated and stem cells by various
schemes in terms of symmetric and asymmetric divisions (Figure 4.5b). These schemes will have
different effects on allele frequency variance.
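A toy comparison of the three activation schemes of Figure 4.5a may help to show how they could change allele frequency variance. The sketch below rests on assumptions that are not part of our analysis: a one-dimensional body made of clonal patches, a fission site at position 0, and weights of 1/(distance + 1) for the proximity scheme. All names and parameter values are hypothetical.

```python
import random
from statistics import pvariance

def make_body(n_patches, patch_len, rng):
    """1-D body: each clonal patch carries allele 0 or 1 with equal chance."""
    body = []
    for _ in range(n_patches):
        body.extend([rng.randint(0, 1)] * patch_len)
    return body

def activate(body, k, scheme, rng):
    """Return the allele frequency among k activated stem cells."""
    n = len(body)
    if scheme == "random":
        picks = rng.sample(range(n), k)
    elif scheme == "uniform":
        # one cell drawn from each of k equal segments of the body
        seg = n // k
        picks = [i * seg + rng.randrange(seg) for i in range(k)]
    elif scheme == "proximity":
        # fission site assumed at index 0; weight decays with distance
        weights = [1 / (i + 1) for i in range(n)]
        picks = rng.choices(range(n), weights=weights, k=k)
    else:
        raise ValueError(scheme)
    return sum(body[i] for i in picks) / k

def frequency_variance(scheme, reps=2000, n_patches=20, patch_len=50, k=20, seed=7):
    """Variance of the activated-cell allele frequency across replicate bodies."""
    rng = random.Random(seed)
    freqs = [activate(make_body(n_patches, patch_len, rng), k, scheme, rng)
             for _ in range(reps)]
    return pvariance(freqs)
```

In this toy setting, uniform activation samples every patch once and should give the smallest variance, while proximity-based activation concentrates on a few patches and should give the largest, with random activation in between.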
Figure 4.5. Alternative models of stem cell activation and division. a: stem cells contributing to regeneration can be selected
randomly (left), uniformly (center), or based on proximity to the fission site (right). b: Supposing that each stem cell in the
worm body is associated with n (here, 4) somatic cells, and that they are doubled during the growth phase (A), various scenarios
differing in the number and sequence of symmetric and asymmetric divisions can achieve this (e.g. B-D).
Our estimate of the effective number of stem cells is orders of magnitude smaller than the
number suggested by microscopic detection based on molecular markers. It has been suggested
that 20-30% of cells in a worm’s body are stem cells (Baguñà, Saló, & Auladell, 1989; Elliott &
Sánchez Alvarado, 2013; Lobo, Beane, & Levin, 2012). A 1-cm long Dugesia was estimated to
contain approximately 2 × 10⁵ cells (Montgomery & Coward, 1974). This gives an estimate of
40,000-60,000 stem cells, which is much higher than our estimate of 10-900. This suggests that
only a small fraction of stem cells contributes substantially to regeneration. It should also
be noted that we estimated the number of active stem cells in the head piece, which is known
to have fewer stem cells than the tail piece, with the area anterior to the eyes practically devoid of
any stem cells. We also observed large variability in 𝑁ₑ from generation to generation. This is
most likely because fission occurs at different positions along the worm’s body. A fission line
very close to the anterior end would mean fewer stem cells in the head piece and vice versa.
Many cues can instigate the fission of an asexual planarian: population density, light and dark
cycles, temperature and size of the animal (Newmark & Alvarado, 2002). A smaller worm at the
time of fission would contain correspondingly fewer stem cells. It has been shown that
transplanting even a single stem cell into the body of an irradiated worm can fully restore the
regeneration capacity (Wagner et al., 2011). This supports the biological plausibility of our
small 𝑁ₑ estimate.
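The arithmetic behind this comparison can be written out explicitly. The short sketch below only restates the figures quoted above; the variable names are ours, and the final ratios are a derived illustration rather than a reported result.

```python
# Figures quoted in the text
total_cells = 200_000          # cells in a 1-cm Dugesia (Montgomery & Coward, 1974)
stem_percent = (20, 30)        # percentage of cells that are stem cells
ne_range = (10, 900)           # range of our effective-number estimates

# Census number of stem cells implied by the two percentages
stem_cells = tuple(total_cells * pct // 100 for pct in stem_percent)

# Smallest and largest share of census stem cells that would be "effective"
smallest_share = ne_range[0] / stem_cells[1]
largest_share = ne_range[1] / stem_cells[0]

print(stem_cells)
print(f"{smallest_share:.4%} to {largest_share:.2%}")
```

Even at the upper end, the effective number corresponds to only a few percent of the census stem cell count.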
Our estimate of the somatic mutation rate is considerably higher than commonly reported rates, which are on
the order of 10⁻⁹ to 10⁻⁸. Before discussing the significance of this figure we need to examine
the reliability of our estimate of the proportion of polymorphic sites based on the number of
SNPs and the genome size. The discoSNP method we used for SNP calling was shown to have
a recall of 79.2-97.3% and a precision of 92.3-98.8% on simulated datasets (Uricaru et al.,
2015). We used the genome size reported for D. japonica (0.9 Gb). The reference genome of the
better studied Schmidtea mediterranea is of similar size (0.8 Gb). Therefore, we believe that our estimate
of 𝜋 is reliable at least at an order-of-magnitude scale. Remarkably, theoretical work has shown
that in the presence of strong selection among somatic cell lineages, a higher optimal mutation
rate can evolve, because mutator alleles can benefit from the advantageous mutations they
cause while the deleterious mutations are eliminated before they get the chance to be
transmitted to the next generation (Otto & Hastings, 1998; Otto & Orive, 1995). The high
level of somatic heterogeneity, also observed by Nishimura et al. (2015), can provide the basis
for such strong somatic selection.
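The order-of-magnitude argument can be checked with a few lines of code. The sketch below assumes that called SNPs are corrected as called × precision / recall and that the mutation rate follows from the neutral relation 𝜋 = 2𝑁ₑ𝜇 used in this chapter; the function names are hypothetical and no actual SNP count from our data is implied.

```python
def corrected_snp_count(called, precision, recall):
    """Expected number of true SNPs: true positives (called * precision)
    divided by the fraction of true SNPs that are detected (recall)."""
    return called * precision / recall

def mutation_rate(n_snps, genome_size, ne):
    """mu = pi / (2 * Ne), with pi approximated by n_snps / genome_size."""
    pi = n_snps / genome_size
    return pi / (2 * ne)

# Extreme correction factors implied by the reported discoSNP performance
low_factor = corrected_snp_count(1.0, precision=0.923, recall=0.973)
high_factor = corrected_snp_count(1.0, precision=0.988, recall=0.792)

# Effect of using 0.9 Gb rather than 0.8 Gb as the genome size
genome_factor = 0.9 / 0.8
```

Under these assumptions the SNP count would change by roughly -5% to +25%, and the choice of genome size by about 12%, neither of which approaches a tenfold change.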
A positive value of Tajima’s D (~2.4 in our dataset) is indicative of long coalescent branches and
is often interpreted as evidence of balancing selection. On a genomic scale, however, balancing
selection at all or the majority of loci is unlikely. In the case of our dataset, a more plausible
hypothesis is the co-existence and co-inheritance of two or more genetically diverged cell clones,
perhaps due to the existence of specialized stem cell types. A single peak around allele frequency
0.5 is the opposite of the neutral expectation, under which rare alleles predominate, and strongly suggests two
clones. The unusual AFS observed here does not seem to be a computational artifact; discoSNP
is not particularly biased against rare alleles (Uricaru et al., 2015). Based on transcriptome
analysis, the existence of two types of stem cells (called sigma and zeta) has been proposed (Van
Wolfswinkel, Wagner, & Reddien, 2014), although the two classes are thought to
share recent ancestry. Another intriguing possibility is the co-existence of two genetically
diverged sets of homologous chromosomes due to a long history of clonal reproduction
without meiotic recombination. It must be noted, however, that even in clonal organisms, gene
conversion due to repair mechanisms and mitotic recombination can in principle homogenize
the two homologs to some extent.
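To make the link between the shape of the AFS and the sign of Tajima’s D concrete, the sketch below computes D from a site frequency spectrum using the standard formula of Tajima (1989) for a sample of n sequences. It is a textbook illustration only: our data come from pooled reads, so this is not necessarily how D was obtained in this chapter, and the example spectra are hypothetical.

```python
from math import sqrt

def tajimas_d(sfs):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele
    is carried by i of the n sampled sequences (i = 1 .. n - 1).
    """
    n = len(sfs) + 1
    s = sum(sfs)                      # number of segregating sites
    if s == 0:
        raise ValueError("no segregating sites")
    # mean number of pairwise differences
    pi = sum(c * 2 * i * (n - i) / (n * (n - 1))
             for i, c in enumerate(sfs, start=1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical spectra for n = 10 sequences
peak_at_half = [0, 0, 0, 0, 100, 0, 0, 0, 0]    # all sites at frequency 0.5
only_singletons = [100, 0, 0, 0, 0, 0, 0, 0, 0]
```

A spectrum concentrated at frequency 0.5 gives a positive D, whereas an excess of singletons gives a negative D; a spectrum proportional to 1/i gives D close to zero.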
The next step in this project will involve comparing our current analytical estimates with
those from simulation-based methods. Reported advantages of likelihood-based
methods include smaller confidence intervals with few loci and less bias with rare alleles (Berthier et
al., 2002). Neither is a major concern for our data: we used a filtered set of more than 3,500 loci, none of
which had extremely rare alleles (all allele frequencies were between 0.1 and 0.9). The estimate of F was also
robust to the thresholds for allele frequency and
coverage (results not shown; estimates did not vary by more than 10-15% of the average).
There are, however, two major advantages that encourage us to employ simulation-based
approaches. First, it is possible to co-estimate parameters such as effective population size,
mutation rate and AFS shape metrics (e.g. Tajima’s D). Given the
positive genome-wide Tajima’s D, the equation 𝜋 = 2𝑁ₑ𝜇, which is derived under neutral
evolution, may not have been appropriate. Similarly, the equations used to estimate 𝑁ₑ assume
random activation of stem cells during regeneration, while a positive Tajima’s D suggests some
form of structured inheritance of cell clones. The second advantage, which is related to
the first, is that simulations allow testing a much wider variety of scenarios for which
deriving analytical results can prove difficult or impossible (e.g. Figure 4.5a,b). Thus, the next
step in this project will be to apply Approximate Bayesian Computation (ABC) for co-estimation
of the parameters of interest under biologically relevant scenarios. Nevertheless, we believe
that the initial attempt at analytical estimation has been very informative
for the future simulations. Importantly, the analytical estimates of both effective size and
mutation rate have suggested a range of values, very different from what might otherwise have been
assumed, over which testing and optimization should take place; they have also indicated the kind of
biological scenarios that need to
be considered and a set of summary statistics for the ABC acceptance/rejection step.
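The acceptance/rejection step can be sketched in a few lines. The example below is a deliberately minimal illustration, assuming haploid Wright-Fisher drift among stem cells, a uniform prior on the effective number, and a single summary statistic (the mean standardized allele frequency variance, F); a realistic analysis would add mutation, further summary statistics such as Tajima’s D, and the alternative scenarios of Figure 4.5. All names, the prior bounds and the observed value are hypothetical.

```python
import random

def simulate_f(ne, n_gen, n_loci, rng, p0=0.5):
    """Mean standardized allele frequency variance after n_gen rounds of drift."""
    total = 0.0
    for _ in range(n_loci):
        p = p0
        for _ in range(n_gen):
            # haploid Wright-Fisher sampling of ne stem cells
            p = sum(rng.random() < p for _ in range(ne)) / ne
        total += (p - p0) ** 2 / (p0 * (1 - p0))
    return total / n_loci

def abc_rejection(f_obs, n_gen, n_loci, n_sims, tol, prior=(10, 300), seed=3):
    """Keep prior draws of Ne whose simulated F lies within a relative
    tolerance tol of the observed value f_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        ne = rng.randint(*prior)          # draw from a uniform prior
        f_sim = simulate_f(ne, n_gen, n_loci, rng)
        if abs(f_sim - f_obs) <= tol * f_obs:
            accepted.append(ne)
    return accepted
```

For example, `abc_rejection(f_obs=0.04, n_gen=2, n_loci=60, n_sims=200, tol=0.25)` returns the accepted draws, whose distribution approximates the posterior of the effective number under these assumptions.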
The current effort demonstrated the effectiveness of population genetic theory in connecting
cellular-level data to organismal features by modeling cells as individuals and bodies as
populations of cells. We hope that this approach will make population genetics a favored data
analysis tool once single-cell sequencing becomes mainstream practice.
References
Álvarez-Presas, M., Baguñà, J., & Riutort, M. (2008). Molecular phylogeny of land and
freshwater planarians (Tricladida, Platyhelminthes): From freshwater to land and back.
Molecular Phylogenetics and Evolution, 47(2), 555–568.
http://doi.org/10.1016/j.ympev.2008.01.032
Baguñà, J., Saló, E., & Auladell, C. (1989). Regeneration and pattern formation in planarians
III. Evidence that neoblasts are totipotent stem cells and the source of blastema cells.
Development, 107, 77–86.
Berthier, P., Beaumont, M. A., Cornuet, J.-M., & Luikart, G. (2002). Likelihood-based
estimation of the effective population size using
temporal changes in allele frequencies: a genealogical approach. Genetics, 160(2), 741–751.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina
sequence data. Bioinformatics, 30(15), 2114–2120.
http://doi.org/10.1093/bioinformatics/btu170
Dunham, J. P., & Friesen, M. L. (2013). A cost-effective method for high-throughput
construction of Illumina sequencing libraries. Cold Spring Harbor Protocols, 9, 820–834.
http://doi.org/10.1101/pdb.prot074187
Elliott, S. A., & Sánchez Alvarado, A. (2013). The history and enduring contributions of
planarians to the study of animal regeneration. Wiley Interdisciplinary Reviews:
Developmental Biology, 2(3), 301–326. http://doi.org/10.1002/wdev.82
Gautier, M., Foucaud, J., Gharbi, K., Cézard, T., Galan, M., Loiseau, A., … Estoup, A. (2013).
Estimation of population allele frequencies from next-generation sequencing data: pool-
versus individual-based genotyping. Molecular Ecology, 22(14), 3766–3779.
http://doi.org/10.1111/mec.12360
Jorde, P. E., & Ryman, N. (1995). Temporal allele frequency change and estimation of effective
size in populations with overlapping generations. Genetics, 139(2), 1077–1090.
Lázaro, E. M., Sluys, R., Pala, M., Stocchino, G. A., Baguñà, J., & Riutort, M. (2009). Molecular
barcoding and phylogeography of sexual and asexual freshwater planarians of the genus
Dugesia in the Western Mediterranean (Platyhelminthes, Tricladida, Dugesiidae).
Molecular Phylogenetics and Evolution, 52(3), 835–845.
http://doi.org/10.1016/j.ympev.2009.04.022
Lobo, D., Beane, W. S., & Levin, M. (2012). Modeling planarian regeneration: A primer for
reverse-engineering the worm. PLoS Computational Biology, 8(4).
http://doi.org/10.1371/journal.pcbi.1002481
Montgomery, J. R., & Coward, S. T. (1974). On the minimal size of a planarian capable of
regeneration. Transactions of the American Microscopical Society, 93(3), 386–391.
Newmark, P. A., & Alvarado, A. S. (2002). Not your father’s planarian: a classic model enters
the era of functional genomics. Nature Reviews Genetics, 3(3), 210–219.
http://doi.org/10.1038/nrg759
Nishimura, O., Hosoda, K., Kawaguchi, E., Yazawa, S., Hayashi, T., Inoue, T., … Agata, K. (2015).
Unusually large number of mutations in asexually reproducing clonal planarian Dugesia
japonica. PLoS ONE, 10(11), 1–23. http://doi.org/10.1371/journal.pone.0143525
Otto, S. P., & Hastings, I. M. (1998). Mutation and selection within the individual. Genetica,
102–103(1–6), 507–524. http://doi.org/10.1023/A:1017074823337
Otto, S. P., & Orive, M. E. (1995). Evolutionary consequences of mutation and selection within
an individual. Genetics, 141(3), 1173–1187.
Pollak, E. (1983). A New Method for Estimating the Effective Population Size from Allele
Frequency Changes. Genetics, 104(3), 531–548.
Reddien, P. W. (2013). Specialized progenitors and regeneration. Development, 140(5), 951–957.
http://doi.org/10.1242/dev.080499
Reddien, P. W., & Alvarado, A. S. (2004). Fundamentals of Planarian Regeneration. Annual
Review of Cell and Developmental Biology, 20(1), 725–757.
http://doi.org/10.1146/annurev.cellbio.20.010403.095114
Rink, J. C. (2013). Stem cell systems and regeneration in planaria. Development Genes and
Evolution, 223(1-2), 67–84. http://doi.org/10.1007/s00427-012-0426-4
Szathmáry, E., & Maynard Smith, J. (1995). The major evolutionary transitions. Nature,
374(6519), 227–232. http://doi.org/10.1038/374227a0
Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA
polymorphism. Genetics, 123(3), 585–595. Retrieved from
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1203831&tool=pmcentrez&rendertype=abstract
Uricaru, R., Rizk, G., Lacroix, V., Quillery, E., Plantard, O., Chikhi, R., … Peterlongo, P. (2015).
Reference-free detection of isolated SNPs. Nucleic Acids Research, 43(2), e11.
http://doi.org/10.1093/nar/gku1187
Van Wolfswinkel, J. C., Wagner, D. E., & Reddien, P. W. (2014). Single-cell analysis reveals
functionally distinct classes within the planarian stem cell compartment. Cell Stem Cell,
15(3), 326–339. http://doi.org/10.1016/j.stem.2014.06.007
Wagner, D. E., Ho, J. J., & Reddien, P. W. (2012). Genetic regulators of a pluripotent adult stem
cell system in planarians identified by RNAi and clonal analysis. Cell Stem Cell, 10(3), 299–
311. http://doi.org/10.1016/j.stem.2012.01.016
Wagner, D. E., Wang, I. E., & Reddien, P. W. (2011). Clonogenic Neoblasts Are Pluripotent Adult
Stem Cells That Underlie Planarian Regeneration. Science, 332(6031), 811–816.
http://doi.org/10.1126/science.1203983
Waples, R. S. (1989a). A generalized approach for estimating effective population size from
temporal changes in allele frequency. Genetics, 121(2), 379–391. Retrieved from
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1203625&tool=pmcentrez&rendertype=abstract
Waples, R. S. (1989b). Temporal variation in allele frequencies: testing the right hypothesis.
Evolution, 43(6), 1236–1251.
Waples, R. S., & Yokota, M. (2007). Temporal estimates of effective population size in species
with overlapping generations. Genetics, 175(1), 219–233.
http://doi.org/10.1534/genetics.106.065300
Supplementary figure 4.1: Evidence for drift in the data
[Figure panels not reproduced in this transcript: distribution of p-values and FDRs for the 𝜒² heterogeneity test at 100 loci; temporal fluctuation of allele frequencies at randomly chosen loci.]
Supplementary table 4.1 is available in .xlsx format in the electronic version of this dissertation.
Abstract
I have focused my research during PhD training on finding novel statistical methods to analyze -omic data based on population and evolutionary genetic theory. I participated in multiple projects where I designed and executed the main data analysis tasks, while field work, lab work and initial data pre-processing (e.g. QC and read mapping to call SNPs or calculate raw gene expression values) were carried out by other members of the team. This dissertation is organized into four chapters, each corresponding to one of my projects. The findings of two of the research projects, as well as a review article, have already been published. A manuscript presenting the findings of the last project has been submitted.
Chapter 1. Transcriptomic analysis of male-feminizing Wolbachia infection in the leafhopper Zyginidia pullula. Through principal component analysis of the transcriptomes of the four groups (infected and uninfected chromosomal males and females), we were able to show that the transcriptomes clustered not according to sexual phenotype (3 phenotypic females vs. 1 phenotypic male), but according to infection status (2 uninfected samples vs. 2 infected samples). This meant that the effect of Wolbachia infection at the transcriptomic level was widespread and by no means limited to sexual development.
Chapter 2. Evolutionary genomics of the house mosquito Culex pipiens: Population structure and recent adaptive evolution. Using some established methods and a number of novel ones, we showed that population structure and the genomic pattern of natural selection in Culex mosquitoes depend more on geographical location than on ecology (urban vs. suburban) or biotype (molestus vs. pipiens). We showed that numerous genes and pathways were undergoing adaptive evolution, including those conferring insecticide resistance and cold resistance.
Chapter 3. Toward high resolution population genomics of archaeological samples. We reviewed the biochemical peculiarities of ancient DNA (degradation, fragmentation, chemical modification, and contamination) and the experimental and analytic methods tailored to extracting genetic information from it.
Chapter 4. Inferring the effective number of stem cells, somatic mutation rate and cellular dynamics of regeneration in asexual Planaria. We applied population genetic theory to estimate the effective size of the stem cell population during regeneration cycles of Planaria. We also estimated the somatic mutation rate and found indications suggesting the presence of two genetically diverged stem cell types.
Asset Metadata
Creator: Asgharian, Hosseinali (author)
Core Title: Evolutionary genomic analysis in heterogeneous populations of non-model and model organisms
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Molecular Biology
Publication Date: 09/28/2016
Defense Date: 07/26/2016
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: adaptation, ancient DNA, Culex pipiens, histones, male feminization, OAI-PMH Harvest, planaria, population structure, principal component analysis, Regeneration, selective sweeps, somatic mutation rate, stem cell, transcriptome, Wolbachia
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Nuzhdin, Sergey (committee chair), Dean, Matthew (committee member), Ehrenreich, Ian (committee member), Marjoram, Paul (committee member), Siegmund, Kim (committee member)
Creator Email: asgharia@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-307974
Unique identifier: UC11279475
Identifier: etd-AsgharianH-4823.pdf (filename), usctheses-c40-307974 (legacy record id)
Legacy Identifier: etd-AsgharianH-4823.pdf
Dmrecord: 307974
Document Type: Dissertation
Rights: Asgharian, Hosseinali
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA