Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Applications of next generation sequencing in sessile marine invertabrates
(USC Thesis Other)
Applications of next generation sequencing in sessile marine invertabrates
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
APPLICATIONS OF NEXT GENERATION
SEQUENCING IN SESSILE MARINE INVERTABRATES
by
Tevfik Hamdi Kitapci
A dissertation presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In partial fulfillment of the
Requirements for the degree
Doctor of Philosophy
(Molecular Biology)
Molecular and Computational Biology section
August 2018
Committee in charge:
Prof. Sergey Nuzhdin
Prof. Dennis Hedgecock
Prof. Matthew D. Dean
Prof. Ian Ehrenreich
Copyright 2018 Tevfik Hamdi Kitapci
2
Acknowledgements
I would like to thank to my PhD advisor Dr. Sergey Nuzhdin for supporting me always and
allowing me to continue my PhD journey when I struggled the most, and allowing me to complete
and succeed in this very challenging odyssey.
I would like to thank to the co-author and the PI of the third and fourth chapter of my thesis:
Dr. Daniel Campo, for countless hours of discussion, helping me to be a better scientist and for
helping me learn how to write academic papers.
I would like to thank to all the co-authors for the second chapter of my thesis: Dr.
Colleen Burge, Dr. Collin Closek, Dr. Carolyn Friedman, and to the PI of the project Dr. Dennis
Hedgecock
I would like to thank to my committee members Dr. Matthew D. Dean and Dr. Ian
Ehrenreich
I would like to thank to my funding sources: California Sea Grant, Wrigley Marine Science
Center and USC Teaching Assistantship. I want to thanks to REU program and undergrads who
helped with the sample collection for my research.
I want to thank to USC Center for High-Performance Computing for providing
computational resources. Some of the computational work was performed using the George
Washington High-Performance Computing cluster, and I want to thank Keith Crandall for granting
me access to it.
Last but not least, I want to thank to all my friends especially my friends from ITA, Chiara,
Elena, Muye, Carol and Ibrahim who became my family away from home, Nuzhdin lab members
especially Wendy and Peter for all their help with scientific discussions and their friendship and
my family who were always with me
3
LIST OF TABLES AND FIGURES
Table 2.1 vQTL genotype frequencies, genotype dependent mortality and selection models for
family 12x52. Haplotype fitness are calculated using MLE approach. These peaks are shown
on Figure 2.6, upper panel. .................................................................................................. 32
Table 2.2 vQTL genotype frequencies, genotype dependent mortality and selection models for
family 58x19. Haplotype fitness are calculated using MLE approach. These peaks are shown
on Figure 2.6 upper panel. ................................................................................................... 32
Table 3.1 Gene ontology terms from genes correlated with with tide .......................................... 55
Table 3.2 Negative Association with tide ..................................................................................... 57
Table 3.3 Gene ontology terms from genes upregulated during day time .................................... 58
Table 3.4 Gene ontologies from genes upregulated at night time ................................................ 60
Table 4.1 Summary of assembly statistics .................................................................................... 83
Table 4.2 Pairwise genome-wide average FST calculations ........................................................ 83
Table 4.3 Ratio test for each population. For each population P, sum of FST values from pairwise
comparisons that include P is divided into the sum of FST values from pairwise comparisons
that does not include P. The resulting fold difference shows that pairwise comparisons
including Catalina Island is ~58% higher than others. ........................................................ 84
Figure 2.1 Mortality curve for two families exposed to OsHV-1-caused mortality in Tomales Bay,
CA. Red rectangle shows the samples used in the downstream analysis in this study. Samples
collected on 10/31/2015 are kept in the freezer for future studies ...................................... 25
Figure 2.2 Viral load estimation for each family pre- and post-mortality samples. Pseudo count of
1 was added to each sample to handle zero viral loads. ...................................................... 26
4
Figure 2.3 QTL results for viral load family 12x52. 1 peak at LG 10 is found to be statistically
significant (indicated by black arrow) based on permutation test with LOD score 4.1 and p-
value=0.03. .......................................................................................................................... 27
Figure 2.4 Misassembly status of each scaffold in the C. gigas genome assembly. Each blue dot
represents a misassembled scaffold. Each red dot represents a correct scaffold. Logistic
regression on the missassembly status revealed that both scaffold length and marker count
are statistically significant ................................................................................................... 29
Figure 2.5 viability QTL peaks a) Family 12x52. b) Family 58x19. Horizontal line shows the
statistical significance at α = 0.05 level. Black arrows point to the highest peak in each LG
which is distorted in the post-exposure samples but not distorted in the pre-exposure
samples. ............................................................................................................................... 30
Figure 2.6 viability QTL based on the goodness-of-fit test with pre-exposure samples as expected
distribution and post-exposure sample as observed distribution. Horizontal line is the α =0.05
significance level after Bonferroni multiple testing correction. Black arrows points to vQTL
peaks in each LG. Genotype frequencies corresponding to each peak and the exact location
of the peak for family 12x52 is given on Table 2.1 and for family 58x19 is given on Table
2.2 ........................................................................................................................................ 31
Figure 3.1 Sampling scheme (A) Average principal component 1 for each time point. PC1 explains
34% of the variation. Error bars shows the standard deviations (B) Tide height (feet) for each
time points. Horizontal dashed line shows the sampling location. Samples below this line
are covered with water. Samples above this line are exposed. Vertical line marks the sunset
at 20:07. PC1 oscillates following the tide height. Genes correlated and anti-correlated with
PC1 are analyzed. ................................................................................................................ 51
5
Figure 3.2 Daytime vs Nighttime differential expression analysis. Horizontal red line shows the
statistical significance level α = 0.05. Horizontal lines show the fold change thresholds -1 on
the left and 1 on the right. 20 genes are upregulated at night and 8 genes are upregulated at
day time. .............................................................................................................................. 52
Figure 3.3 DE Exposed vs Covered Horizontal red line shows the statistical significance level α =
0.05. Horizontal lines show the fold change thresholds -1 on the left and 1 on the right. 1
gene is upregulated in the exposed group and 3 genes are upregulated in the covered group
............................................................................................................................................. 53
Figure 3.4 Positive Association with tide 30 GO Terms detected GO terms are shown in yellow
............................................................................................................................................. 56
Figure 3.5 Negative Association with tide for four, GO Terms detected (yellow). ..................... 58
Figure 3.6 Upregulated at Day 6 GO Terms detected GO terms are shown in yellow ................ 59
Figure 3.7 Upregulated at Night 12 GO Terms detected GO terms are shown in yellow ............ 61
Figure 4.1 Principal component analysis (PCA) of minor allele frequencies in the four populations
analyzed in this work. a) PC1 and PC2, and b) PC1 and PC3. ........................................... 81
Figure 4.2 Boxplots of FST distribution for all pairwise comparison among the four populations.
............................................................................................................................................. 81
Figure 4.3 FST values across all scaffold in the assembly. The red line indicates 0.1% quantile. a)
Pairwise comparisons involving Catalina Island b) Pairwise comparisons without Catalina
Island ................................................................................................................................... 82
6
Table of Contents
Chapter 1 General Introduction ...................................................................................................... 7
Chapter 2 Mapping genes conferring resistance to OsHV-1 in the Pacific Oyster Crassostrea gigas
............................................................................................................................................. 12
2.1 Introduction ......................................................................................................................... 13
2.2 Materials and Methods ........................................................................................................ 16
2.3 Results ................................................................................................................................. 25
2.4 Discussion ........................................................................................................................... 34
Chapter 3 Investigating circatidal and circadian clocks in the intertidal habitat of Pollicipes
polymerus using whole transcriptome sequencing .............................................................. 38
3.1 Introduction ......................................................................................................................... 39
3.2 Materials and Methods ........................................................................................................ 42
3.3 Results ................................................................................................................................. 49
3.4 Discussion ........................................................................................................................... 61
Chapter 4 NGS to the rescue: the power of whole genome approaches to detect fine-scale
population structure ............................................................................................................. 66
4.1 Introduction ......................................................................................................................... 67
4.2 Materials and Methods ........................................................................................................ 70
4.3 Results ................................................................................................................................. 74
4.4 Discussion ........................................................................................................................... 75
Chapter 5 General Conclusion ...................................................................................................... 85
References ..................................................................................................................................... 87
7
Chapter 1 General Introduction
Seventy percent of Earth’s surface are oceans and 99% of the Earth’s living space is
provided by the oceans. Besides, oceans provide the highest amount of protein for human
consumption although it is being “farmed” mostly from wild animals. Domestication of marine
life is almost not existent. Domestication of wild species improves yield and efficiency of farming.
(Turcotte, Araki, Karp, Poveda, & Whitehead, 2017). This is a contrast to land-based agriculture
production, which is mainly based of domesticated products. Ocean based farming, also called
aquaculture, has great potential to feed the world. More precisely, aquaculture is defined as the
breeding, rearing, and harvesting of all types of animals and plants in any type of water
environments. Although aquaculture is a very old practice, up until the 1970’s the amount of sea
food produced by aquaculture was approximately the same as the amount of sea food produced by
wild capture. After the 1990’s, production from wild capture stayed the same, while aquaculture
production increased steadily every year up until 2016 (FAO 2018) and is predicted to continue to
grow (FAO, 2016). The main reason for this is that we are already harvesting the wild as much as
we can and any further harvesting will result in the decay of natural environment, which provides
the food. To prevent this, worldwide fisheries management activities are in place. Given that
harvesting the wild habitats is not a sustainable way to increase food production, aquaculture is a
great solution for this problem.
There are different types of aquaculture, such as finfish aquaculture or shellfish
aquaculture. Finfish aquaculture is an effective way of producing a high amount of fish without
depleting the natural fisheries but it has many disadvantages compared to farming sessile
invertebrates such as shellfish or barnacles. Farming of sessile invertebrates such as mussels,
8
oysters or barnacles has many advantages compared to finfish aquaculture, among which are
animal density and feeding requirements. First, these animals naturally occur at high densities in
their natural habitat. Having a high animal density is a requirement for any farming practice. Other
species, such as fish, need to be confined in a small space, which is very different than their natural
habitat (Fraser, Weary, Pajor, & Milligan, 1997). Sessile invertebrates can often be kept at high
density on farms, which are in the natural habitat. Secondly, fish require a very high amount of
feeding material, which can be a problem both economically and environmentally (Davis, 2015).
Sessile invertebrates, such as shellfish or barnacles, are filter feeders and can feed on the available
algae and zooplanktons (FAO, 2016). The amount of waste produced by finfish aquaculture like
salmon can have serious environmental impact (Hargrave, 2010).
One problem that arises when animals are kept in high densities is the spread of disease.
Viral and bacterial infections can easily transmit and kill crops. These types of diseases are
manifested based on the interaction between the host and disease agent, which has three
components: environment, timing and genetic background. The environmental conditions need to
be such that the animal is susceptible and the disease agent is active. Disease agent needs to attack
the host in the right time of the year and at the right stage in the host’s development. Beside the
genetic background of the host animal is a very crucial factor in the resistance of the host against
the disease agent. It has been shown that selective breeding can improve resistance against
bacterial and viral infections (Houston & Houston, 2017; Kjøglum, Henryon, Aasmundstad, &
Korsgaard, 2008; Yáñez, Houston, & Newman, 2014). One method to tackle this problem is by
understanding genetic basis of disease causing agents. In this thesis, the genetic basis of OsHV-1
viral resistance in Pacific oyster, Crassostrea gigas, is investigated. Pacific oyster is a very
important sea food and it is one of the most produced shellfish species. This virus attacks oysters
9
early in their development and can cause huge mortalities, up to 100% (Roque et al., 2012; Segarra
et al., 2010). This is a major problem in the oyster industry. The second chapter of this thesis is
about using a next generation genotyping method, called genotyping-by-sequencing, to dissect the
genetic basis for resistance to the OsHV-1 virus. This will allow development of biomarker
assisted selection to improve the aquaculture operations.
In the third chapter, the issue of environmental signals that effect the internal clocks will
be investigated. To study internal clocks, based on lunar and solar cycles, we focus on the
Gooseneck barnacle, Pollicipes polyemerus, which is a rocky intertidal arthropod. This is an
extreme habitat with constantly changing environment due to tides. This species is also an
important food source with big fisheries in Canada. A sister species (Pollicipes pollicipes), which
is morphologically similar to P. polymerus, is the focus of large fisheries in Spain and Portugal.
Although this species has the same important properties mentioned above for oysters (animal
density and feeding requirements) to be used in aquaculture, there isn’t any successful aquaculture
operation for this species. This is mainly because no one has been able to keep P. polymerus alive
in an aquaculture setting. We hypothesize that this is mainly due to the peculiar characteristics of
the intertidal habitat. P. polymerus is an exclusive species to the intertidal and it is never found
sub-tidally. It is always found in the exposed areas with very strong wave action. Those conditions
need to be simulated in order to achieve aquaculture activities for this species. This chapter will
investigate how barnacles are adapted in this habitat through regulation of their gene expressions.
The fourth chapter is about the investigation of population structure of P. polymerus and
generating a de-novo genome assembly for this species. Availability of a genome assembly is
crucial to study population genomics or to implement genomics based selection strategies.
10
All chapters of this thesis utilize NGS (next generation sequencing) to solve problems in
ecologically and commercially valuable marine sessile invertebrates. The second chapter tackles
the problem of genetic basis of disease resistance. Previous studies investigated the same problem
using SNP arrays or microsatellites (Gutierrez et al., 2018; Sauvage et al., 2010; Segarra et al.,
2010). With the rapid cost reduction in NGS, there is likely to be a shift to this approach in this
field (Park & Kim, 2016). There is no need for the construction of SNP arrays, which can be costly
for commercial application, for NGS. Although there is a need for a good quality reference genome
to utilize NGS for problems such as understanding the genetic basis of disease resistance. The
study organism in the second chapter of this thesis, Crassostrea gigas, already have a decent
quality genome (Zhang et al., 2012), but it is still far from being a chromosomal-level genome
assembly (Hedgecock et al 2015). Because of this, we constructed a linkage map from our families
to be able to position our markers on 10 linkage groups. Although the absence of an accurate
genome assembly, is a limitation for our study, as the cost of NGS declines, our approach is likely
to be more suitable for commercial application. Besides, being able to perform genomic analysis
without relying on pre-constructed SNP arrays allows for detection of previously unknows variants
in natural populations. In the third chapter, we are investigating the internal clocks of Pollicipes
polymerus, using whole-transcriptome sequencing. Previous studies relied on microarrays for
transcriptomic studies (Gracey et al., 2008). These microarrays need to be designed specifically
for each study organism. In this chapter, we investigate the whole transcriptome without relying
on any previous genomics information. We generated a de-novo transcriptome assembly from the
samples and annotated the transcriptome by using public databases. The constructed transcriptome
was used to investigate gene expression patterns to understand the internal clock of P. polymerus.
The pipeline developed in this chapter can be easily extended to study other non-model animal
11
systems that lacks genomic resources. The fourth chapter investigates the population structure of
P. polymerus. Here we use a pooled, whole-genome sequencing approach. As mentioned, there
are no genomic resources for this species, so we constructed a de-novo genome assembly and then
used pooled, whole-genome sequencing to investigate the population structure along the coast of
Southern California. This chapter generated a valuable resource for this species, which is a draft
genome assembly.
12
Chapter 2 Mapping genes conferring resistance to OsHV-1 in the Pacific Oyster
Crassostrea gigas
Abstract
Understanding the genetic basis of resistance to ostreid herpes virus 1 (OsHV-1) by the
Pacific oyster is important for the global oyster industry. The virulent Var forms of OsHV-1 have
caused devastating losses of farmed Pacific oysters in Western Europe, Australia, New Zealand,
and Asia, and OsHV-1 µVar is expected eventually to arrive on the U.S. West Coast. Since 1993,
OsHV-1 has caused mass mortalities of seed and juvenile Pacific oysters in Tomales Bay, CA,
with losses ranging from 50% to 100% during the warmest summer months. This project,
therefore, was designed to generate high-density linkage maps to detect QTL (quantitative trait
loci) for resistance to OsHV-1 infection or mortality. Two full-sib families (12 52 and 58 19)
from four unrelated parents were produced and planted in Tomales Bay before OsHV-1-caused
mass mortalities in late September 2015. Over a two-week period, family 1252 experienced 55%,
and family 58 19, 16% mortality. Reduced representational genomic libraries were constructed
for pre- and post-exposure samples (268 individuals from 1252 and 290 individuals from 58 19)
as well as the parents of the two families. Individual libraries were bar-coded and sequenced in
pools, using genotyping-by-synthesis methods. Changes in genotypic frequencies particular to the
exposure period suggest that seven viability QTLs in families 58 19 and 12 52, are associated
with resistance to OsHV-1 and survival. Although the current state of the assembly of the Pacific
oyster genome impedes identification of candidate genes, these findings strongly suggest that
marker-assisted selection could be used to select oyster familyies resistant to OsHV-1 infection or
mortality.
13
2.1 Introduction
Disease susceptibility is an important problem in wild animals but especially in hatchery
and aquaculture settings. Bacterial and viral infections are a huge factor for the success of shellfish
farming operations (Blaylock & Bullard, 2014; FAO, 2016; Guo & Ford, 2016; Lafferty et al.,
2015; Morash & Alter, 2016; Pernet, Lupo, Bacher, & Whittington, 2016; Thitamadee et al., 2016)
For example, summer mortality is a very critical issue for the oyster industry (Dégremont, 2013).
Summer mortality is an event in which a large proportion of oyster seed dies during the summer
months. This is a complex phenomenon caused by multiple factors (Samain, 2011), but OsHV-1
infection is regarded as the main component for this detrimental effect, which causes enormous
problems for the oyster production (Friedman et al., 2005). France, Ireland, England, Australia,
New Zealand, and Tasmania have had mass mortalities from this virus.
There are different strains of the virus OsHV-1. A very virulent and high mortality causing variant
of OsHV-1, designated OSHV-1 µVar has been reported since 2008, first in France and then in
the other parts of the world (Roque et al., 2012). This version of the virus can kill entire crops in
the farms by attacking and killing oysters in any life stage (larvae, juvenile or adults) and it is a
serious problem for the aquaculture industry all around the globe. This variant of the virus can be
found on the NCBI database under the ascension number HQ842610. OSHV-1 µVar is
characterized by a 12 bp deletion in the ORF 4 of its genome compared to the original viral strain
which can be found on the NCBI database under the ascension number AY509253. This original
strain is less virulent although it can be present in adults does not causes mortality. Effected larvae
shows reduced swimming and feeding behavior and effected spat, defined as any oyster younger
than 1 years old, shows sudden and high mortality in the summer months (Garcia et al., 2011). In
this study we focus on the original, less virulent, viral strain of OsHV-1 which is, so far, only
14
reported on the west coast of USA, Tomales bay. The Tomales bay viral strain has been sequenced
and it is used in this study as the reference sequence for the virus (Collen Burge, personal
communication unpublished data). There are two main reasons to study the less virulent form in
this study. First this is the version detected in Tomales bay and it is a matter of time to start
spreading to other areas in the USA Pacific Coast. Second reason is the fact that animals selected
for OsHV-1 are shown to be resistant to OsHV-1 µVar as well (Dégremont, 2011). Viral infection
by OsHV-1 coincides with the summer mortality period but it is much more severe and it is
considered to be very important problem (Friedman et al., 2005). Since summer mortality and
OsHV-1 caused mortality events are confounded it is important to measure the viral load in the
samples to make sure the mortality is caused by OsHV-1. In this study we estimated the viral loads
of samples using two methods as described later in this chapter.
OsHV-1 virus is a DNA virus in the family Herpesviridae which infects mainly Pacific
oyster but it is also reported to infect other bivalve species (Davison et al., 2005). OsHV-1 is found
in the environment mainly in the summer months, and it quickly kills the animals. Transcriptomic
response to OsHV-1 infection involves apoptosis and the interferon-pathways in the early stages
of the infection (Green, Vergnes, Montagnani, & de Lorgeril, 2016) Different methods are
investigated to prevent oysters from mortality events. Because oysters do not have an immune
memory and they are farmed in an open environment, vaccination or disinfection methods are
ineffective (Dégremont, Garcia, & Allen, 2015). Fortunately, large additive genetic variation with
no genotype by environment effect is detected for resistance to OsHV-1 (Dégremont, Lamy, Pépin,
Travers, & Renault, 2015; Kube et al., 2018). This is an encouraging result for selective breeding
and for seeking biomarkers for genetic resistance. Recently some potential resistance markers for
OsHV-1 have been detected by GWAS (Gutierrez et al., 2018).
15
In this project, we aim to understand the genetic basis of resistance to OsHV-1 in Pacific
oyster by mapping genes responsible for resistance against this virus. We adopted a proven
quantitative trait loci (QTL) mapping technique (Plough & Hedgecock, 2011; Plough, Shin, &
Hedgecock, 2016). Traditionally, QTL mapping is based on association of genetic markers and a
phenotype. Genetic markers can be derived from microsatellites or more recently from Single
nucleotide polymorphisms (SNP). Thanks to improvements in next generation DNA sequencing
technology (NGS) in recent years, it has become feasible to sequence whole genomes and generate
thousands of SNP markers to be used for QTL analysis. There are three inputs for QTL analysis:
Genetic markers, a genetic map (linkage map) and individual phenotypes. The logarithm of the
odds ratio or LOD score is calculated between each genetic marker and the phenotype based on
the genetic map structure. The final aim of QTL mapping, or any association mapping, is to locate
the causal loci for a certain phenotype. This is not always possible since a QTL can be detected in
a nearby locus in linkage. In that case, nearby genomic locations should be investigated to find the
causal loci. Sometimes a phenotype might be impossible to observe, especially in a field
experiment, such as viability since dead animals cannot be sampled. This is the case in our study,
in which we are investigating the mortality caused due to OsHV-1 infection. In this case, we can
infer the QTL by looking at the genotype frequencies. In mapping experiments, distortion from
Mendelian ratios are expected for markers associated with viability loci. (Luo & Xu, 2003). This
approach allows mapping viability QTLs without an actual observed phenotype (Plough &
Hedgecock, 2011; Plough et al., 2016). Besides, if samples are taken prior to exposure to field
mortality, then genotype frequencies in pre-exposure samples can be used as the null model,
instead of Mendelian ratios, to compare with genotype frequencies in post-exposure samples.
16
We generated two random mated full-sib families from pairwise crosses of unrelated parents. The
parents are chosen from a population (Pipestem Inlet, Vancouver, BC), from which families
showing good survival to summer mortality derived. Usually QTL experiments are performed with
F2 or backcross families but it is possible to conduct QTL mapping with random-bred full-sib
individuals, as long as the trait in question segregates within each family.
There are many reduced genome sequencing techniques, such as genotyping by sequencing
(GBS; Elshire et al., 2011), restriction-site associated DNA sequencing (RAD; Baird et al., 2008),
RNA-seq or exome sequencing (Teer & Mullikin, 2010). The first two techniques digest the
genomic DNA with a restriction enzyme to reduce the complexity in order to sequence a small
portion of the genome (approximately 1-4%). RNA-seq and exome sequencing in practice allows
the same type of reduction in the amount of genomic region sequenced but they target exclusively
protein coding regions. Since GBS and RAD sequencing digest a random region of the genome it
allows sequencing both protein coding and non-coding regions. For this study, we choose GBS
sequencing to be able to multiplex a large pool of individuals in a single Illumina lane by
sequencing a small portion of the genome for each sample, in a cost-efficient manner. We
hypothesize that viability QTLs detected based on the difference in genotype ratios before and
after exposure to the OsHV-1 will be responsible for viral resistance and can be used for marker-
assisted selection to improve crop resilience against this disease agent.
2.2 Materials and Methods
Family generation and field trials
Pacific oyster adults (broodstock) were retrieved from Thorndyke Bay, Washington
(donated by Baywater Shellfish Company). The broodstock came from full-sib families derived in
17
2010 from controlled pair-crosses of wild-caught Pipestem Inlet, Vancouver Island, B.C. parents
(families 12, 19, 52, and 58). Ten broodstock from each family were selected, cleaned, and place
in temperature regulated tanks for conditions at the Taylor Shellfish Hatchery in Quilcene, WA.
The oysters were brought in from ambient seawater, ~10° C, and the temperature of their holding
was gradually increased to 23 C, over a period of six weeks. Once broodstock were conditioned
and their gametes fully ripened, the oysters were brought to the Kenneth K. Chew (KKC) research
hatchery in Manchester, Washington, to make crosses. Each adult was sexed and gametes were
rated for quality. The highest quality females came from families 52 and 19. The resulting high-
quality eggs were crossed with sperm from males of families 12 and 58, thus resulting in the
crosses 12×52 and 58×19. Larvae were reared at the KKC hatchery, using standard protocols
(Helm, 2004), until day 18 post fertilization, when they were shipped to the University of
California, Davis Bodega Marine Laboratory for remote setting. Roughly 82,000 12×52 spat and
49,000 58×19 spat, a total of 131,000, were settled and placed in a flow-through table system and
allowed to grow to >4760 µm before planting them in Tomales Bay, California (38°08'01.0"N
122°53'39.0"W). Replicate bags of each family line (n=800 per bag) were outplanted on 8/27/15.
A pre-plant (or pre-exposure) sample of 100 oysters per family was taken prior to outplant, in order
to account for the high genotype-dependent mortality that occurs during larval stages and
metamorphosis (Launey & Hedgecock, 2001; Plough & Hedgecock, 2011). Oyster survival was
estimated (on 9/11/15, 9/22/15, and 10/31/15) by counting 100 individuals in each bag, and
removing all empty shells or gaping oysters using standard methods (Burge et al. 2006, 2007).
Samples (n=100 per bag) were collected on 9/22/15 and 10/31/15. Samples were flash frozen in
liquid nitrogen in situ and transferred directly to a -80 C freezer at BML. Samples were shipped
first to the Institute of Marine and Environmental Technology, in Baltimore, MD on dry ice before
18
transfer (also on dry ice) to the University of Southern California. Samples collected on 9/22/2015
are used for the downstream analysis described in this chapter. Samples collected on 10/31/2015
are kept in the -80 C freezer for future studies.
DNA extraction
Each frozen animal was transferred on dry ice to avoid thawing of the samples. Each
animal’s height was measured then each animal carefully and quickly dissected under a dissection
microscope and the animal was transferred into the DNA extraction buffer with Proteinase K.
Proteinase K digestion was performed overnight at 55 °C. Zymo Research Quick-DNA Universal
96 kit (Catalog No: D4071) was used to extract the DNA using manufacturer’s guidelines. DNA
quality was assessed by running the samples on a 1% agarose gel and the purity of the DNA was
evaluated with NanoDrop™ 2000/2000c Spectrophotometer to make sure a A260/A280 ratio bigger
than 1.8. The quality of the DNA was measured with Quant-iT™ PicoGreen™ dsDNA Assay Kit
(Catalog No: P7589). DNA was dried using Savant™ SpeedVac™ High Capacity Concentrator in
96 well plates. Required amount of ultra-pure water was added to each sample to bring them to a
concentration of 10 ng/µl, samples were left at room temperature overnight to allow enough time
for DNA to dissolve. Since the required input for GBS library construction was 100 ng of DNA
10 µl of sample was taken to be used in the next step.
Genotyping by sequencing library construction
Construction of genotyping by sequencing (GBS) libraries was based on the protocol
described in (Elshire et al., 2011). Multiple restriction enzymes were evaluated in a previous study
both using molecular techniques: restriction digestion followed by agarose gel and bioanalyzer
19
analysis and computational techniques, using in-silico digestion based on the cut site of restriction
enzymes (A. Arias-Perez, personal communication). ApoI restriction enzyme from NEB (Catalog
No: R0566L) is used in this protocol. In the final step of the protocol AGENCOURT® AMPURE®
beads were used for cleaning and size selection with a 0.7× beads to sample ratio. Barcoded
samples were pooled following the original protocol and each 96 sample is sequenced on two lanes
of Illumina HiSeq 2500.
Viral load estimation
Viral load is estimated by mapping the GBS reads from Illumina sequencing to each sample
with BWA (H. Li & Durbin, 2009) against the genome sequence of OsHV-1 reference strain
(Collen Burge, personal communication) . Viral load estimation is done using the formula
# 𝑜𝑓 𝑟𝑒𝑎𝑑𝑠 𝑚𝑎𝑝𝑝𝑒𝑑 𝑡𝑜 𝑂𝑠𝐻𝑉 − 1 𝑔𝑒𝑛𝑜𝑚𝑒 𝑇𝑜𝑡𝑎𝑙 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑅𝑒𝑎𝑑𝑠 𝑖𝑛 𝑡 ℎ𝑒 𝑙𝑖𝑏𝑟𝑎𝑟𝑦
𝑥 10
6
The number of reads mapped to the viral genome is divided by the total number of reads
in the library to normalize the read counts by the total library size to make samples comparable to
each other.
Viral load is also estimated using the qPCR protocol described by Burge & Friedman (2012; C.
Burge, personal communication).
Quality assessment of Illumina reads and debarcoding
Sequencing adapters and low quality reads are trimmed with Trimmomatic (Bolger, Lohse,
& Usadel, 2014), using the default settings. The quality of reads are checked with FastQC
(Andrews, 2012). Reads are de-barcoded using a custom Python script.
20
Linkage Map Construction
A separate linkage map is constructed for each family. Trimmed reads are mapped using
BWA (H. Li & Durbin, 2009) to the Pacific Oyster genome assembly (G. Zhang et al., 2012),
using the default parameters. Samtools version 1.5 (H. Li et al., 2009) is used to generate a pileup
file using “mpileup” command with parameters “-q 10 -Q 10 -s”. Genotype likelihoods are
estimated from this mpileup file using the script provided by LepMap3 (Rastas, 2017), and a
linkage map is constructed using the same software. First “ParentCall2” module is called with the
parameter “removeNonInformative=1”. Then, the “Filtering2” module is called with default
parameters. Then “SeparateChromosomes2” module is called to assign markers in linkage groups
(LG). The “lodLimit” parameter of this module is crucial in the map construction. This parameter
is optimized with two criteria: 1) ~90 % of markers are on the first 10 LG 2) Markers are distributed
close to uniformly on these 10 LG. After the optimization for family 12×52 lodLimit is set to 20
and for family 58×19 lodLimit is set to 40. Then “JoinSingles2All” module is called with default
parameters. Lastly “OrderMarkers2” module is called to order the markers within each LG with
parameters “outputPhasedData=1 sexAveraged=1”. In this method only positions that are
heterozygous in both parents (ab x cd) are considered. This generates four possible genotypes for
the progeny ac, ad, bc and bd. Linkage groups obtained in both families are mapped to the
consensus map as described in the next section.
Assignment of Linkage groups to the consensus map
A consensus linkage map is published for Pacific oyster previously (Hedgecock, Shin,
Gracey, Den Berg, & Samanta, 2015; Hubert, Cognard, & Hedgecock, 2009; Hubert &
Hedgecock, 2004). Linkage groups we obtained are mapped on the linkage groups of the consensus
21
map. The mapping is done based on the scaffolds the marker is found using the following
algorithm. “LG” denote the original linkage group assigned by LepMap3 in this study and
“Compendium_LG” denote the linkage group defined in the consensus map
for each LG (1 to 10):
for each Marker:
Look up the genomic coordinates (scaffold name and
position) of the marker
Find on which Compendium_LG this marker is mapped
if (at least more than half of the markers within
this LG maps to the same Compendium_LG)
LG <- Compendium_LG
Detection of misassembled scaffolds in the Oyster genome
Errors in the Pacific oyster genome scaffold assembly are reported in a previous study.
(Hedgecock et al., 2015; Xiaoshen et al. unpublished) After assigning the linkage groups in our
study to the LGs defined in Compendium_LG, we included markers generated in this study to all
the markers that constitutes the Compendium_LG and generated a list of 10583 markers (we call
this the Compendium list). We define a scaffold as “misassembled” if two or more markers within
a scaffold map to two or more different LGs. Then we fit a logistic regression to the “misassembly”
status as
𝑀𝑆 ~𝑀𝐶 + 𝐿 + 𝑀𝐶 𝑥 𝐿
where
22
MS: Is the misassembly status; it is one if the scaffold is “misassembled” and zero if it is
“correct”
MC: Marker count, number of markers on the scaffold
L: The length of the scaffold
MC x L: Interaction term
The logistic regression is implemented in R using the glm() function with the parameter
family = binomial.
Viability QTL analysis
When molecular markers in mapping experiments show distortion from expected
Mendelian ratios, these markers are likely to be associated with some viability loci (L. Luo & Xu,
2003). Two methods are used to investigate the distortion of Mendelian ratios in pre-exposure and
post-exposure samples. First R/QTL package (Broman, Wu, Sen, & Churchill, 2003) an R (Core
Team, 2017) package is used. In order to test the deviation from Mendelian segregation ratios,
geno.table() is used. Secondly, in order to verify the results from the R/QTL, PROC QTL (Hu &
Xu, 2009) a SAS procedure that implements the Maximum likelihood method described in (L. Luo
& Xu, 2003) is used. If a marker shows significant distortion from Mendelian ratios in the post-
exposure sample, when it does not show a significant distortion in the pre-exposure sample, then
this marker is labelled as a potential viability QTL. Since the pre-exposure samples gives the
expected genotype frequencies in the absence of the OsHV-1 infection next, we performed a
goodness-of-fit test, using pre-exposure genotype frequencies at each marker as the expected
frequencies and the post-exposure genotype frequencies as the observed frequencies. Goodness-
of-fit test is implemented using R chisq.test() function, and p-values are corrected using Bonferroni
23
multiple testing correction. Markers showing statistically significant difference are marked as
potential vQTLs. The resulting vQTL peaks from the goodness-of-fit test were used in the
downstream analysis described in the next section.
Selection models for vQTLs
Fitness of each viability QTL for OsHV-1 resistance is estimated using the following model
𝑓 𝑜 = 𝑓 𝑒 (𝜃 ) + 𝜀
where
fo is the observed genotype frequency in the post-exposure (after selection) sample
fe (θ) is the expected genotype frequency after selection
ε ~N (0, σ
2
I) is the error term which is normally distributed with mean 0 and standard
deviation σ
2
I
Since we are working on a cross of the form AB x CD in this experiment, four progeny
genotypes are possible AC, AD, BC and BD. The parameter θ is a vector composed of 4 fitness
coefficients for each haplotype. Fi (i ∈ A, B, C, D) denotes the fitness of haplotype i for OsHV-1
(or similarly si =1-Fi is the selection against haplotype i). Multiplicative (independent) interaction
of haplotype finesses are assumed for this model such that the fitness parameter of genoype i,j is
calculated as FiFj (where i,j ∈ A, B, C, D). Frequencies of genotypes before selection and after
selection are observed and used as variables in the model (8 observations with 6 degrees of
freedom) Expected genotype frequencies after selection (fe) are calculated based on observed
genotype frequencies before selection and the fitness of each haplotype using the following
formula:
𝑓 𝑒 (𝜃 ) = 𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑 𝑔𝑒𝑛𝑜𝑡𝑦𝑝𝑒 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑏𝑒𝑓𝑜𝑟𝑒 𝑠𝑒𝑙𝑒𝑐𝑡𝑖𝑜𝑛 𝑥
𝐹 𝑖 𝐹 𝑗 ∑ 𝐹 𝑖 𝐹 𝑗 𝑖 ,𝑗
24
We use a maximum likelihood estimation to fit the best model. Probability of the observed
genotype frequency for a set of fitness parameters can be written as
𝑃 (𝑓 𝑜 , 𝜃 ) =
1
√2πσ
2
𝑒 −
(𝑓 𝑜 −𝑓 𝑒 (𝜃 ))
2
2σ
2
Parameter θ is vector that is composed of the four haplotype fitness parameters described
above.
Taking the natural log of both sides gives
ln (𝑃 (𝑓 𝑜 , 𝜃 )) = −
(𝑓 𝑜 − 𝑓 𝑒 (𝜃 ))
2
σ
2
1
√2πσ
2
Finding the maximum of the above likelihood which respect to the parameter θ is
equivalent to finding the minimum of |fo- fe(θ)| so this is the Maximum likelihood estimation
(MLE) for the parameter θ. We used L-BFGS-B numerical optimization method to find the fitness
parameters (Byrdt, Lut, Nocedalt, & Zhu, 1995). Parameters are constrained to the range [0,1] and
initial value for optimization is set to 1 for all parameters. The distance |fe(θ)-fo| is calculated using
Euclidian distance. Optimization algorithm is implemented in R using the optim() function.
Genotype dependent survival
Genotype dependent survival for each vQTL is calculated based on the genotype fitness
calculated in the previous section. The average fitness of four genotypes is assigned as the
genotype dependent survival for each vQTL.
25
2.3 Results
Field trials and mortality estimates
Two families showed differential mortality in the field. Family 12×52 had 55% mortality
and family 58×19 experienced 16% mortality (Figure 2.1).
Figure 2.1 Mortality curve for two families exposed to OsHV-
1-caused mortality in Tomales Bay, CA. Red rectangle shows the
samples used in the downstream analysis in this study. Samples
collected on 10/31/2015 are kept in the freezer for future studies
The presence of OsHV-1 is detected in both families and the average viral load is correlated with
the mortality experienced in each family.
26
Viral load estimation
Pre-exposure samples for both families showed viral load values close to zero and
significantly lower than those in post-exposure samples (p-value < 9.6 × 10
-12
), as expected. The
difference between two pre-exposure samples is marginally significant (p-value=0.051) but it has
a small effect size (-0.28) based on Cohen d effect size test. In the post-exposure samples, family
12×52 showed statistically significantly higher viral load than 58x19 (p-value < 2.2 ×10
-16
based
on Wilcoxon rank test). Cohen d effect size test between two post-exposure samples gave a large
effect size (0.8), similarly d estimate between pre-exposure and post-exposure for families 12×52
and 58×19 were also large (-1.07 ) and medium (-0.67). Figure 2.2 shows the log10 OsHV-1 copy
number for pre-exposure and post-exposure samples. The results are consistent with the fact that
12×52 experienced higher mortality in the field than 58×19
Figure 2.2 Viral load estimation for each family pre- and
post-mortality samples. Pseudo count of 1 was added to each sample
to handle zero viral loads.
27
The estimate of viral load based on GBS did not produce any significant QTL but
estimation based on qPCR yielded one significant peak on LG 10 (p-value=0.03) (Figure 2.3).
Statistical significance is measured based on a permutation test with 1000 iterations.
Figure 2.3 QTL results for viral load family 12x52. 1 peak at
LG 10 is found to be statistically significant (indicated by black
arrow) based on permutation test with LOD score 4.1 and p-
value=0.03.
Genotyping by sequencing
Two runs of sequencing are performed. After the first sequencing run, we had some low
coverage samples (less than 5×), to increase the coverage for these cases we pooled these samples
and did a second round of sequencing. The Pacific oyster genome is estimated to be 637Mb by
flow cytometry (Zhang et al., 2012) and GBS is expected to sequence 4 % of the genome. Based
28
on this assumption, the four parents (12, 52, 58, and 19) are sequenced to a coverage of
76×,31×,33×, and 58×, respectively. Pre-exposure samples from families 12×52 and 58×19 have
an average coverage of 15× and 19×, respectively and post-exposure samples from families 12×52
and 58×19 have an average coverage of 29× and 21×, respectively.
Linkage Map
Two linkage maps (LM) are constructed, one for each family. After optimizing the number
of markers distributed in the first 10 LG family 12×52’s LM had 1397 markers from 604 scaffolds
which totals to a size of 250Mb. Family 58×19 LM had 2247 markers from 1005 scaffolds which
total to a size of 365Mb. The number of scaffolds represented by markers in the LM for family
12×52 and 58×19 are 7% and 13% of the total number of scaffolds in the Pacific oyster genome
assembly, respectively and constitutes 39% and 57% of the total genome size respectively,
Misassembled scaffolds in the Oyster genome
In the compendium list, 10583 markers encompass 1628 scaffolds, of which 599 scaffolds
(36%) show evidence of being misassembled. Based on the logistic regression of misassembly
status, MC (Marker Count) (p-value < 1.73 × 10
-9
), L (scaffold length) (p-value < 2 × 10
-16
) and
MC x L (p-value <0.0182) interaction are found to be significant. Figure 2.4 shows the status of
misassembly of each scaffold present in the genome assembly which respect to the scaffold length
and the amount of GBS markers present on that scaffold. Every scaffold larger than ~750.000
basepairs are misassembled. This is a severe issue regarding the quality of the genome assembly
and it needs to be fixed to be able to reliably get functional annotation of candidate QTLs detected
in this study.
29
Figure 2.4 Misassembly status of each scaffold in the C. gigas
genome assembly. Each blue dot represents a misassembled scaffold.
Each red dot represents a correct scaffold. Logistic regression on
the missassembly status revealed that both scaffold length and
marker count are statistically significant
Viability QTL (vQTL) analysis
First, we investigate the distortion of the genotype frequencies from the expected
Mendelian ratios in pre-exposure and post-exposure samples. For family 12×52 we found five
markers showing Mendelian genotypic ratios in the pre-exposure sample and significantly
distorted markers in the post-exposure sample (LGs 1, 3, 5, 6,10 (Figure 2.5a)). For family 58×19
we found four markers showing Mendelian genotypic ratios in pre-exposure samples and
significantly distorted markers in the post-exposure sample (LGs 1,4,6,8 (Figure 2.5b)).
30
Figure 2.5 viability QTL peaks a) Family 12x52. b) Family
58x19. Horizontal line shows the statistical significance at α =
0.05 level. Black arrows point to the highest peak in each LG which
is distorted in the post-exposure samples but not distorted in the
pre-exposure samples.
Surprisingly, we also observed markers showing the opposite trend, significantly distorted
in the pre-exposure samples but with Mendelian genotypic ratios in the post-exposure sample. To
encompass both type of changes in genotype frequency distributions, we performed a goodness-
of-fit test and detected QTL markers on LGs 1,2,3,4,7,8,10 for family 12×52 and on LGs
1,3,4,6,7,8,10 for family 58×19 (Figure 2.6). In the next section, we fit the selection model for
each QTL peak detected in the goodness-of-fit analysis to understand how selection acts in these
markers to generate resistance.
31
Figure 2.6 viability QTL based on the goodness-of-fit test
with pre-exposure samples as expected distribution and post-
exposure sample as observed distribution. Horizontal line is the
α =0.05 significance level after Bonferroni multiple testing
correction. Black arrows points to vQTL peaks in each LG. Genotype
frequencies corresponding to each peak and the exact location of
the peak for family 12x52 is given on Table 2.1 and for family
58x19 is given on Table 2.2
Selection models for vQTLs
We fit a selection model based on MLE to understand haplotype fitness at each vQTL. For
family 12×52, two out of seven vQTL had selection against one haplotype, while the remaining
five vQTL had two haplotypes that were affected (Table 2.1). For family 58×19, all seven vQTL
had selection against two haplotypes (Table 2.2). In this model multiplicative (independent) effect
of each haplotype on genotype fitness is assumed.
32
Table 2.1 vQTL genotype frequencies, genotype dependent
mortality and selection models for family 12x52. Haplotype fitness
are calculated using MLE approach. These peaks are shown on Figure
2.6, upper panel.
Table 2.2 vQTL genotype frequencies, genotype dependent
mortality and selection models for family 58x19. Haplotype fitness
Haplotye(s) selected againt
LG pos(cM) AC BC AD BD A B C D AC BC AD BD Genotype dependent viability
Post-exposure 0.3 0.2 0.3 0.2
Pre-exposure 0.2 0.3 0.2 0.3
Post-exposure 0.3 0.3 0.2 0.2
Pre-exposure 0.2 0.3 0.3 0.3
Post-exposure 0.3 0.3 0.2 0.2
Pre-exposure 0.4 0.1 0.2 0.2
Post-exposure 0.2 0.2 0.2 0.3
Pre-exposure 0.3 0.3 0.3 0.1
Post-exposure 0.3 0.2 0.2 0.3
Pre-exposure 0.2 0.3 0.3 0.3
Post-exposure 0.4 0.2 0.2 0.2
Pre-exposure 0.1 0.3 0.2 0.4
Post-exposure 0.3 0.3 0.2 0.3
Pre-exposure 0.3 0.2 0.3 0.1
0.683969419
0.593232442
0.678852486
0.568814359
0.682509868
0.48812908 0.2
0.4 0.4 1 0.9
1
1 0.6 0.7 0.4
0.3
0.5 0.4 1 0.8
B and D
C
1 1 0.4
1 0.5 0.5
0.3 0.5 0.5
1 0.5 0.3
B and D
B and C
A and C
B and D
0.3
10 86.15 1 0.9 0.4 1
8 0 1 0.5 1
1
7 72.24 1 0.6 1 0.7
4 103.07 0.5 1 0.5
0.5
3 28.81 1 0.8 0.5 1
2 0 1 0.5 1
Genotype Frequencies Haplotype Fitness
1 50.2 1 1 1 0.4
Genotype Fitness
D 0.4 0.69882005
33
are calculated using MLE approach. These peaks are shown on Figure
2.6 upper panel.
Genotype dependent mortality
Genotype dependent viability is calculated from the genotype fitness. For family 12×52,
the viability of individual vQTLs range between 0.48 and 0.69, and for family 58×19, the viability
of individual vQTLs ranges between 0.45 and 0.76. Assuming each vQTL contributes to viability
independently than the combined effect of all vQTLs for each family can be calculated as
1 − ∏ 𝑆 𝑖
where Si is the genotype dependent viability of ith vQTL.
Then, families 12×52 and 58×19 have combined genotype dependent mortalities of 0.96 and 0.98,
respectively.
Haplotye(s) selected againt
LG pos(cM) AC BC AD BD A B C D AC BC AD BD Genotype dependent viability
Post-exposure 0.2 0.2 0.2 0.4
Pre-exposure 0.5 0.2 0.2 0.1
Post-exposure 0.2 0.2 0.3 0.2
Pre-exposure 0.3 0.1 0.2 0.4
Post-exposure 0.2 0.2 0.3 0.4
Pre-exposure 0.3 0.3 0.3 0.1
Post-exposure 0.3 0.2 0.3 0.3
Pre-exposure 0.1 0.3 0.2 0.4
Post-exposure 0.3 0.2 0.2 0.3
Pre-exposure 0.2 0.3 0.3 0.2
Post-exposure 0.2 0.3 0.3 0.2
Pre-exposure 0.3 0.3 0.1 0.3
Post-exposure 0.2 0.3 0.3 0.2
Pre-exposure 0.4 0.2 0.2 0.2
0.4 0.7 0.5 1
0.455641389
0.650466033
0.464745537
0.578863645
0.505396572
0.76477699
0.63248599
1
0.7 1 0.6 0.8
A and C
0.8 1 0.4 0.4
0.1 0.4 0.3 1
1 0.6 0.4 0.3
0.2 0.4 0.4
A and D
A and C
B and D
A and C
A and D
0.1 0.4 0.3 1 A and C
Genotype Frequencies Haplotype Fitness
1 69.69 0.3 1 0.4 1
0.8 1 1 0.4 3 35.95
0.4 1 0.4
1
6 76.95 1 0.6 1 0.4
4 51.64 0.3 1 0.4
Genotype Fitness
1 10 56.72 0.5 1 0.7
1
8 76.12 0.7 1 1 0.8
7 66.6
34
2.4 Discussion
In this project, we generated linkage maps for two families from sequencing data, in order
to detect QTLs responsible for resistance to OsHV-1. We detected one QTL for viral load and
seven vQTLs, by comparing the genotype frequencies in pre-exposure samples to the post-
exposure samples. We fitted a selection model to each vQTL using an MLE approach and
discovered selection against one or two haplotypes in each vQTL. Finally, we estimated the
genotype dependent viability at each vQTL and the combined effect of these vQTLs on the
viability of two families.
Viral load estimation
Viral load estimation showed as it can be seen on Figure 2.2 for family 12×52 pre-exposure
samples on average had ~40 times less viral load then post-exposure samples and family 58×19
pre-exposure samples had on average ~10 times less viral load then post-exposure samples. The
viral load in family 12×52 is statistically significantly higher than family 58×19. This is consistent
with the fact that family 12×52 and 58×19 experienced 55% and 16% mortality respectively. High
amount of viral load detected in the post-exposure samples clearly indicates that the mortality was
caused mainly due to OsHV-1 infection (Collen Burge, personal communication). We performed
viral load estimation, using two methods, first based on the read counts from sequencing data and
then based on qPCR. The distribution of viral loads in two families are very similar in both methods
but the QTL results are different. GBS based estimation did not yield any significant QTL whereas
qPCR based estimation yielded one significant QTL for the viral load.
35
QTLs for resistance
It has been shown that mass selection can be effective in reducing OsHV-1 caused
mortality significantly in previous studies. In one study four generations of mass selection is shown
to improve viral resistance against OsHV-1 up to 62% (Dégremont, Nourry, & Maurouard, 2015).
Another study found that resistance to viral caused summer mortality is under strong genetic
control and give rapid response to selection, with up to 10% gain in survival in each annual
selection cycle and the heritability (h
2
) is estimated to be 0.8 (Kube et al., 2018). These might
indicate that OsHV-1 resistance genes are found quite frequently in natural populations. Most QTL
studies employs an F2 family with inbred parents or a backcross design. If the resistance genes for
OsHV-1 caused mortality are present in high frequency in the natural populations it is possible to
use random-bred full-sib families (an family constructed from unrelated wild-caught parents), such
as those used in our study. Besides, since the parents used in our experimental design are not
inbred, this allow us to investigate the variation present in the natural populations. A downside of
our approach is that the high complexity of the genome structure of the parents makes it more
difficult, determining the exact causal loci for the QTL is difficult (Bohra, 2013). We estimated
genetic mortalities attributed to each QTL (Table 2.1 and Table 2.2). under the assumption of an
independent effect of each QTL, the combined genetic mortality from all QTLs can be calculated
by multiplying the genetic mortality of each QTL. This calculation yielded a higher mortality than
field mortality estimates, as reported in the results section, for both families. This is likely an
indication that QTL interact in their effect on mortality. The interactions of QTLs should be
investigated in order to get a reliable estimate of genetic mortality. Recently (Gutierrez et al.,
2018) reported QTLs based on a GWAS analysis from a laboratory challenge experiment. They
reported significant QTLs for survival and viral load on LG 6 and 8 respectively. Their survival
36
QTL is located on the same LG with one of the vQTLs detected in our family 58x19. It is important
to note that Gutierrez et al. (2018) used a SNP array for their QTL analysis. The use of array has
some advantages such as being able to call the exact same SNPs in every animal investigated. We
used reduced next generation sequencing approach using the GBS method, the main advantage of
our method is that we do not need a predetermined SNP array. Construction of SNP array might
be expensive and it must be constructed for each animal system independently. With the rapid
reduction of the NGS costs this approach has the potential to be beneficial for the oyster industry
and it allows to investigate previously unknown variations present in the natural populations.
Another advantage of NGS is that the pipeline implemented in this study can be applied to other
animal systems without the need to construct a SNP array or any other molecular tool. In another
study based on microsatellite markers (Sauvage et al., 2010), QTLs for survival and viral load
were reported on LGs 6 and 7 at locations 63 cM and 92 cM respectively. In family 58×19, we
found a survival QTL on LG 6 at location 76 cM. In families 12×52 and 58×19, we found a survival
QTL on LG 7 at locations 72 cM and 66 cM in our study. Given that these QTLs on LGs 6 and
7are detected in relatively close locations in two independent studies and QTL on LG 7 is found
in both of our families enhances the validity of these markers. One marker on LG 10 is detected to
be statistically significant in both families for survival (positions 86.15 cM and 56.72 cM for 12×52
and 58×19 respectively) and it is detected for viral load in family 12×52 (position 54.55 cM). This
makes this marker a strong candidate for resistance to OsHV-1
Functional annotation of QTLs
The errors in the Pacific oyster genome assembly reported previously (Hedgecock et al.,
2015) and confirmed in this study impede our ability to functionally annotate QTLs. We used a
37
reduced genome sequencing (GBS) in our study. Although it is possible that a QTL peak we
detected is causal for the viability, it is also possible that the QTL is near a causal marker that we
did not sample. In the ideal case, with the presence of a chromosomal level assembly with no
assembly errors, it is possible to perform a window search around the QTLs to look for potential
causal loci. Given the high degree of assembly errors at the scaffold-level, this procedure is not
possible unless a better genome assembly is available. As a future direction we are working
towards improving the quality of the oyster genome assembly by combining all the linkage map
information we have (Compendium list). This will allow us to functionally annotate our candidate
QTLs for OsHV-1 resistance.
38
Chapter 3 Investigating circatidal and circadian clocks in the intertidal habitat of Pollicipes
polymerus using whole transcriptome sequencing
Abstract
The North American gooseneck barnacle Pollicipes polymerus is a sessile marine
invertebrate that lives in the rocky intertidal habitat. As tides move up and down, this habitat
constantly and drastically changes throughout a day. During high tide, barnacles can feed and are
protected from direct sunlight. During low tide, as they cannot get nutrients, are subject to
desiccation, and exposed to direct sunlight and UV, which can potentially damage tissues. Since
barnacles cannot move, most likely response they can give to their ever-changing environment is
through the regulation of their gene expression in order to adapt to their habitat. Pollicipes
polyemerus is always found in the part of the rocky intertidal that is exposed to large and strong
waves, and they are not found below low-tide level. Their exclusive nature between the mid-
intertidal and low intertidal zone make them a suitable taxon to study adaptation to constantly
changing environments. In this study, we collected time series data from the field following a tide
cycle. Intertidal animals are known to have two types of internal clocks, a tidal clock and solar
clock. We collected our data for 24 hours to include times of high/low tide and day/night. Using
whole transcriptome sequencing of the cirri we generated a working hypothesis on how the gene
expression of P. polymerus is regulated in this habitat.
39
3.1 Introduction
Most animals are governed by an internal clock called circadian rhythm. This usually
follows the 24-hour cycle of the sun. This is a very important regulator for many organisms.
Circadian rhythm was studied for the first time by (Konopka & Benzer, 1971), they studied fruit
flies. They discovered three mutants with significantly altered 24-hour clocks associated with a
gene on the X chromosome. Since then huge number of studies are done on various aspects of the
circadian clock. Most of the groundbreaking work and advancement in the field of circadian clocks
happened in the studies of Drosophila melanogaster, also known as fruit fly. Zehring et al. (1984)
discovered that a mutation in the per locus in D. melanogaster is associated with the disruption of
circadian rhythms. Another study found that the same mutation can also have affect an ultradian
clock (a cycle that completed in less than 24 hours) (Hardin, Hall, & Rosbash, 1990). Confounding
effect of multiple biological clocks is a challenging scientific problem. It has been shown that there
is a high degree of sequence preservation among the biological clock related genes in multiple
organisms such as drosophila, mouse and human (Allada, White, So, Hall, & Rosbash, 1998). In
addition to the circadian rhythm, animals in the intertidal habitat also have an internal clock based
on the movement of the moon, which causes tides to rise and fall, this clock is called the circatidal
clock.. The relationship of this clock to the circadian clock and whether the lunar based clock is
an independent clock or it is dependent on the circadian clock is an ongoing debate in the field
(Cheeseman, Fewster, & Walker, 2017; Wilcockson & Zhang, 2008; Zhang et al., 2013). An in-
depth understanding of circadian and circatidal clocks and their effect on intertidal species’
capability to adapt and evolve in this habitat is crucial in understanding how these species are
adapted to their environment.
40
Phenotypic plasticity is the ability of individual genotypes to express diverse phenotypes,
by altering for example morphology or physiology, in response to changes in environmental
conditions (Strand & Weisner 2004). Phenotypic variability is commonly observed for quantitative
traits in organisms inhabiting heterogeneous environments (Via & Lande 1985), like the intertidal.
Part of that variability is determined exclusively by the environment (phenotypic plasticity), part
is strictly genetic (additive genetic variation) and part is due to the interaction between
environment and genotype. Since different populations of a species may experience distinct
environments, or distinct patterns of environmental variability, if gene flow is restricted, they will
develop local differentiation, which can lead to differences in the range of different phenotypes
that a given genotype can produce across environments, which is also called reaction norm, in each
population (DeWitt & Scheiner 2004). On the contrary, environmental variation can also lead to
fixed local genetic adaptation, through selection of the most suitable genotypes.
Barnacles (Cirripedia; Thoracica) are crustaceans that have greatly modified their body
structure to adapt to a sessile life. Unlike all the other Crustacea, they live permanently and
irreversibly attached to the substrate during adult life. However, they are very plastic species and
can be found in many different habitats (Anderson 1994). The gooseneck barnacle Pollicipes
polymerus is a sessile arthropod that inhabits intertidal zone of the rocky shores of the west coast
of North America, from British Columbia (Canada) to Punta Abreojos (Baja California). As a
sessile invertebrate, barnacles are filter feeders. They extend their cirri out to feed when the
conditions are favorable (high water flow caused by incoming tide and big waves. Some barnacle
species use an active feeding method (constantly moving their cirri to catch food), P. polymerus
mostly employs a passive feeding strategy, by relying on the strong wave action (Barnes & Reese,
1959). Despite being hermaphroditic, they reproduce via cross-mating with internal fertilization,
41
though sperm casting has been recently described in this species (Barazandeh, Davis, Neufeld,
Coltman, & Palmer, 2013; Barazandeh & Palmer, 2015). After a free-swimming larval stage, it
settles permanently on a hard substrate. It is found exclusively above the low tide level and never
in the subtidal zone (Barnes & Reese, 1960). There are other sessile invertebrates that share the
same habitat with P. polymerus. For example, the abundant and well-studied mussel Mytilus
californianus. Unlike P. polymerus, M. californianus is present both in the intertidal zone and in
the subtidal zone up to 24 meters in depth (Morris, Abbott, & Haderlie, 1981). This exclusive
nature of P. polymerus in the intertidal zone, separates this species from the rest and makes it a
good organism to study internal clocks based on lunar cycle (circatidal clock) in this habitat since
they are always subject to the changing environments between low and high tide. In addition P.
polymerus has been shown to perform better aerial respiration compared to aquatic respiration
(Petersen, Fyhn, & Johansen, 1974). During high tide they are covered with water, so they can get
nutrients and they are protected from UV light, during low tide they are exposed to the open air,
direct sunlight and they have no access to nutrients for up to 612 hours.
In this study we wanted to investigate the circatidal and circadian clocks of P. polymerus
by examining the gene expression changes in a 24-hour window. We extracted RNA from the
cirrus tissue. This is the organ used for feeding by by P. polymerus. Gene expression profile of gill
tissue in California ribbed mussel, Mytilus californianus is shown to follow an oscillating pattern
associated with circatidal cycle (Gracey et al., 2008). Given that barnacle cirrus is analogous to
mussel gill in terms of function we decided to focus on this tissue. Ideally a longer sampling
interval should be choosen to be able to observe multiple day/night cycles and more high/low tide
cycles. This is a pilot project performed with very limited funds. Simulating the conditions in
which P. polymerus lives is a challenging and very expensive endeavor. Our study focused entirely
42
on field sampling. Although we did not have any control samples, we collected triplicates sample
at every time point to be able to make comparison between time points by taking into account the
natural variation among the samples. We collected animals following a complete two complete
tidal cycle and one solar cycle and generated whole transcriptome profile to dissect the gene
expression patterns following these clocks. This project generated a proof of concept for the
understanding of regulation of gene expression in the intertidal habitat for P. polymerus and
generated the first de-novo transcriptome assembly for this species.
3.2 Materials and Methods
Barnacle collection and RNA extraction
Barnacles were collected on the west side of Catalina Island (Little Harbor) on 07/09/2014
– 07/10/2014 Sunrise was at 6:31am and sunset was at 8:07pm. Sampling started at 6:00 am and
continued every three hours until 3:00 am the next day (8-time points) (Figure 3.1). At each time
point, three barnacles were samples from the same location. Barnacles were removed from the
rocks and immediately flash frozen in liquid nitrogen and then transported to Wrigley Marine
Science Center (WMSC) for storage in a -80 °C freezer. The tide height at the time of the sampling
was recorded.
Transcriptome library preparation and sequencing
RNA was extracted from each individual animal using the Direct-zol™ RNA Microprep
from Zymo research (Catalog No: R2051). Each animal was carefully dissected and a small piece
of cirrus tissue was transferred in the Trizol reagent and then RNA was extracted following the
43
protocol. The extraction protocol was implemented entirely in a 4 °C room to hinder any RNA
degradation. One sample from time point 4 (3:00 pm) did not yield enough RNA, so it was
excluded from the experiment. Extracted RNA was checked for degradation in a Bioanalyzer 2100
(Agilent, Santa Clara, CA, USA). RNA quality was high for all samples, with RIN (RNA Integrity
Number) numbers varying between 7 and 10. This number indicates the integrity and quality of
the RNA and the numbers obtained are within the good integrity range based on the Bioanalyzer
2100 user manual. RNAseq libraries were prepared from 1-4 g of total RNA, using a Kapa
Stranded mRNA-Seq Kit for Illumina platforms (Kapa Biosystems, Inc., Wilmington, MA, USA),
which isolates mRNA via poly(A) capture. Final library products were quantified using the Qubit
2.0 Fluorometer (Thermo Fisher Scientific Inc., Waltham, MA, USA), and the fragment size
distribution was determined with the Bioanalyzer 2100. The libraries were then pooled
equimolarly, and the final pool was quantified via qPCR, using the Kapa Biosystems Library
Quantification Kit, according to manufacturer’s instructions. The pool was sequenced in an
Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA), in High-Output paired-end 125
cycles format. The preparation of the libraries and the sequencing was performed at the USC
Genome Core (University of Southern California, Los Angeles, CA, USA).
De-novo transcriptome assembly
The transcriptome was assembled de-novo using the Trinity software (v2.1.1) (Haas et al.,
2013). Fastq files obtained from the sequencing run (23 paired-end samples in total) are trimmed
to remove sequencing adapters and low quality reads, using trimmomatic (Bolger et al., 2014).
After trimming, as suggested in Trinity guidelines, all fastq are merged and fed into the Trinity
assembler. Trinity was run with the parameter --normalize_reads to use the digital normalization
44
before the assembly. Digital normalization regulate the sequencing coverage and decreases
sampling variation this removed redundancies in data and removes most the sequencing related
errors (Brown, Howe, Zhang, Pyrkosz, & Brom, 2012). All other parameters were kept in the
default values. The assembler is run on high performance computing platform of USC using 128
GB of memory.
Mapping and feature counting
Read mapping and feature counting was performed, following the suggested downstream
analysis protocol in Trinity. The sequencing reads for each individual sample were aligned to the
transcriptome assembly using Bowtie2 (Langmead & Salzberg, 2012) with the following
parameters: --no-mixed --no-discordant --gbar 1000 --end-to-end -k 200. The abundance of each
transcript, measured as read counts, was estimated using RSEM (B. Li & Dewey, 2011).
Transcriptome annotation and filtering
Transcriptome annotation was performed using the guidelines of the Trinotate pipeline,
which is part of the Trinity software. We downloaded the public protein database from UniProt
(http://www.uniprot.org). Blast software (Altschup, Gish, Miller, Myers, & Lipman, 1990) was
used. Blast software suite contains two options, blastx, to search the protein database with a
translated nucleotide query, and blastp, to search the database with a protein query. First, blastx
was run using the fasta file generated from Trinity as input with parameters -max_target_seqs 1 -
outfmt 6 -evalue 10. Then “Transdecoder” module of Trinity was used to predict the most likely
open reading frame from the fasta file generated by Trinity. The outout of the Transdecoder was
used as input to blastp which was run with the same parameters. The annotation database was
45
generated using the Trinotate functions LOAD_swissprot_blastp and LOAD_swissprot_blastx. A
custom script was used to filter out any unannotated transcript (including any rRNA transcripts)
and to keep only transcripts annotated as “Metazoa”. The R package edgeR (Robinson, Mccarthy,
& Smyth, 2010) was used to calculate counts per million (cpm) for the read counts of each sample.
If a transcript had a cpm of less than 1 in more than 12 samples it was discarded. Transcripts that
survived the annotation based filtering and mapping based filtering were used in downstream
analyses.
Detecting genes associated with the tidal clock (circatidal rhythm)
After the filtering as described above read counts are normalized using edgeR (Robinson
et al., 2010) function calcNormFactors(). Normalized read counts are used as input for principal
component analysis (PCA). Figure 3.1 shows the sampling scheme, tide heights (bottom panel)
and the principal component 1 (PC1) (top panel). PC analysis was calculated from a linear
transformation of all 6,339 genes, every gene has a contribution to the value of PC1 this
contribution was called the loading of that gene. Red dashed horizontal line showed the sampling
location which was at 2.6 feet. When the tide height was below the red line, barnacles are exposed.
If tide was above the red line, barnacles were covered under the water. Between 6am and 9am tide
was coming up and barnacles went from exposed to covered state and we can see that PC1 went
up as well. Then between 9am and noon tide moved down and barnacles went from covered to
exposed and PC1 went down. Between Noon and 3pm barnacles are still exposed so PC1 continue
in the downward trend. At 3pm barnacles are partially covered but the tide was very close to the
sampling site, barnacles were starting to be covered by water and were starting to feel the waves
and then tide went up until 6pm barnacles go from exposed to covered and again PC1 keep
46
following the tide and went up. Between 6pm and 9pm barnacles were covered but at 8:07pm we
have sunset and we see a light decrease in PC1 and then it stays constant through the night. We
also observe a high variation in between the replicate samples. Although tide was coming down
between 9pm and midnight and barnacles went from covered to exposed PC1 does not follow this
pattern at night. Based on the observation that PC1 roughly follow the tidal cycle, we decided to
investigate genes correlated and anti-correlated with PC1. The correlation of the expression of
each gene and the PC1 was implemented using the R package FactoMineR, with the function
dimdesc(). This function calculated the correlation of each variable that was used in the principal
component analysis and gives the correlation coefficient and p-value of the correlation. Since this
was calculate for each gene independently, the resulting p-values were corrected using the R
function p.adjust() with the parameter “Bonferroni”. At the end of this analysis a list of genes
correlated (positive association) with PC1 and a list of genes anti-correlated (negative association)
with PC1 were detected.
Detecting genes associated with solar clock (circadian rhythm)
We divided all samples in two groups, depending on whether they were taken during the
day, or during the night. In the Daytime group, there were five timepoints, from 6 am to 6 pm,
with a total of 14 samples (as explained above, one of the replicates in timepoint 4 did not yield
any RNA). In the Nighttime group, there were three timepoints, from 9 pm to 3 am, with a total of
nine samples. Between these 2 groups, we performed a differential gene expression analysis using
the R package edgeR (Robinson et al., 2010). This package fits a generalized linear model (GLM)
to perform negative binomial regression to each gene. Statistical testing was done with the
exactTest function of edgeR, and obtained p-values were using R p.adjust() function with the
47
parameter “fdr”. Transcripts were considered as differentially expressed if the corrected p-value
was smaller than 0.05 and the log2 fold change was bigger than 1 (upregulation) or smaller than -
1 (downregulation).
Differential gene expression between Covered and Exposed samples
Samples are divided in two groups, based on their positions which respect to tide. Samples
from 9:00, 15:00, 18:00,21:00 were under the water at the time of sampling. These samples are
labelled as “covered”. Samples from 6:00,12:00,0:00 and 3:00 were out of the water during
sampling. These samples are labelled as “exposed”. Between these 2 groups, we performed a
differential gene expression analysis using the R package edgeR (Robinson et al., 2010). This
package fits a generalized linear model (GLM) to perform negative binomial regression to each
gene. Statistical testing was done with the exactTest function of edgeR, and obtained p-values were
using R p.adjust() function with the parameter “fdr”. Transcripts were considered as differentially
expressed if the corrected p-value was smaller than 0.05 and the log2 fold change was bigger than
1 (upregulation) or smaller than -1 (downregulation).
Differential gene expression between consecutive time points
Differential expression analysis between each consecutive time points was performed
using the R package edgeR (Robinson et al., 2010). exactTest was run between each time points
and obtained p-values are corrected using R p.adjust() function with the parameter “fdr”.
Transcripts were considered as differentially expressed if the corrected p-value was smaller than
0.05 and the log2 fold change was bigger than 1 (upregulation) or smaller than -1 (downregulation).
48
Investigating the interaction of the tidal and solar clocks
In order to look for the interaction of circadian and circatidal clocks we implemented an
analysis of variance (ANOVA) for the expression of each gene. Since RNA-seq data is usually
considered to be distributed with a negative binomial distribution (Robinson et al., 2010) in order
use ANOVA first the gene expression needs to be transformed. One of the common
transformations used for RNA-seq data is Voom method (Law, Chen, Shi, & Smyth, 2014). In
summary this method calculates the log-cpm transformation of the RNA-seq data, after the
transformation any linear model such as ANOVA can be applied to the data. The following model
was fit
𝐸 ~𝐷 + 𝐶 + 𝐷 ∗ 𝐶
where
E: Voom transformed gene expression
D: Binary variable either Day or Night
C: Binary variable either Covered or Exposed
D * C: The interaction of D and C
Obtained p-values from manova are corrected using R function p.adjust() with the
parameter “fdr”
Gene ontology analysis
In order to detect the biological processes activated based on the candidate gene list we
obtained from the above-mentioned analyses, we performed gene ontology (GO) enrichment
analysis. GO enrichment analysis was performed for candidate gene lists from the following four
lists: genes significantly correlated with tide, genes significantly anti-correlated with tide, genes
49
upregulated during night and genes upregulated during day using the R package GOseq (Young,
Wakefield, Smyth, & Oshlack, 2010). The R script to run GOseq was modified from the GOseq
script provided within Trinity de-novo genome assembly software. All genes passing the filtering
(6339 genes) were used as the background set for the GO analysis. The GO terms associated with
every gene in our de-novo assembly was predicted using the TRINONATE pipeline implemented
as an add-on to the Trinity software. GO enrichment analysis implemented as in GOseq checks if
any of the genes in our candidate gene list was enriched for any GO category. There are three main
classes for gene ontologies. Biological process, molecular function and cellular component. In this
study, we focused on the GO terms classified as Biological Process. One GO term can be
associated with multiple biological functions and there is a hierarchical relationship between GO
terms. In order to remove the redundancies and generate an interpretable list of GO annotations
we filtered the enriched GO terms with REVIGO (Supek, Bošnjak, Škunca, & Šmuc, 2011) with
similarity score 0.4. Directed acyclic graphs (DAG) are constructed for each GO term, after
filtering, using QuickGO (Binns et al., 2009) to investigate the relationship of the GO terms with
the parent terms and with each other to dissect the biological processes explained in our list of
candidate genes.
3.3 Results
De-novo transcriptome assembly, filtering, and read counting
Twenty-three samples were run on Illumina 2500 sequencer in the 125bp paired end mode.
A total of 488,631,952 reads were produced. All of these reads are pooled together for the Trinity
assembly which resulted in 330,522 putative transcripts. We checked the mapping support for
those putative transcripts by mapping the sequencing result of each individual sample back to the
50
transcriptome assembly. Assembled genes are filtered based on mapping support and annotation.
37,270 of these genes mapped back to the transcriptome assembly and were annotated as Metazoa.
After filtering by cpm threshold of 1 in at least half of the samples (12 samples) we obtained a
final set of 6,339 transcripts. These final set was used for all downstream analyses.
Genes associated with tidal cycle
Nine hundred and thirty-six genes were found to be positively associated with the first
principal component (Figure 3.1A) based on the Pearson correlation coefficient test. These genes
were following the tidal cycle, and their expression goes up when the tide is going up and goes
down when the tide is going down. Twenty-eight genes were found to be negatively associated
with the first principal component based on the same test (Figure 3.1A). The expression level for
these genes follows the tide in the opposite direction, i.e. they go up in expression when the tide is
going down.
51
Figure 3.1 Sampling scheme (A) Average principal component 1
for each time point. PC1 explains 34% of the variation. Error bars
shows the standard deviations (B) Tide height (feet) for each time
points. Horizontal dashed line shows the sampling location.
Samples below this line are covered with water. Samples above this
line are exposed. Vertical line marks the sunset at 20:07. PC1
oscillates following the tide height. Genes correlated and anti-
correlated with PC1 are analyzed.
Genes associated with solar clock
Twenty-eight genes were found differentially expressed between day time and night time
samples. Twenty genes were upregulated during night time and eight genes were upregulated
during day time. (Figure 3.2)
52
Figure 3.2 Daytime vs Nighttime differential expression
analysis. Horizontal red line shows the statistical significance
level α = 0.05. Horizontal lines show the fold change thresholds
-1 on the left and 1 on the right. 20 genes are upregulated at
night and 8 genes are upregulated at day time.
Covered vs Exposed analysis
Total of four genes are found to be significant. One upregulated gene at exposed and three
upregulated gene at covered conditions are detected. The annotation for the gene upregulated at
exposed is a Tubulin beta chain. The three genes upregulated as covered condition are annotated
as, lipid transporter activity, Zinc finger and SCAN domain-containing protein 2 and Calcium-
transporting ATPase 8, plasma membrane-type.
53
Figure 3.3 DE Exposed vs Covered Horizontal red line shows
the statistical significance level α = 0.05. Horizontal lines show
the fold change thresholds -1 on the left and 1 on the right. 1
gene is upregulated in the exposed group and 3 genes are
upregulated in the covered group
Investigating the interaction of the tidal and solar clocks
Three variables are used for the anova. Day/night status, covered/exposed status and the
interaction of these two variables. All assembled genes (6339 genes) after filtering were used as
input to this model. The variable day/night status is found be significant in 604 genes, the variable
covered/exposed status is found to be significant in 538 genes and the interaction term is found to
be significant in 262 genes. These are the results before the application of multiple testing
correction, after multiple testing correction no variable is found to be significant.
54
GO analysis results
In the four categories, positive association with tide, negative association with tide,
upregulated during night and upregulated during day, of genes investigated, we found 144, 13, 54,
and 33 GO terms respectively. After filtering with REVIGO we got 30, 4, 12, and 6 GO terms
respectively. Analysis of DAGs generated from the filtered GO terms revealed the following
results. For the genes positively associated with the tide, 20 out of the 30 significantly correlated
GO terms are associated with "Metabolic activity". The rest of the terms are associated with
cellular processes (Figure 3.4). 20 GO terms can be seen on the left size of Figure 3.4 under the
metabolism category. All GO terms are given on Table 3.1
55
Table 3.1 Gene ontology terms from genes correlated with with
tide
Term Biological Process
Genes
correlated
with tide
Number of
genes in the
category
P-value
GO:0044260 cellular macromolecule metabolic process 449 2297 4.3E-11
GO:0006396 RNA processing 114 397 1.3E-10
GO:0044237 cellular metabolic process 551 3057 4.3E-09
GO:0043170 macromolecule metabolic process 473 2564 9E-08
GO:0043412 macromolecule modification 239 1103 2.9E-07
GO:0008380 RNA splicing 59 184 2.9E-07
GO:0016071 mRNA metabolic process 83 296 3.8E-07
GO:0044238 primary metabolic process 538 3051 1.3E-06
GO:0008152 metabolic process 587 3415 1.3E-06
GO:0071704 organic substance metabolic process 555 3190 4E-06
GO:0006807 nitrogen compound metabolic process 509 2887 4.6E-06
GO:1901360 organic cyclic compound metabolic process 323 1677 5.3E-06
GO:0046483 heterocycle metabolic process 307 1584 6E-06
GO:0006725 cellular aromatic compound metabolic process 308 1612 2.2E-05
GO:0034641 cellular nitrogen compound metabolic process 327 1771 3.5E-05
GO:0016569 covalent chromatin modification 61 228 0.00039
GO:0044265 cellular macromolecule catabolic process 71 270 0.00085
GO:0032268
regulation of cellular protein metabolic
process
144 673 0.00104
GO:0015931 nucleobase-containing compound transport 35 101 0.00123
GO:0009987 cellular process 755 4711 0.00206
GO:0006913 nucleocytoplasmic transport 32 95 0.0039
GO:0033044 regulation of chromosome organization 37 121 0.00703
GO:0018195 peptidyl-arginine modification 8 15 0.01913
GO:0046794 transport of virus 10 18 0.02518
GO:0035072
ecdysone-mediated induction of salivary gland
cell autophagic cell death
7 12 0.02846
GO:0060033 anatomical structure regression 8 16 0.03
GO:0007077 mitotic nuclear envelope disassembly 7 10 0.03523
GO:0032259 methylation 30 104 0.03843
GO:0035601 protein deacylation 10 21 0.04331
GO:0098732 macromolecule deacylation 10 21 0.04331
56
Figure 3.4 Positive Association with tide 30 GO Terms detected
GO terms are shown in yellow
57
In the second category, genes associated negatively with the tide, we found terms related
to unsaturated fatty acid metabolism, NADP biosynthetic process and reactive oxygen species
metabolism (Figure 3.5). All 4 GO terms are under the broad category of cellular metabolic
process. Number of genes found for each GO terms and the the enrichment p-value is given on
Table 3.2
Table 3.2 Negative Association with tide
Term Biological Process
Genes anti-
correlated with
tide
Number of genes
in the category P-value
GO:0006741 NADP biosynthetic process 4 4 9.74E-08
GO:0072593 reactive oxygen species metabolic process 5 43 0.001342
GO:0042744 hydrogen peroxide catabolic process 4 27 0.004364
GO:0033559 unsaturated fatty acid metabolic process 3 17 0.049821
58
Figure 3.5 Negative Association with tide for four, GO Terms
detected (yellow).
The third category, genes upregulated during day, generated terms related to regulation of
cytokine production, regulation of T cell migration, negative regulation of lipoprotein and
regulation of smooth muscle cell matrix (Figure 3.6). Number of genes found for each GO terms
and the the enrichment p-value is given on Table 3.3
Table 3.3 Gene ontology terms from genes upregulated during
day time
Term Biological Process
Genes
upregulated
during day
Number of
genes in the
category
P-value
GO:0014012 peripheral nervous system axon regeneration 2 3 0.0046
GO:0050748 negative regulation of lipoprotein metabolic process 2 3 0.0046
GO:2000097 regulation of smooth muscle cell-matrix adhesion 2 3 0.0046
GO:2000404 regulation of T cell migration 2 3 0.0046
GO:1900015
regulation of cytokine production involved in inflammatory
response
2 3 0.0046
GO:0048662 negative regulation of smooth muscle cell proliferation 2 5 0.0159
59
Figure 3.6 Upregulated at Day 6 GO Terms detected GO terms
are shown in yellow
60
The last category, genes upregulated during night, generated GO terms related to regulation
behavior, regulation of actin-filament based process and regulation of muscle cell differentiation
(Figure 3.7). Full list of GO terms with the associated enrichment p-values are given in Table 3.4.
Very small smaller than 10
-8
are given as zero in the table
Table 3.4 Gene ontologies from genes upregulated at night
time
Term Biological Process
Genes upregulated
at night time
Number of genes in
the category
P-value
GO:1902905
positive regulation of
supramolecular fiber organization
9 79 0
GO:0051302 regulation of cell division 8 60 0
GO:0051147
regulation of muscle cell
differentiation
9 50 0
GO:0018107 peptidyl-threonine phosphorylation 8 45 0
GO:0008344 adult locomotory behavior 8 60 0
GO:0032970
regulation of actin filament-based
process
9 149 8E-08
GO:0044087
regulation of cellular component
biogenesis
10 334 9E-06
GO:0007610 behavior 9 295 7E-05
GO:0018193 peptidyl-amino acid modification 8 289 0.0004
GO:0032501 multicellular organismal process 14 1279 0.0011
GO:0016310 phosphorylation 8 328 0.0013
GO:0006793 phosphorus metabolic process 10 654 0.0031
61
Figure 3.7 Upregulated at Night 12 GO Terms detected GO terms are shown in yellow
3.4 Discussion
In this study, we investigated temporal patterns of gene expressions related to the circatidal
(lunar based) and circadian rhythms (solar based) in the intertidal habitat using P. polymerus as
the model organism. We used we used PCA (principal component analysis) and differential gene
expression analysis based on the whole transcriptome sequencing of the cirrus tissue. Intertidal
animals’ gene expression is governed by two clocks: tidal clock (circatidal) and solar clock
(circadian) (Zhang et al., 2013). To understand the circatidal and circadian clocks, we investigated
genes correlated/anti-correlated with the tide levels and genes differentially expressed between
day/night respectively. Genes positively correlated with the tide level revealed GO terms mostly
associated with metabolism. This is consistent with the fact that barnacles received their nutrients
when they were covered with water. As the tide was going up the metabolism of barnacles is going
up. These results are consistent with the findings of Gracey et al. 2008. In their study M.
62
californianus’ gene expression has been studied in a similar intertidal habitat. They performed both
field experiments and laboratory trials, although they used microarrays to measure the expression
level of each gene, their downstream analysis is similar to the methods implemented in our study.
They performed a PCA on the gene expression profile of the gill tissue (mussels use gills to feed
which is similar to the role of cirri in barnacles) using a similar approach to ours by comparing the
principal components of the gene expression profile to the tide heights. They have found that first
principal component explained 28% of the variation, in our study first principal component
explained 34% of the variation. Similarly, for genes that are positively correlated with the tide they
reported genes for which the molecular function is mainly related to metabolism. Decoupling of
circadian can circatidal clock is a very challenging problem. We attempted to solve this problem
in our analysis by fitting an anova to the gene expression value of each gene by using the day/night
status, covered/exposed status and the interaction term. Unfortunately, our model did not yield any
statistical significance after multiple testing correction. This might be since gene expression values
need to be normalized before applying an anova. We implemented this normalization process using
the R function voom which is part of the limma package (Law et al., 2014). It is possible that this
method has created some problem. This needs further investigation by exploring other
normalization methods or by directly applying a negative binomial generalized linear model with
negative binomial regression directly to the gene expression values without normalization.
Genes anti-correlated with the tide level mostly falls in the category of reactive oxygen
species (ROS) metabolism. When tide is falling barnacles are exposed to the direct sunlight and
open air. Our results can be explained by two factors: damage due to UV light and aerial
respiration. When intertidal animals are exposed to direct sunlight and desiccation, this is known
to cause tissue and cell damage and elicit a transcriptomic response, induce reactive oxygen species
63
and generate H2O2. (Collén, Guisle-Marsollier, Léger, & Boyen, 2007; Contreras-Porcia, Thomas,
Flores, & Correa, 2011). P. pollicipes is capable of gas exchange both inside and outside the water.
It has been shown that gas exchange outside the water is more efficient (Petersen et al., 1974).
This increased efficiency of gas exchange, while exposed to air, might explain the detection of
ROS related GO terms for the genes anti-correlated with the tide level.
Genes upregulated during daytime are found to be associated with cytokine production,
regulation of T cell migration. These are molecular functions related to immune response. It has
been shown in multiple invertebrate systems that animals can elicit immune response to abiotic
factors such as heat shock and oxidative stress. (Hatanaka, Sekine, Hayakawa, Takeda, & Ichijo,
2009; Muralidharan & Mandrekar, 2013). Barnacles exposed to direct sunlight and desiccation
might experience heat and oxidative stress, and this might explain the biological processes we have
detected for genes upregulated during daytime. Heat and oxidative stress response are part of the
circatidal rhythm.
In the last group of genes investigated in our study, those upregulated during nighttime,
genes fall into the category of actin-filament based processes and muscle cell differentiation. These
results might indicate that barnacles are experiencing a lot of muscle movement at nighttime. We
hypothesize that, these muscle movement can be due to feeding activity or mating behavior.
Barnacle feeding behaviors is determined by the water flow and wave strength. (Barnes & Reese,
1959). Highest tide level of the day was measured at 9:00pm in our experiment. High tide also
means higher water movement and potentially bigger waves. This will likely bring a lot of nutrients
and stimulate barnacles to feed which might result in increased feeding. Some barnacles species,
such as Tetraclita, have been shown to be nocturnal feeders (Pasternak, Achituv, Mina, &
Goodman, 2007). This is an advantageous strategy because barnacles need to open up and the cirri
64
need to be pulled out in order to feed, performing this during nighttime decreases the risk of
predation. It is important to note that our analysis focused on the expression of the cirrus tissue.
The expression profile of this tissue may not be directly related to these behaviors. Gene expression
profile of other tissues should be investigated with further functional experiments to have a better
understanding.
A word of caution should be exercised when interpreting the results from gene ontologies.
All our transcripts are assembled de-novo and annotated using public databases. This is a limitation
because all the functional knowledge is derived based on the sequence homology to other
organisms. Most of the biological research focuses on model organism and most of the
experimentally verified functional annotations in the public databases are from studies done in
model organisms such as fruit flies and mouse. Gene ontology consortium contains approximately
200 million entries (Huntley, Sawford, Martin, & O’Donovan, 2014). There are two ways of
generating annotations for gene ontologies, manual annotation and automatic annotation. Manual
annotation is performed by an experienced curator who manually investigates the literature to find
the functional annotation for a given gene ontology. Automatic approaches utilizes computer
algorithms based on sequence similarity, orthology and sequence domains (Burge et al., 2012;
Dimmer et al., 2012). It is important to note that almost 99% of gene ontology annotations are
automatically generated (Huntley et al., 2014). Despite these limitations gene ontology analysis
have the benefit of combining the functional annotation from many organisms to obtain a
structured way to describe biological processes (Yon Rhee, Wood, Dolinski, & Draghici, 2008).
Although there are major limitations with the application of gene ontology analysis and any results
obtained should be taken with caution, this method allows us to transfer the functional knowledge
derived from well-studied model organisms, to our study system, P. polymerus.
65
In this study we investigated how P. polymerus regulates gene expression in the ever-
changing intertidal habitat. Our study has many limitations. We only sampled one 24-hour cycle
and only one location on the Catalina Island. In a single 24-hour cycle circadian and circatidal
clocks are confounded, longer sampling periods or labarotory experiments are required in order to
decouple these two clocks. Main reason for having these limitations in this project is the lack of
funding. This is a proof of concept work that generated a working hypothesis on the regulation of
gene expressions within a 24-hour period for P. polymerus Working hypothesis is that P.
polymerus has an increased metabolic activity during day when they are covered with water. When
exposed to direct sunlight, they are experiencing oxidative stress and elicit genes responsible for
ROS metabolism. Their muscle movement increases at night, which might indicate increased
feeding or mating activity. This hypothesis needs to be tested with a better sampling design and
with the addition of laboratory trials which includes biochemical and physiological experiments
to obtain a complete understanding of the mechanisms employed by P. polymerus in terms of gene
expression regulation as a response to lunar and solar cycles.
66
Chapter 4 NGS to the rescue: the power of whole genome approaches to detect fine-scale
population structure
Abstract
Population genetics studies have been typically limited by the type and number of
molecular markers available. Next-Generation Sequencing (NGS) technologies have shifted the
field's perspective, allowing the surveillance of thousands of markers throughout the genome, and
opening the possibility of detecting subtle population structuring patterns that would be otherwise
impossible to detect with conventional molecular markers. In the case of the North American
stalked barnacle (Pollicipes polymerus), previous studies have reported contradictory results
regarding its population genetic structure across the California Transition Zone (CTZ), a well-
established biogeographic boundary. In this work, we have generated for the first time a draft of
the genome of P. polymerus, and have used that as a reference to identify more than 400,000 SNPs
in several populations from Southern California (USA) following a pooled sequencing approach.
We have detected a subtle, but significant, level of genetic differentiation between the populations
analyzed, suggesting a pattern of isolation by distance along the coastal populations, rather than a
sharp phylogeographic break across the CTZ. In addition, we show that P. polymerus barnacles
from Catalina Island (California, USA) are much more genetically differentiated, and propose that,
contrary to a previous hypothesis, optimal brooding season, and not water temperature, might be
a more important factor in causing such genetic divergence. Finally, we highlight the importance
of making genomic resources available for P. polymerus, an iconic intertidal species, with
important commercial value.
67
4.1 Introduction
The development of next-generation sequencing (NGS) technologies has significantly
improved sequencing throughput, reduced costs, and advanced research in many areas, including
evolutionary biology (Jones & Good, 2016; McCormack, Hird, Zellmer, Carstens, & Brumfield,
2013). Population genetics studies have been typically limited by the type and number of molecular
markers available. Allozymes, mitochondrial DNA, microsatellites, and SNP genotyping, all have
limitations in their capacity to accurately describe genetic variation and evolutionary processes in
natural populations (Allendorf, 2017). The advent of NGS technologies in recent years has shifted
the perspective, allowing the comparison of very large genomic regions between populations.
These genome-wide approaches allow the fine mapping of genetic differences between
populations (Hohenlohe et al., 2010), and have the power to unravel evolutionary patterns that
would be nearly impossible to detect with conventional markers (Locke et al., 2011; Miller et al.,
2012). NGS data can also help in resolving contradictory hypothesis formulated using
conventional molecular markers (Wang et al., 2013).
Several studies, using different types of molecular markers, have reported contradictory
results regarding the population genetic structure of the North American stalked barnacle,
Pollicipes polymerus, across Southern California. P. polymerus inhabits the rocky intertidal of the
western North American shores, from Baja California to Alaska (R. H. (Robert H. Morris, Abbott,
& Haderlie, 1980), spanning across the California Transition Zone (CTZ), a well-established
biogeographic boundary (Newman, 1979). Whether the CTZ is also acting as an intraspecific
phylogeographic barrier has been the object of much debate in recent years (Burton, 1998;
Dawson, 2001; Kelly and Palumbi, 2010; Eberl et al., 2013). Cimberg (1981) describes the
existence of two physiological races of P. polymerus, separated by the CTZ at Point Conception
68
(California). These two races are defined by the water temperature at which they brood. According
to Cimberg (1981), the northern race would be composed of two types of animals, brooding at the
same temperature, but at different seasons: type 1 barnacles, found north of Point Conception, with
a brooding peak during the summer, and type 3 animals, found at Santa Catalina Island
(California), brooding mostly during the winter. Type 2 animals, found at Latigo Point
(California), would correspond to the southern race. Cimberg (1981) argues that the presence of
type 2 and type 3 barnacles in Southern California, exposed to the same water temperatures, and
yet brooding at different times of the year, evidences the existence of genetic differences between
the two physiological races. Kelly and Palumbi (2010) analyzed cytochrome c oxidase I (COI)
mtDNA sequences from several populations of P. polymerus along the entire range of distribution,
and found significant levels of genetic differentiation between populations located north and south
of Point Conception. On the contrary, Van Syoc (1994) did not find any genetic differences
between P. polymerus populations from north and south of Point Conception in an analysis of both
allozymes and COI sequences. Miner (2002) did not find differences between northern and
southern populations either in an analysis of eight allozyme loci, suggesting that the two
physiological races described by Cimberg (1981) may not be genetically divided. In addition,
although Miner (2002) recognizes that there might still be genetic differences in other parts of the
genome, he points out that this would require that gene flow be restricted, a hypothesis that he
rejects arguing that there exist several mechanisms that could promote gene flow across the CTZ,
like El Niño Southern Oscillation event. Miner (2002) also rejects the idea of a potential
reproductive isolation between the two races because even though they differ in the temperature
of their maximum brooding activity, at least the southern race barnacles brood throughout the year,
69
indicating they are capable of brooding also at lower temperatures, like those experienced by the
northern populations.
All these studies are limited by the type and number of molecular markers employed, and
may not have enough resolution power to identify the potential presence of genetic differences
among P. polymerus populations. We decided to use NGS technologies to first obtain a draft of
the genome of P. polymerus that can be used as a reference, and then generate whole-genome
pooled sequencing data for several Californian populations from the northern and the southern
sides of the CTZ boundary, and including the three types of brooders described by Cimberg (1981).
The locations surveyed for this work are: Pismo Beach (Type 1), Latigo Point (Malibu)(Type 2),
Santa Catalina Island (Type 3), and San Diego.
Using these datasets, we aim to resolve the controversy regarding the potential genetic
differentiation across the CTZ in the North American stalked barnacle, P. polymerus, and between
the associated physiological races, and shed light in the debate over the presence or not of a
phylogeographical barrier at Point Conception. More specifically, the goals of the present work
are: i) generate a genome draft of P. polymerus that can be used as a reference for the identification
of SNPs and read mapping; ii) test for genetic differences between P. polymerus populations from
both sides of the CTZ boundary, which correspond to the northern and the southern physiological
races described by Cimberg (1981); iii) test for genetic differences between the three brooding
types described by Cimberg (1981).
70
4.2 Materials and Methods
Barnacle collection and DNA extraction
For the generation of a reference sequence of the P. polymerus genome, one adult
individual was collected from the Santa Monica pier (Santa Monica, CA, USA) on March 2013,
and kept in 70% ethanol during transportation to the lab. After the first round of libraries were
prepared (see below), the reference tissue and DNA were accidentally lost, so a second adult
barnacle was collected from the same location on November 2013.
For the pooled population analysis, we collected P. polymerus specimens from four
locations across southern California on July 2013: Pismo Beach (35° 8' 53.41" N 120° 39' 8.0" W),
Malibu (34° 1'50.31" N 118°45'0.50" W), Santa Catalina Island (33°23'6.05" N 118°28'30.47" W)
and San Diego (32°50'49.41" N 117°16'46.88" W) A total of 60 barnacles were collected in each
location, and kept in re-sealable zipper storage bags immersed in 70% ethanol for transportation
to the lab.
In all cases the DNA was extracted following the same methodology: tissue samples of 1-
5 mg were extracted from the peduncle of each specimen, and DNA was extracted using a Zymo
Quick g-DNA kit (Zymo Research Corp., CA, USA). Extraction success and DNA degradation
were visually inspected in 0.7% agarose gels. DNA concentration was measured using a Qubit 2.0
Fluorometer (ThermoFisher Scientific, Waltham, MA, USA).
De-novo genome assembly
One of the two reference individuals were used to generate three sequencing libraries,
compatible with Illumina platforms. The libraries were constructed following Dunham and Friesen
71
(2013), using different indexes, then pooled altogether, and sequenced in two lanes of an Illumina
HiSeq 2500, in High-Output ver. 3 mode.
The other reference individual was used to generate a mate pair library using an Illumina
Nextera Mate Pair Library Prep Kit (Illumina Inc., San Diego, CA, USA) with an average insert
size of 2-8 kilobases (kb). This library was sequenced across two lanes of the HiSeq 2500 in 100
bp paired-end format.
The Fastq files containing the raw sequencing reads were inspected with FastQC (Andrews,
2010) to check for average quality, and adaptor content. The reads were then trimmed based on
quality, and adaptor sequences were removed using Trimmomatic ver. 0.32 (Bolger et al., 2014)
with default parameters. Reads from the mate pair library were further inspected and filtered using
NextClip (Leggett, Clavijo, Clissold, Clark, & Caccamo, 2014). Only reads with at least one
junction adapter present on either of the pairs were used in the assembly. The resulting high-quality
trimmed reads were used for de-novo assembly with SOAPdenovo2 (Luo et al., 2012), using
default parameters. The reads from the three sequencing libraries prepared from the first reference
individual were used in the first step of the assembly for contig generation. The reads within the
mate pair library were used for the scaffolding step.
Pooled library preparation and sequencing
For each of the four locations surveyed (Pismo Beach, Malibu, Santa Catalina Island, and
San Diego), we prepared two pools of 30 individuals (240 individuals in total), containing the same
amount of DNA from each barnacle. These eight pools were used to prepare whole-genome
libraries for Illumina sequencing. The libraries were prepared following Dunham and Friesen
(2013), and sequenced on a HiSeq 2500 on four lanes of a ver. 4 high-output flowcell.
72
All reference and pooled whole-genome libraries were processed and sequenced at the UPC
Genome Core (University of Southern California, Los Angeles, CA, USA). The libraries were
quantified using the Qubit 2.0 Fluorometer, and the fragment size distribution was determined with
an Agilent Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA). All libraries were quantified via
qPCR using the Kapa Biosystems Library Quantification Kit (Kapa Biosystems, Wilmington, MA,
USA), according to manufacturer’s instructions. As mentioned above, the sequencing was
performed in an Illumina HiSeq 2500 platform.
Mapping and SNP calling
Raw Illumina reads obtained from the sequencing of the pooled libraries were trimmed
based on quality, and adaptor sequences were removed using Trimmomatic version 0.32 (Bolger
et al., 2014). Trimmed reads were mapped to the genome assembly with BWA (Li & Durbin, 2009)
using the MEM algorithm with default parameters. The resulting alignments were sorted with
SAMtools (Li et al., 2009) and PCR duplicates were removed with Picard tools
(https://github.com/broadinstitute/picard).
A pileup file was generated using the mpileup command of SAMtools (using the -A option
to include orphan reads). This pileup file was used as input for the population genomics analysis
described in the next section.
Population genomics analysis
For the analysis of the pooled population sequencing data, we discarded all fragments
shorter than 1,000 bp from the reference assembly, and used only those positions with a coverage
of at least 60X (equal to the number of individuals in each pool). For positions with a coverage
73
higher than 60X, we randomly subsampled to 60X. We used PoPoolation2 (Kofler, Pandey, &
Schlötterer, 2011) to calculate allele frequencies for all the SNPs that passed the coverage filter in
each population. To estimate the level of genetic differentiation between populations, we
calculated FST using the same program for every SNP between all pairs of samples, and obtained
a genome-wide FST value by averaging across all SNPs. In order to detect potential genomic
regions under selection, we performed a genome scan by plotting FST values across all scaffolds
bigger than 1,000 bp and searched for outlier regions.
To assess the statistical significance of FST values, we estimated the level of population
genetic differentiation under the assumption of panmixia and very large effective population size,
using a simulation procedure from the empirical allele counts as follows: for every position in the
genome, we started with a total of 240 mapping bases (i.e. four populations times 60X coverage
limit). We counted how many A, C, G, and T were present. We sampled without replacement 60
alleles at a time and randomly reassigned them to each of the four populations. We then calculated
all possible pairwise FST between the populations. We ran 1,500 iterations of this simulation
procedure, and obtained a range of pairwise FST values. Finally, we compared the FST values
obtained from the actual population data to these simulated values.
In addition, we also did a Principal Component Analysis (PCA) with the minor allele
frequencies of each SNP in each population, using the R (R Core Team, 2016) package
FactoMineR (Lê, Josse, & Husson, 2008).
74
4.3 Results
De-novo genome assembly
We used two adult individuals for the P. polymerus genome assembly. One was used to
produce regular libraries for contig assembly, and the second one was used to construct mate pair
libraries for the scaffolding step. Summary statistics for the draft genome assembly are given in
Table 1. We obtained a total of 5,554,320 scaffolds, of which ~300,000 were 1,000 bp or longer.
The amount of DNA in picograms of the haploid genome (i.e. the C-value) of P. polymerus has
been estimated to be 0.9 (Bachmann & Rheinsmith, 1973). This amount equals to a theoretical
genome size of 880 Mbp. However, the total length of all the assembled scaffolds is ~1,600 Mbp,
almost doubling the estimation from C-value. The N50 of the assembly was 7,225 bp.
Pooled sequencing and read mapping
We generated a total of 235,941,958,280 bp for pooled population genomics analysis from
240 individuals (four population, 60 individuals from each population). Mapping rates to the
reference genome assembly were 96% for Pismo Beach, 98% for Malibu, 90% for Catalina Island,
and 95% for San Diego. This results in an average coverage of 79X, 50X, 70X, and 62X
respectively (assuming a genome size of 880 Mbp).
Population genomics analysis
After filtering, we obtained a set of 411,891 SNPs for the analysis. The PCA yielded only
three Principal Components, which, combined, accounted for all the variation found in the data
(Figure 4.1), and revealed several population clustering patterns. PC1 separated Catalina Island
75
from the mainland populations, PC2 separated San Diego from the other three samples, and PC3
separated Pismo Beach.
FST analysis results are summarized in Table 2. In this case, all comparisons involving the
Catalina Island sample showed higher values than any other pairwise comparison (Figure 4.2). The
other three populations showed very similar levels of differentiation. To quantify the higher
differentiation of the Catalina Island sample, we calculated the ratio of the average pairwise FST
values across all comparisons including a population to all comparisons excluding that population
(Table 3). FST values involving Catalina Island are about 60% higher than the other pairwise
comparisons, which highlights the level of differentiation of this population.
The genome scan (Figures 3a and 3b) did not reveal any obvious outlier regions, and it
showed an even distribution of highly differentiated SNPs across all scaffolds in both sets of
pairwise comparisons, with and without Catalina Island.
The FST values obtained in our analysis were very low in general, so we wanted to see if
they are significantly higher than what we would expect under panmixia. We ran a simulation test
based on empirical allele counts, and obtained FST values in the range 0.0084+/- 0.0000504
(Supplementary Figure 1). The values obtained in our FST analysis are slightly above the simulated
range, implying that the level of differentiation we detect between the four P. polymerus
populations analyzed, is higher than what we would expect by chance.
4.4 Discussion
NGS has become a powerful tool for the generation of DNA sequencing data, and the
technology is now accessible to a growing number of researchers. Evolutionary and ecological
studies are no longer limited by the type and number of molecular markers used. There are a variety
of NGS strategies for the generation of informative sequencing data from population and
76
individuals, and they differ mainly in the cost and the level of expertise required for their
application (Jones & Good, 2016; McCormack et al., 2013). Arguably, whole-genome sequencing
(WGS) is the ideal approach, because it allows for an in-depth scan of the genome to search for
past and current demographic and selection signatures, and it has much more resolute power than
conventional molecular markers.
For this study, we have generated a draft of the genome for the North American stalked
barnacle P. polymerus, and we have used it as a reference for the mapping and identification of a
large number of SNPs in pooled sequencing data from four locations across Southern California.
The main goals were to elucidate the population structure pattern of P. polymerus in this region,
and shed light on the role of the CTZ boundary as a phylogeographical barrier.
Population genomics: is the CTZ a phylogeographical barrier for P. polymerus?
The CTZ is a well-recognized biogeographic boundary (Newman, 1979), that marks the
transition point between the Oregonian and the Californian faunas. However, whether the CTZ
also acts as an intraspecific phylogeographic barrier for marine and coastal invertebrate taxa has
been long debated. Many intertidal species show little or no population genetic structure across
the CTZ (Kelly & Palumbi, 2010). Additionally, although some species show clear patterns of
genetic differentiation across Point Conception (Burton, 1998; Eberl et al., 2013), the presence of
further phylogeographic structure within the northern and southern clades in these species, led
Burton et al. (1998) to question Point Conception as the intraspecific break point. Dawson (2001)
has even suggested that the main phylogeographic boundary may not be Point Conception, but
rather the Los Angeles region.
77
In the case of the North American staked barnacle, P. polymerus, several studies, based on
conventional genetic markers, have shown contradictory results. Van Syoc (1994) and Miner
(2001) did not find genetic divergence between populations north and south of Point Conception,
whereas Kelly and Palumbi (2010) found a small but significant level of genetic differentiation.
Cimberg (1981) did not analyze any genetic markers, but based on brooding differences, he
suggested that genetic differences must exist between the two clades. Our analysis of pooled whole
genome sequences from four locations in South California has uncovered an interesting pattern of
genetic structure. We have identified several genetic break points that clearly separate the samples
by geographical location. The only sample that did not show signs of differentiation was Latigo
Point, which is situated in the mainland, in between Pismo Beach, the northernmost population,
and San Diego, the southernmost location. The FST values between the three mainland populations
are low, but significantly higher than expected under panmixia. These results suggest a pattern of
isolation by distance along P. polymerus coastal populations, rather than a sharp phylogeographic
break at Point Conception. Burton (1998) identified very similar phylogeographic breaks between
several mainland populations of Trigriopus californicus. However, the magnitude of the effect of
these genetic breaks on population structure is certainly different between the two species. T.
californicus presents very low levels of gene flow, and high genetic divergence even between
nearby populations, whereas the level of differentiation we have detected between P. polymerus
samples, although significant, is very shallow.
Catalina Island barnacles are different
Based on seasonal brooding differences, Cimberg (1981) recognized Catalina Island
barnacles as a different type; unlike the other populations he examined, these animals show a
78
brooding maximum during the Winter. However, he suggested that Catalina Island barnacles were
part of the northern race, based on their optimal brooding water temperature, and proposed that
genetic differences must exist between the northern and the southern races. Interestingly, our
results placed Catalina Island as the most genetically divergent population, with both PCA and FST
analyses supporting a higher divergence from the three mainland populations. These results do not
support the pattern proposed by Cimberg (1981), and suggest that water temperature is not the
only environmental cue associated with optimal brooding conditions.
In the light of these results, an obvious question arises: can we find a genetic basis for the
difference in brooding between Catalina Island barnacles and the other populations? This would
imply that the genes controlling brooding activity are under selection, and therefore would show a
higher level of genetic differentiation that the rest of the genome. A FST scan across all scaffolds
in our assembly did not detect any obvious outlier genomic regions in the pairwise comparisons
involving Catalina Island (Figure 4.3), and the distribution of the most differentiated SNPs is
similar between the comparisons involving Catalina Island, and all other pairwise comparisons
(Figures 3a and 3b). The results potentially suggest that the higher divergence of the Catalina
Island population is due to demographic processes, rather than selection acting on brooding genes.
If there is not a genetic basis for the difference in brooding activity, then it is likely due to
phenotypic plasticity. The intertidal is a highly variable environment that varies in a predictable
way. In these cases, natural selection may favor plasticity over fixed genetic responses (DeWitt &
Scheiner, 2004). In fact, Miner (2002) concluded that the differences in brooding between the
northern and the southern physiological races of Cimberg (1981) must be due to plasticity. When
the environment is favoring a certain phenotype in a specific location, if gene flow is restricted,
genetic differences may arise, leading to population divergence. Our FST scan results support a
79
scenario of phenotypic plasticity in brooding activity, coupled with a somewhat restricted gene
flow between Catalina Island and the mainland populations.
Finally, as a word of caution, it is worth noting that our genome reference is a highly
fragmented assembly of hundreds of thousands of scaffolds, and that we do not know how many
genes are involved in the control of brooding activity, which might be potentially preventing the
detection of selection. Therefore, although our current results support the phenotypic plasticity
hypothesis, as stated above, we cannot totally rule out the idea of a genetic basis for the brooding
differences of Catalina Island barnacles.
A genome reference sequence for P. polymerus
As part of this work, we have sequenced and assembled a draft genome of the North
American stalked barnacle, P. polymerus. This is the first time a genome sequence is reported for
a pedunculate barnacle, and only the second of a cirriped, after Semibalanus balanoides (Flight &
Rand, 2012). Our assembly consists of 5.5 million scaffolds, with approximately 5% of them being
more than 1 Kbp in length, and a N50 of about 7 Kbp. The genome size estimate from the C-value
is 880 Mbp, whereas the estimation from the assembly is roughly double. This difference might
be due to variation in the insert size of the mate pair libraries used for scaffolding; if the insert size
was smaller than expected, the assembler might be overestimating the number of unknown base
pairs (i.e. N's) between contigs in the scaffolds, thus inflating the estimated size. Indeed, a quick
assembly run with SOAPdenovo2, excluding the mate pair libraries yielded an estimate of 930
Mbp, much closer to the C-value estimate. Reported genome sizes for other barnacle species go
from 723 Mbp to 1.4 Gbp, with the exception of Tetraclita rubsecens, with an estimated genome
of 2.54 Gbp (Gregory, 2017, and references therein).
80
The availability of a genome reference for P. polymerus will facilitate future evolutionary
and ecological study on this iconic intertidal species. Although still in a preliminary draft state, the
genome assembly used in this study has allowed us to identify more than 400,000 SNPs that were
used for population analysis. Current work in our lab includes improving the assembly, and
genome annotation.
P. polymerus is also an important commercial species. Fisheries for this animal exist in
British Columbia (Canada). Although these fisheries supply local markets, the main bulk of the
catches is exported to Spain and Portugal to supplement the demand of the sister species P.
pollicipes in the Iberian markets (López, López, Pham, Isidro, & De Girolamo, 2010). P. pollicipes
is considered a delicacy in Spain and Portugal, and can reach very high prices in the market.
Consequently, the potential culture of this species has attracted much interest (Franco et al. 2015).
An annotated genome reference of P. polymerus can greatly help to establish successful
aquaculture in the long term.
81
Figure 4.1 Principal component analysis (PCA) of minor allele
frequencies in the four populations analyzed in this work. a) PC1
and PC2, and b) PC1 and PC3.
Figure 4.2 Boxplots of FST distribution for all pairwise
comparison among the four populations.
82
Figure 4.3 FST values across all scaffold in the assembly.
The red line indicates 0.1% quantile. a) Pairwise comparisons
involving Catalina Island b) Pairwise comparisons without Catalina
Island
83
Supplementary Figure 1. Range of simulated FST values under the null hypothesis of
panmixia.
Table 4.1 Summary of assembly statistics
Number of scaffolds (>=
1000 bp) 300,833
Total length (>= 1000 bp) 1,631,375,551
GC (%) 42.17
N50 7225
Table 4.2 Pairwise genome-wide average FST calculations
Mean FST Malibu
San
Diego Catalina
Pismo Beach 0.0091 0.0093 0.0138
84
Malibu
0.0094 0.0142
San Diego
0.0133
Table 4.3 Ratio test for each population. For each population
P, sum of FST values from pairwise comparisons that include P is
divided into the sum of FST values from pairwise comparisons that
does not include P. The resulting fold difference shows that
pairwise comparisons including Catalina Island is ~58% higher than
others.
Fold
difference
Malibu 0.899
Pismo Beach 0.874
San Diego 0.864
Catalina
Island 1.480
85
Chapter 5 General Conclusion
In this thesis next generation was used to study sessile marine invertebrates. In the second
chapter the genetic basis of resistance to OsHV-1 virus for Pacific Oyster was studied. This study
revealed a list of candidate loci for resistance to virus. These markers are likely candidates for
biomarker assisted selection. In the second third the adaptation of Gooseneck barnacle, P.
polymerus, was inspected. Intertidal habitat is an ever-changing habitat due to the tides.
Understanding how animals adapt to these conditions by investigating the circadian and circatidal
clocks is a very crucial study both from an aquaculture perspective and from an ecological
perspective. This study generated a list of candidate biological processes that have the potential to
be responsible for P. polymerus’ rhythms as a response to lunar and solar cycles. Last chapter
examined the population structure of P. polymerus and discovered that Catalina Island populations
are highly diverged from the mainland populations and this chapter generated the first de-novo
genome assembly for this species.
Significance and impact of this study
This study utilized next generation sequencing to dissect the genetic basis of disease
resistance and internal clocks in two marine sessile invertebrates. Pacific Oyster has a published
genome assembly but the assembly has many problems as stated in the first chapter of this thesis.
In order to overcome these problems, we used a linkage map generated form a family of random
bred oysters from unrelated parents. Gooseneck barnacle on the other hand is an organism with no
genomic resources available. In this thesis we did a de-novo transcriptome analysis and a de-novo
genome analysis of this species. As the next generation sequencing technologies advances
researchers will be able to study a wide variety of organism outside of the traditional model
86
organism. This thesis shows a proof of concept on how much one can achieve by starting from
scratch to study an organism in the wild with limited genomic resources.
87
References
Allada, R., White, N. E., So, W. V., Hall, J. C., & Rosbash, M. (1998). A Mutant Drosophila
Homolog of Mammalian Clock Disrupts Circadian Rhythms and Transcription of period and
timeless. Cell, 93(5), 791–804. https://doi.org/10.1016/S0092-8674(00)81440-3
Allendorf, F. W. (2017). Genetics and the conservation of natural populations: allozymes to
genomes. Molecular Ecology, 26(2), 420–430. https://doi.org/10.1111/mec.13948
Altschup, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic Local
Alignment Search Tool. J. Mol. Biol, 215, 403–410. Retrieved from
https://www.ncbi.nlm.nih.gov/pubmed/2231712
Andrews, S. (2012). FastQC A Quality Control tool for High Throughput Sequence Data.
Retrieved from https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Bachmann, K., & Rheinsmith, E. L. (1973). Nuclear DNA Amounts in Pacific Crustacea.
Chromosoma (Berl.), 43, 225–236. Retrieved from
https://link.springer.com/content/pdf/10.1007%2FBF00294271.pdf
Baird, N. A., Etter, P. D., Atwood, T. S., Currey, M. C., Shiver, A. L., Lewis, Z. A., … Johnson,
E. A. (2008). Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers.
PLoS ONE, 3(10). https://doi.org/10.1371/journal.pone.0003376
Barazandeh, M., Davis, C. S., Neufeld, C. J., Coltman, D. W., & Palmer, A. R. (2013). Something
Darwin didn’t know about barnacles: spermcast mating in a common stalked species.
Proceedings. Biological Sciences, 280(1754). https://doi.org/10.1098/rspb.2012.2919
Barazandeh, M., & Palmer, · A Richard. (2015). Novel mating modes on wave‑swept shores: aerial
copulation and sperm release in an intertidal stalked barnacle. Mar Biol, 162, 881–888.
https://doi.org/10.1007/s00227-015-2631-y
88
Barnes, H., & Reese, E. S. (1959). Feeding in the pedunculate cirripede Pollicipes polymerus J. B.
Sowerby. Proceedings of the Zoological Society of London, 132(4), 569–584.
https://doi.org/10.1111/j.1469-7998.1959.tb05537.x
Barnes, H., & Reese, E. S. (1960). The Behaviour of the Stalked Intertidal Barnacle Pollicipes
polymerus J. B. Sowerby, with Special Reference to its Ecology and Distribution. Jounal of
Animal Ecology, 29(1), 169–185. https://doi.org/10.2307/2276
Binns, D., Dimmer, E., Huntley, R., Barrell, D., O’Donovan, C., & Apweiler, R. (2009). QuickGO:
a web-based tool for Gene Ontology searching. Bioinformatics (Oxford, England), 25(22),
3045–3046. https://doi.org/10.1093/bioinformatics/btp536
Blaylock, R. B., & Bullard, S. A. (2014). Counter-insurgents of the blue revolution? Parasites and
diseases affecting aquaculture and science. The Journal of Parasitology J. Parasitol, 100(6),
743–755. Retrieved from http://www.jstor.org/stable/24625277
Bohra, A. (2013). Emerging paradigms in genomics-based crop improvement. The Scientific
World Journal, 2013. https://doi.org/10.1155/2013/585467
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina
sequence data. Bioinformatics (Oxford, England), 30(15), 2114–2120.
https://doi.org/10.1093/bioinformatics/btu170
Broman, K. W., Wu, H., Sen, S., & Churchill, G. A. (2003). R/qtl: QTL mapping in experimental
crosses. Bioinformatics Applications Note, 19(7), 889–890.
https://doi.org/10.1093/bioinformatics/btg112
Brown, C. T., Howe, A., Zhang, Q., Pyrkosz, A. B., & Brom, T. H. (2012). A Reference-Free
Algorithm for Computational Normalization of Shotgun Sequencing Data. Retrieved from
https://arxiv.org/pdf/1203.4802.pdf
89
Burge, C. A., & Friedman, C. S. (2012). Quantifying Ostreid Herpesvirus (OsHV-1) Genome
Copies and Expression during Transmission. Microbial Ecology, 63, 569–604.
https://doi.org/10.1007/s00248-011-9937-1
Burge, S., Kelly, E., Lonsdale, D., Mutowo-Muellenet, P., McAnulla, C., Mitchell, A., … Hunter,
S. (2012). Manual GO annotation of predictive protein signatures: the InterPro approach to
GO curation. Database : The Journal of Biological Databases and Curation, 2012, bar068.
https://doi.org/10.1093/database/bar068
Byrdt, R. H., Lut, P., Nocedalt, J., & Zhu, C. (1995). A LIMITED MEMORY ALGORITHM FOR
BOUND CONSTRAINED OPTIMIZATION*. SIAM J. ScI. COMPUT, 16(5), 1190–1208.
Retrieved from https://epubs.siam.org/doi/pdf/10.1137/0916069
Cheeseman, J. F., Fewster, R. M., & Walker, M. M. (2017). Circadian and circatidal clocks control
the mechanism of semilunar foraging behaviour. Scientific Reports, 7(1), 3780.
https://doi.org/10.1038/s41598-017-03245-3
Cimberg, R. L. (1981). Variability in brooding activity in the stalked barnacle Pollicipes
polymerus. Reference: Biol. Bull, 160(42). Retrieved from
https://www.journals.uchicago.edu/doi/pdfplus/10.2307/1540898
Collén, J., Guisle-Marsollier, I., Léger, J. J., & Boyen, C. (2007). Response of the transcriptome
of the intertidal red seaweed Chondrus crispus to controlled and natural stresses. New
Phytologist, 176(1), 45–55. https://doi.org/10.1111/j.1469-8137.2007.02152.x
Contreras-Porcia, L., Thomas, D., Flores, V., & Correa, J. A. (2011). Tolerance to oxidative stress
induced by desiccation in Porphyra columbina (Bangiales, Rhodophyta). Journal of
Experimental Botany, 62(6), 1815–1829. https://doi.org/10.1093/jxb/erq364
Core Team, R. (2017). R: A language and environment for statistical computing. R Foundation for
90
Statistical Computing. Vienna. Retrieved from https://www.r-project.org/
Davis, D. A. (2015). Feed and Feeding Practices in Aquaculture.
Davison, A. J., Trus, B. L., Cheng, N., Steven, A. C., Watson, M. S., Cunningham, C., … Andrew
Davison, C. J. (2005). A novel class of herpesvirus with bivalve hosts. Journal of General
Virology, 86, 41–53. https://doi.org/10.1099/vir.0.80382-0
Dégremont, L. (2011). Evidence of herpesvirus (OsHV-1) resistance in juvenile Crassostrea gigas
selected for high resistance to the summer mortality phenomenon. Aquaculture, 317(July),
94–98.
Dégremont, L. (2013). Size and genotype affect resistance to mortality caused by OsHV-1 in
Crassostrea gigas. Aquaculture, 416–417, 129–134.
https://doi.org/10.1016/J.AQUACULTURE.2013.09.011
Dégremont, L., Garcia, C., & Allen, S. K. (2015). Genetic improvement for disease resistance in
oysters: A review. Journal of Invertebrate Pathology, 131, 226–241.
https://doi.org/10.1016/j.jip.2015.05.010
Dégremont, L., Lamy, J. B., Pépin, J. F., Travers, M. A., & Renault, T. (2015). New insight for
the genetic evaluation of resistance to ostreid herpesvirus infection, a worldwide disease, in
Crassostrea gigas. PLoS ONE, 10(6), 1–12. https://doi.org/10.1371/journal.pone.0127917
Dégremont, L., Nourry, M., & Maurouard, E. (2015). Mass selection for survival and resistance to
OsHV-1 infection in Crassostrea gigas spat in field conditions: Response to selection after
four generations. Aquaculture, 446, 111–121.
https://doi.org/10.1016/j.aquaculture.2015.04.029
DeWitt, T. J., & Scheiner, S. M. (2004). Phenotypic plasticity : functional and conceptual
approaches. Oxford University Press.
91
Dimmer, E. C., Huntley, R. P., Alam-Faruque, Y., Sawford, T., O’Donovan, C., Martin, M. J., …
Apweiler, R. (2012). The UniProt-GO Annotation database in 2011. Nucleic Acids Research,
40(Database issue), D565-70. https://doi.org/10.1093/nar/gkr1048
Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S.
E. (2011). A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity
Species. PLoS ONE, 6(5), e19379. https://doi.org/10.1371/journal.pone.0019379
FAO. (2016). The State of World Fisheries and Aquaculture 2016. Retrieved from
http://www.fao.org/3/a-i5555e.pdf
Flight, P. A., & Rand, D. M. (2012). Genetic variation in the acorn barnacle from allozymes to
population genomics. Integrative and Comparative Biology, 52(3), 418–429.
https://doi.org/10.1093/icb/ics099
Fraser, D., Weary, D., Pajor, E., & Milligan, B. (1997). A Scientific Conception of Animal Welfare
that Reflects Ethical Concerns. Ethics and Animal Welfare. Retrieved from
https://animalstudiesrepository.org/ethawel/1
Friedman, C., Estes, R., Stokes, N., Burge, C., Hargove, J., Barber, B., … Reece, K. (2005). Herpes
virus in juvenile Pacific oysters Crassostrea gigas from Tomales Bay, California, coincides
with summer mortality episodes. Diseases of Aquatic Organisms, 63(1), 33–41.
https://doi.org/10.3354/dao063033
Garcia, C., Thébault, A., Dégremont, L., Arzul, I., Miossec, L., Robert, M., … Renault, T. (2011).
Ostreid herpesvirus 1 detection and relationship with Crassostrea gigas spat mortality in
France between 1998 and 2006. Veterinary Research, 42(1), 73.
https://doi.org/10.1186/1297-9716-42-73
Gracey, A. Y., Chaney, M. L., Boomhower, J. P., Tyburczy, W. R., Connor, K., & Somero, G. N.
92
(2008). Rhythms of Gene Expression in a Fluctuating Intertidal Environment. Current
Biology, 18(19), 1501–1507. https://doi.org/10.1016/J.CUB.2008.08.049
Green, T. J., Vergnes, A., Montagnani, C., & de Lorgeril, J. (2016). Distinct immune responses of
juvenile and adult oysters (Crassostrea gigas) to viral and bacterial infections. Veterinary
Research, 47(1). https://doi.org/10.1186/s13567-016-0356-7
Gregory, R. (2017). Animal Genome Size Database:: Home. Retrieved June 5, 2018, from
http://www.genomesize.com/
Guo, X., & Ford, S. E. (2016). Infectious diseases of marine molluscs and host responses as
revealed by genomic tools. Philosophical Transactions of the Royal Society of London. Series
B, Biological Sciences, 371(1689). https://doi.org/10.1098/rstb.2015.0206
Gutierrez, A. P., Bean, T. P., Hooper, C., Stenton, C. A., Sanders, M. B., Paley, R. K., … Houston,
R. D. (2018). A Genome-Wide Association Study for Host Resistance to Ostreid Herpesvirus
in Pacific Oysters ( Crassostrea gigas ). G3: Genes, Genomes, Genetics, 8(3).
https://doi.org/10.1534/g3.118.200113
Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., … Regev, A.
(2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform
for reference generation and analysis. Nature Protocols, 8(8), 1494–1512.
https://doi.org/10.1038/nprot.2013.084
Hardin, P. E., Hall, J. C., & Rosbash, M. (1990). Feedback of the Drosophila period gene product
on circadian cycling of its messenger RNA levels. Nature, 343(6258), 536–540.
https://doi.org/10.1038/343536a0
Hargrave, B. T. (2010). Empirical relationships describing benthic impacts of salmon aquaculture.
Aquaculture Environment Interactions. https://doi.org/10.3354/aei00005
93
Hatanaka, R., Sekine, Y., Hayakawa, T., Takeda, K., & Ichijo, H. (2009). Signaling pathways in
invertebrate immune and stress response. ISJ, 6, 32–43. Retrieved from
http://www.isj.unimo.it/articoli/ISJ180.pdf
Hedgecock, D., Shin, G., Gracey, A. Y., Den Berg, D. Van, & Samanta, M. P. (2015). Second-
Generation Linkage Maps for the Pacific Oyster Crassostrea gigas Reveal Errors in Assembly
of Genome Scaffolds. G3: GENES, GENOMES, GENETICS, 5(10), 2007–2019.
https://doi.org/10.1534/g3.115.019570
Helm, M. M. (2004). Hatchery culture of bivalves. A practical manual. (A. Lovatelli, Ed.). Rome:
Inland Water Resources and Aquaculture Department. Retrieved from
http://www.fao.org/docrep/007/y5720e/y5720e00.htm#Contents
Hohenlohe, P. A., Bassham, S., Etter, P. D., Stiffler, N., Johnson, E. A., & Cresko, W. A. (2010).
Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced
RAD Tags. https://doi.org/10.1371/journal.pgen.1000862
Houston, R. D., & Houston, R. D. (2017). Future directions in breeding for disease resistance in
aquaculture species. Revista Brasileira de Zootecnia, 46(6), 545–551.
https://doi.org/10.1590/s1806-92902017000600010
Hu, Z., & Xu, S. (2009). PROC QTL—A SAS Procedure for Mapping Quantitative Trait Loci.
International Journal of Plant Genomics. https://doi.org/10.1155/2009/141234
Hubert, S., Cognard, E., & Hedgecock, D. (2009). Centromere mapping in triploid families of the
Pacific oyster Crassostrea gigas (Thunberg). Aquaculture, 288, 172–183.
https://doi.org/10.1016/J.AQUACULTURE.2008.12.006
Hubert, S., & Hedgecock, D. (2004). Linkage maps of microsatellite DNA markers for the Pacific
oyster Crassostrea gigas. Genetics, 168(1), 351–362.
94
https://doi.org/10.1534/genetics.104.027342
Huntley, R. P., Sawford, T., Martin, M. J., & O’Donovan, C. (2014). Understanding how and why
the Gene Ontology and its annotations evolve: the GO within UniProt. GigaScience, 3(1), 4.
https://doi.org/10.1186/2047-217X-3-4
Jones, M. R., & Good, J. M. (2016). Targeted capture in evolutionary and ecological genomics.
Molecular Ecology, 25(1), 185–202. https://doi.org/10.1111/mec.13304
Kelly, R. P., & Palumbi, S. R. (2010). Genetic Structure Among 50 Species of the Northeastern
Pacific Rocky Intertidal Community. PLoS ONE, 5(1).
https://doi.org/10.1371/journal.pone.0008594
Kjøglum, S., Henryon, M., Aasmundstad, T., & Korsgaard, I. (2008). Selective breeding can
increase resistance of Atlantic salmon to furunculosis, infectious salmon anaemia and
infectious pancreatic necrosis. Aquaculture Research, 39(5), 498–505.
https://doi.org/10.1111/j.1365-2109.2008.01904.x
Kofler, R., Pandey, R. V., & Schlötterer, C. (2011). PoPoolation2: identifying differentiation
between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics
(Oxford, England), 27(24), 3435–3436. https://doi.org/10.1093/bioinformatics/btr589
Konopka, R. J., & Benzer, S. (1971). Clock mutants of Drosophila melanogaster. Proceedings of
the National Academy of Sciences of the United States of America, 68(9), 2112–2116.
Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/5002428
Kube, P., Dove, M., Cunningham, M., Kirkland, P., Gu, X., Hick, P., … Elliott, N. (2018). Genetic
Selection for Resistance to Pacific Oyster Mortality Syndrome.
Lafferty, K. D., DeLeo, G., Briggs, C. J., Dobson, A. P., Gross, T., & Kuris, A. M. (2015).
Ecological theory. A general consumer-resource population model. Science (New York, N.Y.),
95
349(6250), 854–857. https://doi.org/10.1126/science.aaa6224
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature
Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
Launey, S., & Hedgecock, D. (2001). High Genetic Load in the Pacific Oyster Crassostrea gigas.
Genetics, 159, 255–265. Retrieved from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1461778/pdf/11560902.pdf
Law, C. W., Chen, Y., Shi, W., & Smyth, G. K. (2014). voom: precision weights unlock linear
model analysis tools for RNA-seq read counts. Genome Biology, 15(2).
https://doi.org/10.1186/gb-2014-15-2-r29
Lê, S., Josse, J., & Husson, F. (2008). FactoMineR : An R Package for Multivariate Analysis.
Journal of Statistical Software, 25(1), 1–18. https://doi.org/10.18637/jss.v025.i01
Leggett, R. M., Clavijo, B. J., Clissold, L., Clark, M. D., & Caccamo, M. (2014). NextClip: an
analysis and read preparation tool for Nextera Long Mate Pair libraries. Bioinformatics
(Oxford, England), 30(4), 566–568. https://doi.org/10.1093/bioinformatics/btt702
Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with
or without a reference genome. BMC Bioinformatics, 12(323). https://doi.org/10.1186/1471-
2105-12-323
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics, 25(14), 1754–1760.
https://doi.org/10.1093/bioinformatics/btp324
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Subgroup, D. P. (2009).
The Sequence Alignment/Map format and SAMtools. Bioinformatics Applications Note,
25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
96
Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M., …
Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan genomes.
Nature, 469(7331), 529–533. https://doi.org/10.1038/nature09687
López, D. A., López, B. A., Pham, C. K., Isidro, E. J., & De Girolamo, M. (2010). Barnacle culture:
background, potential and challenges. Aquaculture Research, 41(10), e367–e375.
https://doi.org/10.1111/j.1365-2109.2010.02508.x
Luo, L., & Xu, S. (2003). Mapping viability loci using molecular markers. Heredity, 90, 459–467.
https://doi.org/10.1038/sj.hdy.6800264
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., … Wang, J. (2012). SOAPdenovo2: an
empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1), 18.
https://doi.org/10.1186/2047-217X-1-18
McCormack, J. E., Hird, S. M., Zellmer, A. J., Carstens, B. C., & Brumfield, R. T. (2013).
Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular
Phylogenetics and Evolution, 66(2), 526–538.
https://doi.org/10.1016/J.YMPEV.2011.12.007
Miller, W., Schuster, S. C., Welch, A. J., Ratan, A., Bedoya-Reina, O. C., Zhao, F., … Lindqvist,
C. (2012). Polar and brown bear genomes reveal ancient admixture and demographic
footprints of past climate change. Proceedings of the National Academy of Sciences of the
United States of America, 109(36), E2382-90. https://doi.org/10.1073/pnas.1210506109
Miner, B. G. (2002). Are the two physiological races of Pollicipes polymerus (Cirripedia)
genetically divided along the California coast? Invertebrate Biology California Coast (Ford
& Mitton Sarver & Foltz Palumbi Beauchamp & Powers Edmands et Al, 121(2), 158–162.
Retrieved from
97
https://www.jstor.org/stable/pdf/3227240.pdf?refreqid=excelsior%3A76f509e7a0aa6a43ed5
c1321a2a2f1c8
Morash, A. J., & Alter, K. (2016). Effects of environmental and farm stress on abalone physiology:
perspectives for abalone aquaculture in the face of global climate change. Reviews in
Aquaculture, 8, 342–368. https://doi.org/10.1111/raq.12097
Morris, R. H. (Robert H., Abbott, D. P., & Haderlie, E. C. (1980). Intertidal invertebrates of
California (pp. 504–535). Stanford University Press.
Morris, R. H., Abbott, D. P., & Haderlie, E. C. (1981). Intertidal invertebrates of California, with
31 text contributors. Curator: The Museum Journal (Vol. 24). Wiley/Blackwell (10.1111).
https://doi.org/10.1111/j.2151-6952.1981.tb01515.x
Muralidharan, S., & Mandrekar, P. (2013). Cellular stress response and innate immune signaling:
integrating pathways in host defense and inflammation. Journal of Leukocyte Biology, 94(6),
1167–1184. https://doi.org/10.1189/jlb.0313153
Newman, W. A. (1979). Californian transition zone: significance of short-range endemics.
Historical biogeography, plate tectonics and the changing environment. (pp. 399–416).
Corvallis: Oregon State University Press.
Park, S. T., & Kim, J. (2016). Trends in Next-Generation Sequencing and a New Era for Whole
Genome Sequencing. International Neurourology Journal, 20(Suppl 2), S76-83.
https://doi.org/10.5213/inj.1632742.371
Pasternak, Z., Achituv, Y., Mina, T., & Goodman, E. (2007). Feeding behavior of shallow-water
barnacles from the mediterranean and red seas. Journal of Crustacean Biology, 27(4), 543–
547. Retrieved from https://academic.oup.com/jcb/article/27/4/543/2548178
Pernet, F., Lupo, C., Bacher, C., & Whittington, R. J. (2016). Infectious diseases in oyster
98
aquaculture require a new integrated approach. Philosophical Transactions of the Royal
Society of London. Series B, Biological Sciences, 371(1689).
https://doi.org/10.1098/rstb.2015.0213
Petersen, J. A., Fyhn, H. J., & Johansen, K. (1974). Eco-physiological studies of an intertidal
crustacean, Pollicipes polymerus (Cirripedia, Lepadomorpha): aquatic and aerial respiration.
The Journal of Experimental Biology, 61(2), 309–320. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/4443730
Plough, L. V., & Hedgecock, D. (2011). Quantitative trait locus analysis of stage-specific
inbreeding depression in the pacific oyster Crassostrea gigas. Genetics, 189(4), 1473–1486.
https://doi.org/10.1534/genetics.111.131854
Plough, L. V., Shin, G., & Hedgecock, D. (2016). Genetic inviability is a major driver of type III
survivorship in experimental families of a highly fecund marine bivalve. Molecular Ecology,
25, 895–910. https://doi.org/10.1111/mec.13524
R Core Team. (2016). R: A language and environment for statistical computing. Retrieved from
https://www.r-project.org/
Rastas, P. (2017). Lep-MAP3: robust linkage mapping even for low-coverage whole genome
sequencing data. Bioinformatics, 33(23), 3726–3732.
https://doi.org/10.1093/bioinformatics/btx494
RJ, V. S. (1994). Molecular phylogenetics and population structure derived from mitochondrial
DNA sequence variation in the edible goose barnacle genus Pollicipes (Cirripedia,
Crustacea).
Robinson, M. D., Mccarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatic Applications
99
Note, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
Roque, A., Carrasco, N., Andree, K. B., Lacuesta, B., Elandaloussi, L., Gairin, I., … Furones, M.
D. (2012). First report of OsHV-1 microvar in Pacific oyster (Crassostrea gigas) cultured in
Spain. Aquaculture, 324–325, 303–306.
https://doi.org/10.1016/J.AQUACULTURE.2011.10.018
Samain, J.-F. (2011). Review and perspectives of physiological mechanisms underlying
genetically-based resistance of the Pacific oyster Crassostrea gigas to summer mortality.
Aquat. Living Resour. EDP Sciences, 24, 227–236. https://doi.org/10.1051/alr/2011144
Sauvage, C., Boudry, P., de Koning, D.-J., Haley, C. S., Heurtebise, S., & Lapègue, S. (2010).
QTL for resistance to summer mortality and OsHV-1 load in the Pacific oyster ( Crassostrea
gigas ). Animal Genetics, 41(4), 390–399. https://doi.org/10.1111/j.1365-2052.2009.02018.x
Segarra, A., Pépin, J. F., Arzul, I., Morga, B., Faury, N., & Renault, T. (2010). Detection and
description of a particular Ostreid herpesvirus 1 genotype associated with massive mortality
outbreaks of Pacific oysters, Crassostrea gigas, in France in 2008. Virus Research, 153(1),
92–99. https://doi.org/10.1016/J.VIRUSRES.2010.07.011
Supek, F., Bošnjak, M., Škunca, N., & Šmuc, T. (2011). REVIGO Summarizes and Visualizes
Long Lists of Gene Ontology Terms. PLoS ONE, 6(7), e21800.
https://doi.org/10.1371/journal.pone.0021800
Teer, J. K., & Mullikin, J. C. (2010). Exome sequencing: the sweet spot before whole genomes.
Human Molecular Genetics, 19(R2), R145–R151. https://doi.org/10.1093/hmg/ddq333
Thitamadee, S., Prachumwat, A., Srisala, J., Jaroenlak, P., Salachan, P. V., Sritunyalucksana, K.,
… Itsathitphaisarn, O. (2016). Review of current disease threats for cultivated penaeid shrimp
in Asia. Aquaculture, 452, 69–87. https://doi.org/10.1016/J.AQUACULTURE.2015.10.028
100
Turcotte, M. M., Araki, H., Karp, D. S., Poveda, K., & Whitehead, S. R. (2017). The eco-
evolutionary impacts of domestication and agricultural practices on wild species. Phil. Trans.
R. Soc., 372. https://doi.org/10.1098/rstb.2016.0033
Wang, Z., Pascual-Anaya, J., Zadissa, A., Li, W., Niimura, Y., Huang, Z., … Irie, N. (2013). The
draft genomes of soft-shell turtle and green sea turtle yield insights into the development and
evolution of the turtle-specific body plan. Nature Genetics, 45(6), 701–706.
https://doi.org/10.1038/ng.2615
Wilcockson, D., & Zhang, L. (2008). Circatidal clocks. Current Biology : CB, 18(17), R753–R755.
https://doi.org/10.1016/j.cub.2008.06.041
Yáñez, J. M., Houston, R. D., & Newman, S. (2014). Genetics and genomics of disease resistance
in salmonid species. Frontiers in Genetics, 5, 415. https://doi.org/10.3389/fgene.2014.00415
Yon Rhee, S., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology
annotations. Nature Reviews Genetics, 9(7), 509–515. https://doi.org/10.1038/nrg2363
Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Method Gene ontology
analysis for RNA-seq: accounting for selection bias. Genome Biology, 11. Retrieved from
http://genomebiology.com/2010/11/2/R14
Zehring, W. A., Wheeler, D. A., Reddy, P., Konopka, R. J., Kyriacou, C. P., Rosbash, M., & Hall,
J. C. (1984). P-element transformation with period locus DNA restores rhythmicity to mutant,
arrhythmic drosophila melanogaster. Cell, 39(2), 369–376. https://doi.org/10.1016/0092-
8674(84)90015-1
Zhang, G., Fang, X., Guo, X., Li, L., Luo, R., Xu, F., … Wang, J. (2012). The oyster genome
reveals stress adaptation and complexity of shell formation. Nature, 490(7418), 49–54.
https://doi.org/10.1038/nature11413
101
Zhang, L., Hastings, M. H., Green, E. W., Tauber, E., Sladek, M., Webster, S. G., … Wilcockson,
D. C. (2013). Article Dissociation of Circadian and Circatidal Timekeeping in the Marine
Crustacean Eurydice pulchra. CURBIO, 23, 1863–1873.
https://doi.org/10.1016/j.cub.2013.08.038
Abstract (if available)
Abstract
Seventy percent of Earth’s surface are oceans and 99% of the Earth’s living space is provided by the oceans. Besides, oceans provide the highest amount of protein for human consumption although it is being “farmed” mostly from wild animals. Domestication of marine life is almost not existent. Domestication of wild species improves yield and efficiency of farming. (Turcotte, Araki, Karp, Poveda, & Whitehead, 2017). This is a contrast to land-based agriculture production, which is mainly based of domesticated products. Ocean based farming, also called aquaculture, has great potential to feed the world. More precisely, aquaculture is defined as the breeding, rearing, and harvesting of all types of animals and plants in any type of water environments. Although aquaculture is a very old practice, up until the 1970’s the amount of sea food produced by aquaculture was approximately the same as the amount of sea food produced by wild capture. After the 1990’s, production from wild capture stayed the same, while aquaculture production increased steadily every year up until 2016 (FAO 2018) and is predicted to continue to grow (FAO, 2016). The main reason for this is that we are already harvesting the wild as much as we can and any further harvesting will result in the decay of natural environment, which provides the food. To prevent this, worldwide fisheries management activities are in place. Given that harvesting the wild habitats is not a sustainable way to increase food production, aquaculture is a great solution for this problem. ❧ There are different types of aquaculture, such as finfish aquaculture or shellfish aquaculture. Finfish aquaculture is an effective way of producing a high amount of fish without depleting the natural fisheries but it has many disadvantages compared to farming sessile invertebrates such as shellfish or barnacles. Farming of sessile invertebrates such as mussels, oysters or barnacles has many advantages compared to finfish aquaculture, among which are animal density and feeding requirements. First, these animals naturally occur at high densities in their natural habitat. Having a high animal density is a requirement for any farming practice. Other species, such as fish, need to be confined in a small space, which is very different than their natural habitat (Fraser, Weary, Pajor, & Milligan, 1997). Sessile invertebrates can often be kept at high density on farms, which are in the natural habitat. Secondly, fish require a very high amount of feeding material, which can be a problem both economically and environmentally (Davis, 2015). Sessile invertebrates, such as shellfish or barnacles, are filter feeders and can feed on the available algae and zooplanktons (FAO, 2016). The amount of waste produced by finfish aquaculture like salmon can have serious environmental impact (Hargrave, 2010). ❧ One problem that arises when animals are kept in high densities is the spread of disease. Viral and bacterial infections can easily transmit and kill crops. These types of diseases are manifested based on the interaction between the host and disease agent, which has three components: environment, timing and genetic background. The environmental conditions need to be such that the animal is susceptible and the disease agent is active. Disease agent needs to attack the host in the right time of the year and at the right stage in the host’s development. Beside the genetic background of the host animal is a very crucial factor in the resistance of the host against the disease agent. It has been shown that selective breeding can improve resistance against bacterial and viral infections (Houston & Houston, 2017
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
The evolution of gene regulatory networks
PDF
Genetic architectures of phenotypic capacitance
PDF
Diversity and dynamics of giant kelp “seed-bank” microbiomes: Applications for the future of seaweed farming
PDF
Genome sequencing and transcriptome analysis of the phenotypically plastic spadefoot toads
PDF
A multi-omics investigation into breeding shellfish for ocean acidification resilience in the California current system
PDF
Robustness and stochasticity in Drosophila development
PDF
Genetic architecture underlying variation in different traits in the Pacific oyster Crassostrea gigas
PDF
Investigating the potential roles of three mammalian traits in female reproductive investment
PDF
Genome-wide analysis of genetic load and larval mortality in a highly fecund marine invertebrate, the Pacific Oyster Crassostrea gigas
PDF
Physiological strategies of resilience to environmental change in larval stages of marine invertebrates
PDF
Discovery of mature microRNA sequences within the protein- coding regions of global HIV-1 genomes: Predictions of novel mechanisms for viral infection and pathogenicity
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Asset Metadata
Creator
Kitapci, Tevfik Hamdi
(author)
Core Title
Applications of next generation sequencing in sessile marine invertabrates
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Molecular Biology
Publication Date
08/05/2018
Defense Date
05/31/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
barnacle,circadian and circatidal clocks,Crassostrea gigas,marine Invertebrates,next generation sequencing,OAI-PMH Harvest,Pacific oyster,Pollicipes polymerus,viral resistance
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nuzhdin, Sergey (
committee chair
), Dean, Matthew (
committee member
), Ehrenreich, Ian (
committee member
), Hedgecock, Dennis (
committee member
)
Creator Email
thkitapci@gmail.com,tkitapci@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-54761
Unique identifier
UC11672501
Identifier
etd-KitapciTev-6649.pdf (filename),usctheses-c89-54761 (legacy record id)
Legacy Identifier
etd-KitapciTev-6649.pdf
Dmrecord
54761
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Kitapci, Tevfik Hamdi
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
barnacle
circadian and circatidal clocks
Crassostrea gigas
marine Invertebrates
next generation sequencing
Pacific oyster
Pollicipes polymerus
viral resistance