A GENERALIZATION OF THE ACCUMULATION CURVE IN SPECIES SAMPLING AND ITS APPLICATIONS TO HIGH-THROUGHPUT SEQUENCING

by Chao Deng

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS), May 2018.

Copyright 2018 Chao Deng

"Everything should be made as simple as possible, but no simpler." - Albert Einstein

Acknowledgments

I would like to thank Dr. Andrew Smith, who has been an excellent advisor and a friend. His hard work, dedication, and commitment to excellence were the most valuable lessons of my PhD studies. I would also like to express my gratitude to the rest of my dissertation committee, Dr. Fengzhu Sun and Dr. Stanislav Minsker, and my qualifying exam committee, Dr. Matt Dean, Dr. Paul Marjoram, and Dr. Peter Ralph. Their comments and insights greatly improved the quality of my work. I am fortunate to have had two talented collaborators, Dr. Timothy Daley and Dr. Peter Calabrese; it has been a real pleasure to work with them. I am grateful to have joined the Smith lab and to have gotten to know many awesome people who have helped me both in research and in life. Most importantly, I would like to thank my family for their love and patience. My wife, Jie Ren, has trusted and supported me all these years. My parents, Jianhua Miao and Jingping Deng, have always encouraged and believed in me. Without them, I would never have accomplished this work.

Contents

Epigraph
Acknowledgments
List of Figures
List of Tables
1 Introduction
2 Species accumulation curves
  2.1 Mixture models
  2.2 Rational function approximation to species accumulation curves
  2.3 Species accumulation curves in large-scale biological applications
    2.3.1 Species richness in a microbiome
    2.3.2 Age-related decrease in T-cell receptor repertoire
    2.3.3 k-mer diversity in genome assembly applications
  2.4 Conclusion and discussion
3 A generalization of the species accumulation curve
  3.1 Relating accumulation curves of first and higher orders
  3.2 Rational function approximation to r-species accumulation curves
  3.3 An algorithm for estimator construction
    3.3.1 Conditions for well-behaved rational functions
    3.3.2 The construction algorithm
    3.3.3 Variance and confidence interval
  3.4 Simulation studies
    3.4.1 Simulation results
    3.4.2 Examining performance on homogeneous populations
    3.4.3 Best practices
  3.5 Large-scale applications
    3.5.1 The vocabulary of Shakespeare and Dickens
    3.5.2 Followers in a social network
    3.5.3 Coverage in DNA sequencing experiments
  3.6 Conclusion and discussion
4 Coverage estimation for DNA sequencing experiments
  4.1 Overview of the method for predicting genome coverage
  4.2 Results
    4.2.1 Predicting genome coverage for WES samples
    4.2.2 Minimally sufficient number of sequences for WES experiments
    4.2.3 Predicting genome coverage for scWGS samples
    4.2.4 Optimal number of sequenced bases for scWGS experiments
    4.2.5 Dropout for scWGS and uniformity of coverage
  4.3 Methods
    4.3.1 Whole-exome sequencing datasets and data processing
    4.3.2 Single-cell whole-genome sequencing datasets and data processing
    4.3.3 Predicting nucleotides covered by at least r unique reads in WES experiments
    4.3.4 Estimating the benefit-cost ratio
  4.4 Conclusion and discussion
Reference List
A Proofs
  A.1 Proof of Theorem 3.1
  A.2 Proofs related to Theorem 3.2
  A.3 Proofs of Proposition 3.1
B The power series estimator of the r-species accumulation curve
C Generalizing estimators of the SAC to the r-SAC
D Sample coverage
E Supplementary materials

List of Figures

2.1 Annotated species as a function of the sample abundance
2.2 Age-related decrease in TCR repertoire
2.3 Distinct 31-mers as a function of sequenced 31-mers
3.1 Generalized species accumulation curves for a homogeneous model
3.2 Relative errors in simulation studies
3.3 Relative errors as a function of initial sample sizes
3.4 Histogram of the parameter m in $\Psi_{r,m}(t)$
3.5 Relative errors as a function of the degree of heterogeneity
3.6 Histogram of the parameter m for $\Psi_{r,m}(t)$ as a function of CVs
3.7 Distinct words represented at least r times in samples of Dickens' writing
3.8 Empirical versus expected number of distinct words
3.9 Twitter users with at least r followers in a sample of following relations
3.10 Species observed at least r times and their estimates for different values of r
3.11 Base pairs covered at least r times in a DNA sequencing dataset
4.1 Relative errors for WES datasets
4.2 Relative errors for scWGS datasets
4.3 Predicted r-species accumulation curves for scWGS datasets

List of Tables

2.1 Mean and SE of the TCR diversity
3.1 Mixture models in simulation studies
3.2 Relative errors and standard errors for simulation models
3.3 Expected coefficients and poles for $\Psi_{r,1}(t)$
3.4 CVs for datasets in large-scale applications
3.5 Shakespeare's word use frequencies
3.6 Predictions for the unique words in Shakespeare's work
3.7 Predictions for Dickens' frequency of word use
3.8 Predictions for the number of Twitter users with multiple followers
3.9 Predictions for the number of nucleotides covered at least r times
4.1 Whole-exome sequencing datasets
4.2 Minimally sufficient number of sequences for WES experiments
4.3 Single-cell whole-genome sequencing datasets
4.4 Optimal number of sequenced bases for scWGS experiments
4.5 Single-cell whole-genome sequencing datasets
E.1 Relative errors for estimating genome coverage in WES
E.2 Relative errors for estimating genome coverage in scWGS

Chapter 1
Introduction

The development of high-throughput DNA sequencing technology has fundamentally changed biological science and has started to impact clinical practice [101]. Although costs have fallen quickly, sequencing experiments remain expensive to conduct, especially large sequencing projects that involve many individuals [5]. This work develops a method to predict the optimal amount of sequencing for single-cell DNA sequencing experiments based on shallow sequencing experiments, which can be performed at low cost. Apart from addressing the issues of cost and size, our statistical method characterizes high-throughput DNA sequencing libraries in terms of coverage and dropout events based on a very small amount of sequencing from pilot studies. Sequencing centers and researchers can easily adopt this method to optimize their experimental designs and, consequently, establish a practical foundation for the development of efficient and cost-effective sequencing protocols.

A certain depth in any high-throughput sequencing experiment is necessary to detect sequencing errors and other technical artifacts in the data. As a foundation for any genomic analysis, a reliable base call requires multiple observations of that position. Again, however, costs, which are proportional to sequencing depth, limit the total amount of sequencing in a given study. Thus, the goal of experimental design is to determine the minimum depth at which each nucleotide in the genome is sufficiently covered. In practice, investigators usually set a threshold $r$ and consider positions covered at least $r$ times to be sufficiently covered. The choice of $r$ is often based on the application and the experience of investigators.

More specifically, sequencing depth measures the average number of reads covering a position in the genome. However, reads are not evenly distributed along the genome. Thus, for an experiment with sequencing depth $r$, some positions in the genome are covered fewer than $r$ times and some positions are covered more than $r$ times.
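To make the gap between depth and per-base coverage concrete, the following minimal sketch (not from the dissertation) assumes the idealized setting, discussed in the next paragraphs, in which per-base coverage is approximately Poisson with mean equal to the sequencing depth; the depth and threshold are hypothetical example values.

```r
# A minimal sketch: if per-base coverage were Poisson with mean equal to the
# sequencing depth, then even at depth r a large fraction of positions would
# still be covered fewer than r times.
depth <- 8                               # hypothetical average sequencing depth
r     <- 8                               # threshold for "sufficiently covered"
1 - ppois(r - 1, lambda = depth)         # fraction of positions covered >= r times (about 0.55)
```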
A more precise measurement is the coverage, or per-base coverage, which is the number of times a nucleotide in the genome is sequenced. In a sequencing experiment, coverage is frequently summarized by a vector $(N_1, N_2, \ldots)$, where $N_j$ is the number of nucleotides in the genome sequenced exactly $j$ times. The number of nucleotides in the genome sequenced at least $r$ times, $S_r$, can then be expressed as
$$S_r = \sum_{j=r}^{\infty} N_j.$$
Therefore, knowing the relation between coverage and sequencing depth is essential to experimental design. For instance, investigators can use this relation to determine the minimum sequencing depth at which every position in the genome is sufficiently covered.

Some basic ideas for modeling coverage originate from the pioneering work of Lander and Waterman [61], who assumed that reads are randomly generated along the genome. The Lander-Waterman model is essential for understanding first-generation sequencing, otherwise known as Sanger sequencing [94]. However, this model cannot account for the overdispersed variance of coverage that typically characterizes high-throughput sequencing experiments. In single-cell DNA sequencing experiments especially, reads display large heterogeneity along the genome owing to dropout events and amplification bias [43].

Similar statistical problems have been widely studied in species sampling, where an initial sample of individuals is drawn from a population. Each individual belongs to one and only one species. Based on the initial sample, a species accumulation curve (SAC) is predicted, which represents the number of species observed in a sample as a function of sample size. In this analogy, each species can be treated as a nucleotide in the genome, and each individual as an observation of that nucleotide. The SAC only concerns the presence of species in a sample. In genomic studies, however, the subjects of interest are nucleotides that are sufficiently covered, i.e., species with sufficient representation in a sample. Therefore, taking multiplicity into consideration, this work aims to extend the SAC so that we can predict the number of species with sufficient representation in a random sample. In particular, based on an initial sample, we predict the $r$-species accumulation curve ($r$-SAC), which is the number of species represented at least $r$ times in a sample as a function of the number of individuals. Note that when $r = 1$, the $r$-SAC is equivalent to the SAC. In addition, we use the shape of the curve to infer the uniformity of the relative abundances of species, and the value of the curve at large sample sizes to infer dropout events.

The bulk of this work expands the theory of predicting the $r$-SAC and its applications in high-throughput sequencing experiments. In particular, we discovered a relation between the SAC and the $r$-SAC that is independent of the relative abundances of species in the population. This relation allowed us to derive smooth estimators of $r$-SACs. Specifically, we applied the relation to obtain a rational function approximation (RFA) for the $r$-SAC based on an initial sample. We then proved the consistency of the estimator, for all $r$, as the size of the initial sample goes to infinity. As a corollary, we showed that the RFA estimator proposed by Daley and Smith [27] for predicting library complexity is consistent.
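As a small illustration of the coverage vector $(N_1, N_2, \ldots)$ and the tail sums $S_r$ defined above, the following base-R sketch uses hypothetical counts.

```r
# N[j] = number of genome positions sequenced exactly j times (hypothetical values)
N <- c(120, 95, 60, 30, 12, 4, 1)
S <- rev(cumsum(rev(N)))                 # S[r] = positions covered at least r times
S[1]                                     # positions covered at all   (S_1 = 322)
S[3]                                     # positions covered >= 3 times (S_3 = 107)
```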
In practice, then, we have introduced a method to predict the optimal amount of sequencing for single-cell DNA sequencing experiments based on shallow sequencing experiments, one which sequencing centers and researchers can easily adopt to optimize their experimental designs. In addition, we have demonstrated that the estimator can be used as a visualization tool to assess the quality of single-cell DNA sequencing libraries.

The remainder of this work is organized as follows. In Chapter 2, various models for predicting the SAC under the mixture model structure are reviewed. In particular, the power of the RFA approach [27] for predicting the SAC is demonstrated through three large-scale biological applications. This review also sets forth the theoretical foundation and provides the practical motivation for the problem of predicting the $r$-SAC. In Chapter 3, the relation between the SAC and the $r$-SAC is studied, and a method to predict the $r$-SAC based on an initial sample is proposed. Chapter 4 leverages the $r$-SAC to visually assess the quality of single-cell DNA sequencing experiments, and a method is proposed to estimate the optimal amount of sequencing for an experiment based on a pilot study consisting of a shallow sequencing experiment.

Chapter 2
Species accumulation curves

In ecology, the species richness, the total number of species or distinct classes, is one of the simplest metrics for understanding the diversity of a population [69]. The task of estimating species richness is well known and has found a broad diversity of applications. For example, ecologists want to know how many species there are on Earth [76]. Other applications include estimating the vocabulary of an author [34], undiscovered observational phenomena in astronomy [51], the size of a criminal population [104], the number of variants in the human genome [57], and the T-cell receptor repertoire [31]. However, unbiased estimation of species richness from observations is often extremely difficult or even unattainable, because no matter how many species have been observed there may exist an arbitrary number of undetected rare species in the population [11]. Therefore, methods in this category mainly concern estimating a lower bound on the species richness. Bunge provides a comprehensive review of this topic [11].

An alternative is to study the number of species as a function of the number of individuals, defined as the species accumulation curve (SAC) [23]. The curve can be used to compare the diversity of populations by standardizing samples among sites [23]. Consider the following problem as motivation. Researchers want to compare the diversities of several populations. In each population, individuals are drawn and their corresponding species are identified, but the total number of individuals observed in the sample can vary across populations. In these sampling experiments, often called "capture-recapture" experiments, the raw data are the individual, or "capture," counts associated with each species. Since in general the number of species observed increases with the number of captured individuals, direct comparison between samples can bias the result. With species accumulation curves, one can make fair comparisons of the number of species for a fixed number of individuals [47].

More importantly, the SAC can predict the number of new species expected in future samples. An associated problem is to evaluate the effectiveness of a sample and decide whether or not to continue the project.
A typical question might be: given the capture profile of an initial sample, if a second sample is drawn from the same population, how many new species can researchers expect to observe? Accurate predictions can help scientists make better decisions and allocate resources more appropriately. In addition, because it is in general impossible to estimate the species richness, the SAC becomes an attractive alternative for assessing the total number of species [22, 53, 99, 83]. As the sample size approaches infinity, the expected number of species converges to the species richness, provided every species is observable. If one can obtain a conservative estimate of the SAC, evaluating it at a large sample size induces a conservative estimate of the species richness [31]. Though most commonly applied in ecological studies, SACs have been used in many other fields, such as linguistics [34], genetics [57], metagenomics [55], and immune repertoires [62].

2.1 Mixture models

Various approaches have been proposed for predicting the SAC, and most of them use a mixture model [31]. Specifically, individuals representing species $i$ are assumed to be sampled according to a Poisson process with rate $\lambda_i$ per unit of time. The parameters $\lambda_i$, for $i = 1, 2, \ldots, L$, can be considered as $L$ independent observations from a latent probability distribution $G(\lambda)$. This latent distribution describes the variety of relative abundances among species in the population. In particular, if the population is homogeneous, with all $\lambda_i$ equal, the distribution $G(\lambda)$ degenerates to a point.

Under this mixture model framework, two groups of approaches have been proposed, distinguished by whether or not they infer the latent distribution $G(\lambda)$. For those that infer the latent distribution, either parametric or nonparametric methods are used to model it. The parametric methods assume a particular parametric form for the latent distribution. As a notable early example, Fisher et al. assumed that relative species abundance followed a latent gamma distribution [39]. Many other parametric distributions, such as the power-law, log-normal, and inverse Gaussian distributions, have also been investigated [10, 12, 97, 116]. One disadvantage of the parametric approaches is that one needs to select the form of the distribution. In many cases no simple parametric form is suitable to explain the data. In other cases, distinct parametric forms may appear to fit the data equally well but exhibit very different extrapolation behaviors [36, 70]. Most nonparametric approaches approximate the underlying relative species abundances with a discrete distribution [6, 82, 108]. It can be shown that, due to the discrete nature of the observed data, the likelihood achieves its maximum at a discrete distribution [66, 67]. However, nonparametric approaches tend to underestimate the number of species due to inadequate sampling effort and the unknown total number of species in the population [107].

Another group of approaches is based on empirical Bayes. These methods estimate the SAC without inferring the latent distribution. For instance, Good and Toulmin [45] derived a nonparametric empirical Bayes estimator of the SAC under a general multinomial model. This estimator, which we call the Good-Toulmin estimator, takes the form of an alternating power series and avoids direct inference of the latent distribution. The method was later extended by Efron and Thisted [34] to a mixture of Poisson distributions.
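A minimal sketch of the classical Good-Toulmin power series may help fix ideas; this is the textbook form of the estimator, truncated at the largest observed frequency, applied to hypothetical frequency counts, and not the implementation used later in this work.

```r
# counts[j] = number of species observed exactly j times in the initial sample
good_toulmin <- function(counts, t) {
  j <- seq_along(counts)
  # observed species plus the alternating Good-Toulmin correction in powers of (t - 1)
  sum(counts) + sum((-1)^(j + 1) * (t - 1)^j * counts)
}
counts <- c(40, 15, 6, 2, 1)             # hypothetical frequency-of-frequencies data
good_toulmin(counts, t = 1.5)            # expected distinct species at 1.5x the sample size
good_toulmin(counts, t = 3)              # for t > 2 the terms grow geometrically, the source
                                         # of the instability discussed below
```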
The Good-Toulmin estimator can accurately predict the number of species when the size of a random sample is increased up to twice the size of the observations. Unfortunately, due to its alternating form, the power series practically diverges when extrapolating beyond twice the size of the observations. This short range of extrapolation makes the estimator useless in most applications. To partially overcome this difficulty, Good and Toulmin suggested using Euler's series transformation. This increases the practical radius of convergence, but the range of extrapolation is still very limited [34]. Solow and Polasky [100] proposed a plug-in estimator, later extended by Shen and Chao [15], to deal with the divergence issue. These approaches rely on estimating the species richness, which is itself very challenging. Their methods also assume homogeneity among the unobserved species, which may not hold in practice.

2.2 Rational function approximation to species accumulation curves

The rational function approximation (RFA) was proposed by Daley and Smith [27] to address the divergence problem observed in the Good-Toulmin estimator. The constructed rational function approximations have two critical properties: (i) the local behavior of the rational function approximation is close to the Good-Toulmin estimator, in the sense that the two have the same Taylor expansion, up to a fixed degree, centered at the size of the observed experiment; and (ii) global stability, obtained by the choice of the degree of the rational function approximation. Previous applications to DNA sequencing libraries showed that the RFA can accurately extrapolate up to one hundred times the size of the observations [27, 28], significantly further than previous methods. One drawback of empirical Bayes methods in general is that they require a large sample size. However, this is not an issue for large-scale biological applications such as DNA sequencing, where the number of individuals, called reads, is usually counted in millions.

Before diving into the theory, we applied RFAs to three large-scale biological applications arising in different contexts: investigating the bacterial species diversity of a metagenomic sample, estimating the size of the immune repertoire through T-cell receptor (TCR) diversity, and examining the k-mer diversity of high-throughput DNA sequencing data for genome assembly problems. The theory behind RFAs is covered in detail in Chapter 3.

2.3 Species accumulation curves in large-scale biological applications

2.3.1 Species richness in a microbiome

Metagenomics is the study of environmental samples by directly sequencing their genetic material [17]. Given a metagenomic dataset, one of the key questions is how diverse the sampled natural community is, or how many bacteria are in the community [55]. If the dataset is deeply sequenced, one could simply count the number of bacteria to approximate the diversity. However, in practice the depth of a sequencing experiment is usually limited by cost. Here we applied RFAs to the problem of predicting the accumulation curve of annotated species in a metagenomic sequencing experiment, and we used the curve to obtain a lower bound on the species richness.

We examined species abundance data collected and processed by Yatsunenko et al. [112], downloaded from MG-RAST with ID 4461119.3 [75]. The data contain 1,712 unique annotated species with a total sampled abundance of 156,608.
Though the abundance of annotated species underestimates the total diversity of the sample, owing to the overwhelming presence of unannotated species in the microbial universe, it can be used as a proxy metric or to compare samples. To test the approach on microbial species diversity, we subsampled a total abundance of 7,830 without replacement as observations, approximately 5% of the full experiment. We compared our estimated curve against the true curve provided by MG-RAST [75]. The true curve is given in terms of the number of sequences; we rescale it under the assumption that the observed abundance of genomic material arising from annotated species is proportional to the number of sequenced reads. Note that if this assumption is incorrect, we should see higher error in our estimates.

When we compared the predicted results with the true species abundance curve, we saw that the RFA approach accurately estimated the species accumulation curve (Figure 2.1a). The estimator predicted 1,631 unique annotated species if a total abundance of 156,608 were sampled, compared to the observed value of 1,712 unique annotated species. This is a difference of less than 5% relative to the total number of unique annotated species.

Figure 2.1: The number of annotated species in the sample as a function of the sample abundance using (a) a 5% subsample and (b) the full observed experiment. The x-axis is the sample abundance and the y-axis is the expected number of unique annotated species. The original curve provided by MG-RAST uses the number of sequences as its x-axis; we convert it to the sample abundance by rescaling.

We applied the same methodology to the full dataset and extrapolated to 10 times the size of the dataset to obtain a lower bound on the species richness. The predicted curve asymptotes relatively quickly, indicating that the observed experiment is nearly saturated. The observed experiment has 1,712 annotated species. A ten-fold increase in the experiment is expected to yield only an additional 423 species, for a total of 2,135 annotated species. This is quite close to the estimated saturation point of approximately 2,230 annotated species (Figure 2.1b). This indicates that the observed experiment is already quite saturated and that significant resources would need to be expended to observe additional annotated species.

2.3.2 Age-related decrease in T-cell receptor repertoire

The T-cell receptor (TCR) is a molecule on the surface of a T cell whose function is to recognize antigens. The diversity of the TCR repertoire is associated with the effectiveness of the immune response to viruses and other pathogens [74]. Although TCRs are encoded by a small set of genes, the genomes of T cells vary greatly due to insertion, deletion, substitution, and recombination of these genes, which makes it difficult to estimate the diversity of the TCR repertoire in an immune system.

Table 2.1: Mean and SE of the TCR diversity. The mean number of unique TCR CDR3 clonotypes and the standard error among donors in each group. Columns 3-4 give the mean and SE (in units of 10^5) based on the observed samples. Columns 5-6 give the mean and SE of the estimated number of unique TCR CDR3 clonotypes if the sample size were increased to 20 times.

group       | No. of donors | observed mean (10^5) | observed SE | extrapolated (20x) mean | extrapolated SE
young       | 11            | 6.4                  | 1.0         | 45.2                    | 17.6
middle-age  | 10            | 5.0                  | 1.3         | 31.5                    | 10.6
aged        | 7             | 3.1                  | 1.0         | 16.5                    | 5.0
long-lived  | 11            | 3.6                  | 0.9         | 17.8                    | 6.9

We are interested in the age-related decrease in TCR diversity. The datasets are profiles of TCR CDR3 repertoires from 39 healthy donors aged 6-90 years, generated by Britanova et al. [8]. Here CDR3 is the third complementarity-determining region, which is the primary site of antigen contact [40]. Each dataset contains around 1 million unique TCR cDNA molecules from one donor. We summarized each dataset as the frequencies of TCR CDR3 clonotypes. Donors are classified into four groups based on age: group 1 is composed of the youngest donors, aged 6-25y; group 2 is middle-aged, with donors aged 34-43y; group 3 contains donors aged 61-66y; and group 4 comprises the eldest donors, aged 71-90y [8, Table 1].

Table 2.1 shows the observed mean number of TCR CDR3 clonotypes and the standard error for each group. Based on a one-tailed t-test, the TCR diversity of the young group is significantly greater than that of the middle-age group (p < 0.01), as is the difference between the middle-age group and the aged group (p < 0.01). There is no significant difference in TCR diversity between the aged group and the long-lived group (p = 0.3413, two-tailed t-test). All of these results are consistent with the conclusions of Britanova et al. [8].

To visualize the result, we first constructed the SAC for each donor by interpolation (Figure 2.2a). Species in this study refer to TCR CDR3 clonotypes, and individuals are TCR cDNA molecules. The SAC is the expected number of TCR CDR3 clonotypes as a function of the number of TCR cDNA molecules. For each group of donors we define a species accumulation region as the interval formed by the 30% and 70% quantiles of the SACs. It is clear that the diversity of TCR CDR3 clonotypes decreases with the age of the group. The young group and the middle-age group are distinguished from each other and from the other two groups through their species accumulation regions. On the other hand, the median SACs of group 3 and group 4 are almost identical and their species accumulation regions overlap, showing no significant difference in TCR diversity.

One unexpected observation is the standard error (SE) for the young group: it is smaller than the SE of the middle-age group. If TCR diversity is greater in the young group than in the middle-age group, one would expect its SE to be larger as well. We hypothesize that the small SE is due to insufficient sample size. For each donor, we used the nonparametric approach of Good [46] to estimate the probability that a TCR CDR3 clonotype is represented in the sample. This probability is known as the sample coverage [46]. On average the sample coverage in the young group is only 0.44. For groups 2-4, the mean sample coverages are 0.59, 0.76, and 0.73. Since unseen clonotypes still account for more than 20% of the probability, we asked whether increasing the sample size could affect the previous conclusions. To answer this question, we used the RFA estimator to extrapolate the SACs to 20 times the observed sample size (Figure 2.2b). Based on this estimation, the mean probability that a TCR CDR3 clonotype is represented in the sample becomes 0.91 for the young group. All of the previous conclusions can be seen in Figure 2.2b.
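The sample coverage values quoted above come from Good's estimator, which in its simplest form is one minus the fraction of the sample made up of singletons; the sketch below applies this standard formula to hypothetical clonotype counts.

```r
# freqs[i] = number of times clonotype i was observed in the sample (hypothetical)
sample_coverage <- function(freqs) 1 - sum(freqs == 1) / sum(freqs)
freqs <- c(rep(1, 5500), rep(2, 1500), rep(3, 400), rep(10, 50))
sample_coverage(freqs)                   # about 0.46; the remaining probability mass
                                         # is attributed to unseen clonotypes
```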
The accumulation regions of the middle-age group and the aged group become more separable than in the interpolation results. The widths of the species accumulation regions also look very different from the interpolation results (Figure 2.2a and Figure 2.2b). The width reflects the SE of TCR diversity among donors in a group. The interpolated accumulation region widths show no appreciable difference among groups. However, using the predictions from the RFA, the width of the accumulation region for the young group is expected to be much larger than for the other groups if the experiment were continued (Figure 2.2b). In the observed experiment the young group has a smaller standard error in diversity than the middle-age group, but when extrapolated out to 20 million total TCR cDNA molecules, the estimated diversity of the young group has nearly 1.7 times the standard error of the middle-age group.

Figure 2.2: Age-related decrease in TCR repertoire. (a) Interpolation of the accumulation region of TCR CDR3 clonotypes for each group. (b) Extrapolation of the accumulation region of TCR CDR3 clonotypes for each group. (c) Predicting the total number of TCR CDR3 clonotypes as a function of the number of TCR cDNA molecules by combining datasets from all groups.

One possible interpretation, in line with the observations of Wedderburn et al. [111], is that this may indicate high variability in the immune repertoire and the presence of a large population of rare TCR clonotypes in youth, prior to selection due to exposure to pathogens. For conclusive results, a larger sample of the youth immune repertoire might be required.

To validate the accuracy of extrapolating up to 20 times, we combined all 39 datasets. One million cDNA molecules were sampled without replacement from the combined dataset as the observations. Extrapolating out to 20 million total molecules, we predicted 8.97 million unique clonotypes compared to an expected value of 9.73 million. This represents a relative error of less than 10% when predicting out to 20-fold the observations.

2.3.3 k-mer diversity in genome assembly applications

Many state-of-the-art genome assembly algorithms leverage De Bruijn graphs constructed from k-mers extracted from the sequenced reads [85]. To use these graphs effectively, the assembly algorithms require a sufficient fraction of the k-mers from the underlying genome [24]. Clearly, deeper sequencing will ensure that more k-mers have been sampled, but the rate at which deeper sequencing reveals more k-mers, and the reliability of such k-mers, is unknown. Here we show how SACs can provide information about the diversity of sequenced k-mers as a function of total bases sequenced.

We selected a dataset from Assemblathon 2 [7]: whole-genome sequencing of a male budgerigar (also known as the common pet parakeet, Melopsittacus undulatus), Sequence Read Archive accession number ERX218679. This experiment contains approximately 161 million read pairs (150 bp in length), for a total of approximately 48 billion sequenced bases.
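Since the extrapolations below are expressed in total sequenced 31-mers rather than reads, the conversion between reads, bases, and k-mers is worth noting; the following back-of-the-envelope sketch assumes untrimmed, error-containing 150 bp reads and is illustrative only, not a value reported in this work.

```r
k        <- 31
read_len <- 150
reads    <- 2 * 161e6                    # roughly 161 million read pairs
kmers_per_read <- read_len - k + 1       # 120 distinct positions for a 31-mer in a 150 bp read
c(total_bases = reads * read_len,        # about 4.8e10 sequenced bases
  total_kmers = reads * kmers_per_read)  # about 3.9e10 sequenced 31-mers
```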
We subsampled two experiments from the sequenced reads, randomly downsampling approximately 1% and 10% of the reads from the full experiment. This resulted in 3.23 million and 32.26 million reads, respectively. We examined k-mers for k = 31, since this is the default setting for the widely used assembly algorithm Velvet [115]. We counted the k-mer occurrences using Jellyfish [72] and used the 31-mer counts from the subsampled experiments to extrapolate the number of distinct 31-mers as a function of total sequenced 31-mers (or equivalently, total bases sequenced).

For the 1% downsampled experiment, the estimated number of distinct 31-mers for the full experiment is 4.24 billion, compared to the 5.66 billion observed. On the other hand, if the extrapolation is limited to 10 times, the value estimated by the RFA is 2.08 billion compared to an expected value of 2.07 billion, a relative difference of less than 1%. For the 10% downsampled experiment, the estimated number of distinct 31-mers in the full experiment is 5.3 billion, showing a slight decrease in accuracy with the increase in sample size when extrapolating to the same relative size. We should expect the accuracy of estimates at the same relative extrapolation size to increase with increasing sample size, based on the extra information contained in the larger experiment. Instead, we observed the opposite (Fig 2.3a). The RFA closely approximated the curve when the number of total 31-mers was relatively small, but failed to capture the behavior of the tail of the curve.

We noted that the curve of distinct 31-mers plotted against the total number of observed 31-mers appears to be linear in the limit (Fig 2.3). This can be explained by the presence of random errors in the sequenced reads. The number of distinct 31-mers in the budgerigar genome is bounded by the size of the genome, approximately 1.2 gigabases. Indeed, k-mers in a genome are far from uniformly distributed [89], resulting in far fewer distinct k-mers than the theoretical maximum, which is bounded above by the genome size. On the other hand, random errors in the sequencing process can theoretically produce any 31-mer, for a total of approximately 4.6 x 10^18 possibilities, many orders of magnitude larger than current sequencing experiments. This leads us to hypothesize that there is a large tail in the population of observable 31-mers, mostly due to sequencing error.

Figure 2.3: The number of distinct 31-mers as a function of sequenced 31-mers using 1% and 10% subsamples. (a) Predicting the SAC using a rational function that has the same degrees in both numerator and denominator; the estimator is asymptotically equivalent to a constant as sequencing effort approaches infinity. (b) Predicting the SAC using a rational function in which the difference of degrees between numerator and denominator is 1; the estimator is asymptotically equivalent to a linear function as sequencing effort approaches infinity.

We reasoned that using a different order of rational function approximation might increase the accuracy. In the applications presented in earlier sections, the population was known to be finite. Consequently, we chose rational function approximations that are constant in the limit.
However, we can also choose a rational function approximation that is linear in the limit, by taking the estimate to be of the form
$$(t-1)\,\frac{p_0 + p_1(t-1) + \cdots + p_M(t-1)^M}{1 + q_1(t-1) + \cdots + q_M(t-1)^M},$$
in which the degree of the numerator exceeds that of the denominator by one. In the context of counting k-mers in a sequencing dataset, using a linear limit improved the accuracy of the estimates dramatically. Fig 2.3 shows that this alteration predicted 5.55 billion 31-mers in the full experiment for the 10% downsampled experiment and 5.22 billion for the 1% downsampled experiment, when extrapolating to the full experiment.

The asymptotically linear extrapolation may be of interest for populations that are infinite or have extremely long tails, with a vast number of species of extremely small abundance. In such cases the asymptotic linear slope is related to the rate of discovery of the extremely rare species. In the k-mer counting application, the eventual linear behavior is driven by sequencing errors generating a practically limitless supply of k-mers that could be sequenced.

2.4 Conclusion and discussion

As high-throughput technologies improve, researchers will increasingly face the difficult problem of making inferences about unknown and highly variable populations. When the properties of the population are unknown, an empirical Bayes approach may be appropriate. Here we investigated three applications of the RFA to data arising from next-generation sequencing experiments. These applications demonstrate the breadth of data analysis contexts in which SACs can help in understanding the underlying populations from which data have been sampled.

Large-scale applications present new challenges to traditional capture-recapture statistics, particularly since the scale of the data is orders of magnitude larger than classical ecological capture-recapture experiments. Algorithms are required that are both scalable and able to accommodate arbitrary heterogeneity in order to ensure accurate inference. The nonparametric empirical Bayes estimator of Good and Toulmin is ideal for accommodating unknown heterogeneity. This estimator has been applied directly to estimate both microbial species richness [60] and TCR diversity [90]. In both cases the extrapolation region was limited to a two-fold increase in the size of the existing dataset. We have demonstrated the applicability of the RFA to both of these situations and shown that this approach does not suffer from a tightly constrained range of extrapolation. In large-scale applications, such a tightly constrained range of extrapolation has the potential to affect conclusions drawn about the sampled populations, making the RFA approach extremely valuable.

Among the applications we have surveyed, the estimation of k-mer diversity has an interesting property: sequencing errors can, in theory, produce any possible k-mer. The underlying population is therefore practically infinite. In finite populations, large extrapolations of the species accumulation curve can be considered a conservative estimate of the "species richness" [22]. But when the underlying population is infinite, the concept of species richness becomes meaningless, and different approaches are required to understand heterogeneity in the population. By modifying the form of the rational functions used to approximate the Good-Toulmin power series, we showed how the RFA approach can model SACs that are asymptotically linear.

Despite the success of the RFA approach in many large-scale biological applications, there are a few issues with this approach.
First, the theory behind the RFA is largely missing, owing to its complex expression. Second, we observed that the performance of the RFA relies heavily on the degrees of the numerator and the denominator, and the criteria for choosing a good form of the rational function are unclear. Last but not least, biologists are usually more concerned with the fraction of the genome that has sufficient coverage than with the fraction that is covered at all. For example, in SNV calling only genomic positions aligned with a sufficient number of reads are examined for possible variants. Therefore, in the next chapter, we study the following questions: Is the rational function approximation to the Good-Toulmin power series a consistent estimator? Under what conditions does the RFA work well; in particular, for what kinds of population distributions does the RFA approach predict the SAC accurately? What criteria should be used for selecting an appropriate form of the rational function? And how can the RFA approach be extended so that it can be used to predict the number of species represented at least $r$ times in a random sample for $r > 1$?

Chapter 3
A generalization of the species accumulation curve

A random sample of $N$ individuals is obtained from a population after trapping for one unit of time. Each individual belongs to exactly one species, and the total number $L$ of species in the population is finite but not known. Let $N_j$ be the number of species represented by exactly $j$ individuals in this sample, so that $N = \sum_{j \ge 1} j N_j$. The number of species represented $r$ or more times in the initial sample is $S_r = \sum_{j \ge r} N_j$. Imagine that a second sample is obtained after trapping for $t$ units of time from the same population. The time $t > 1$ should bring to mind a "scaled up" experiment. This second sample may take the form of an expansion of the initial sample, but may also be a separate sampling experiment, as long as the second sample is representative of the first. We are concerned with predicting the expected number $E[S_r(t)]$ of species represented at least $r$ times in the second sample.

Related inference problems have been the focus of much statistical development, with canonical applications in linguistics and ecology [34, 39, 46, 116]. Zipf [116, 117] was interested in the distribution of word frequencies in random texts, which led to the theory of "the principle of least effort". Fisher et al. [39] studied the relation between the number of species and the number of individuals in a random sample; Fisher's approach is still widely used to describe capture-recapture experiments. When plotted as a function of $t$, the function $S_1(t)$ is called the species accumulation curve (SAC) [22]. As I have shown in Chapter 2, the SAC can be used to predict the number of new species expected in a future sample. A typical question might be: given the capture profile of a previous sample, if another sample is drawn from the same population, how many new species would one expect to observe in the second sample? Accurate predictions of the SAC can help scientists evaluate the future sample and allocate resources more appropriately.

The quantity $S_1(t)$ may not be of sufficient utility when the questions of interest involve "common species" [87, 84]. In such cases the parameter $r > 1$ in $S_r(t)$ can naturally be applied to distinguish commonness from rarity. In evaluating Twitter data, Huberman et al. [54] focused on users with at least $r = 2$ posts, who were considered "active" users. Tarazona et al.
[102] were interested in genes represented by more than $r = 5$ sequenced reads. Ng et al. [81] filtered out single nucleotide polymorphisms (SNPs) covered by fewer than $r = 8$ sequenced reads. And Google Scholar uses the number of publications cited at least $r = 10$ times by others (the "i10-index") to measure scholarly influence. In each of these cases a fixed $r > 1$ was used to define those "species" of interest, having sufficient multiplicity of representation in the sample. To distinguish $S_r(t)$ from $S_1(t)$, we call $S_r(t)$ an $r$-species accumulation curve ($r$-SAC). For convenience, we use the terms "SAC" and "$r$-SAC" to refer to their expectations $E[S_1(t)]$ and $E[S_r(t)]$, unless we explicitly say otherwise.

The curvature of the $r$-species accumulation curve gives one perspective on why estimating $r$-SACs might be more challenging than estimating SACs. As a special case of the $r$-SAC, the species accumulation curve has certain obvious properties that are unique to $r = 1$. The SAC is monotone and concave, corresponding to a positive first derivative and a negative second derivative. The authors of [6] noted that the SAC has higher-order derivatives with alternating signs, and claimed that retaining this property is important in estimators of the SAC. However, these properties do not hold for $r$-species accumulation curves when $r > 1$. In particular, $r$-SACs need not be concave and may contain inflection points, as is evident from the example in Figure 3.1. Consider an extreme scenario: one cannot observe any species $r$ times until at least $r$ observations have been made. There will thus be a period of increasing returns early in the sampling process.

Figure 3.1: The number of species represented by at least $r$ individuals as a function of time, for $r = 1, 2, 4, 8, 16$. The curves were generated by a homogeneous model with $\lambda_1 = \lambda_2 = \cdots = \lambda_{100} = 0.5$ and population size $L = 100$. After trapping for one unit of time, 50 individuals are expected to be captured. Time $t$ ranges up to (a) $t = 10$ and (b) $t = 60$.

Extrapolating the $r$-SAC based on an initial sample seems more difficult when $r > 1$. In the example of Figure 3.1a, the SAC appears flat after 10 units of time, suggesting that the sample is saturated. However, for $r = 16$, barely any species are represented at least $r$ times after 10 units of time, leading to a very different flat curve. Visually inspecting the shape of the 16-SAC before 10 units of time (Figure 3.1a) seems to provide very little information about its shape after 20 units (Figure 3.1b).

Recall that we model the frequencies of species in a sample using a mixture of Poisson distributions [49, 34]. Individuals representing species $i$ are assumed to be captured according to a Poisson process with rate $\lambda_i$ per unit of time. The $\lambda_i$, for $i = 1, 2, \ldots, L$, can be considered as $L$ independent observations from a latent probability distribution $G(\lambda)$. In Chapter 2, I discussed various approaches for predicting the SAC. The empirical Bayes approach seems to be a good choice. In particular, the RFA proposed by Daley and Smith has been successfully applied to predict library complexity [27] and in other large-scale biological applications [28, 31]. However, this approach does not directly extend to $r > 1$ [26].

In this chapter we describe a new approach to estimating the $r$-SAC. We first derive a relation between the $r$-SAC and the higher-order derivatives of the average discovery rate, defined as $E[S_1(t)]/t$.
The second contribution is a simple form of estimator that results from applying rational function approximation to the average discovery rate. These estimators can be evaluated directly as a function of $r$. We show that these estimators converge in both $r$ and $t$, and are strongly consistent as $N$ goes to infinity. Extensive simulation studies suggest that our proposed estimators have superior performance in heterogeneous populations. Applications to real data from linguistics, social networks, and DNA sequencing confirm the accuracy of our proposed estimator and demonstrate the value of this new approach.

3.1 Relating accumulation curves of first and higher orders

Let $N_j$ denote the number of species represented exactly $j$ times in an initial sample after trapping for one unit of time, $j = 1, 2, \ldots$ Clearly $N_0$ is not observable. Let $N_j(t)$ be the random variable whose value is the number of species represented exactly $j$ times after trapping for $t$ units of time. The number $S_r(t)$ of species represented at least $r$ times, as a function of $t$, can be written as
$$S_r(t) = \sum_{j=r}^{\infty} N_j(t) = S_1(t) - \sum_{j=1}^{r-1} N_j(t). \tag{3.1}$$
We aim to estimate the expectation of $S_r(t)$, using information from the $N_j$.

From our Poisson mixture assumption, the expected number of species after trapping for $t$ units of time can be expressed as
$$E[S_1(t)] = L \int \big(1 - \exp(-\lambda t)\big)\, dG(\lambda).$$
Taking the $j$th derivative of $E[S_1(t)]$, we have
$$\frac{d^j}{dt^j} E[S_1(t)] = (-1)^{j-1} L \int \lambda^j \exp(-\lambda t)\, dG(\lambda).$$
Note that the expected value of $N_j(t)$ is
$$E[N_j(t)] = L \int \frac{(\lambda t)^j \exp(-\lambda t)}{j!}\, dG(\lambda) = \frac{t^j}{j!}\, L \int \lambda^j \exp(-\lambda t)\, dG(\lambda).$$
By comparing the above expression with the $j$th derivative of $E[S_1(t)]$, we obtain
$$E[N_j(t)] = (-1)^{j-1} \frac{t^j}{j!} \frac{d^j}{dt^j} E[S_1(t)], \tag{3.2}$$
which has been noted previously [58]. Taking the expectation on both sides of equation (3.1), we have
$$E[S_r(t)] = E[S_1(t)] - \sum_{j=1}^{r-1} E[N_j(t)].$$
By replacing the $E[N_j(t)]$ in the above equation with the $j$th derivative of $E[S_1(t)]$ from equation (3.2), we obtain a relation between $E[S_1(t)]$ and $E[S_r(t)]$. This is the foundation of our estimator, and a proof can be found in the supplementary materials (Appendix A.1).

THEOREM 3.1. Let $S_r(t)$ denote the number of species represented at least $r$ times after $t$ units of time. For any positive integer $r$,
$$E[S_r(t)] = \frac{(-1)^{r-1} t^r}{(r-1)!}\, \frac{d^{r-1}}{dt^{r-1}} \left( \frac{E[S_1(t)]}{t} \right). \tag{3.3}$$

Thus we have established a direct relation between the SAC and the $r$-SAC. The quantity $E[S_1(t)]/t$ in equation (3.3) contains information sufficient for determining $E[S_r(t)]$, and allows us to derive a formula for $E[S_r(t)]$ whenever we are given a smooth expression for $E[S_1(t)]$. We call the ratio $E[S_1(t)]/t$ the average discovery rate, as it reflects the average rate at which new species are discovered per unit of time. The average discovery rate turns out to have some convenient properties that we exploit in Section 3.2. One clear application of Theorem 3.1 is to generalize existing nonparametric estimators of the SAC to obtain estimators of the $r$-SAC. We first demonstrate Theorem 3.1 by applying it to simple parametric forms. In the homogeneous model all $\lambda_i$ are equal, with $\lambda_i = \lambda$ for $i = 1, 2, \ldots, L$, so
$$E[S_1(t)] = L\big(1 - \exp(-\lambda t)\big).$$
After introducing the above expression into equation (3.3), repeatedly differentiating the quotient reveals a familiar sum:
$$E[S_r(t)] = L\left(1 - \sum_{i=0}^{r-1} \frac{(\lambda t)^i \exp(-\lambda t)}{i!}\right).$$
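As a quick numerical sanity check of the homogeneous-model formula (a sketch with arbitrary example values of $L$, $\lambda$, $r$, and $t$; under this model each species' count at time $t$ is Poisson with mean $\lambda t$):

```r
L <- 100; lambda <- 0.5; r <- 4; t <- 10
closed_form <- L * (1 - ppois(r - 1, lambda * t))   # L * P(Poisson(lambda * t) >= r)
set.seed(1)
sim_counts <- rpois(L, lambda * t)                  # one simulated sample of L species
c(closed_form = closed_form, simulated = sum(sim_counts >= r))
```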
In the negative binomial population model, $\lambda_i \sim \mathrm{Gamma}(\alpha, \beta)$ with $\alpha$ and $\beta$ positive,
$$E[S_1(t)] = L\left(1 - (1 + \beta t)^{-\alpha}\right).$$
Applying Theorem 3.1 and the general Leibniz rule reveals the negative binomial coefficients:
$$E[S_r(t)] = L\left(1 - \sum_{i=0}^{r-1} \frac{\Gamma(i+\alpha)}{\Gamma(i+1)\,\Gamma(\alpha)} \left(\frac{\beta t}{1+\beta t}\right)^{i} \left(\frac{1}{1+\beta t}\right)^{\alpha}\right).$$

3.2 Rational function approximation to r-species accumulation curves

Here we leverage the technique of Padé approximants to build a nonparametric estimator for the $r$-SAC. A Padé approximant is a rational function whose Taylor expansion agrees with the power series of the function it approximates up to a specified degree [3]. In this sense, Padé approximants are rational functions that optimally approximate a power series. This method was successfully applied to construct an estimator of the SAC, using Padé approximants to the Good-Toulmin power series [31]. Padé approximants are effective because they converge in practice when the Good-Toulmin power series does not, yet within the applicable range of the Good-Toulmin power series ($t < 2$) the two functions remain close. We apply the same strategy, beginning with the average discovery rate. This leads to an expression that simplifies the formula of Theorem 3.1, yielding a new and practical nonparametric estimator for the $r$-SAC.

Our first step is to obtain a power series representation for the average discovery rate $E[S_1(t)]/t$ in terms of the $S_i$. A proof of the following result can be found in the Appendix (Section A.2).

LEMMA 3.1. If $0 < t < 2$, then
$$\frac{E[S_1(t)]}{t} = \sum_{i=0}^{\infty} (-1)^i (t-1)^i\, E[S_{i+1}]. \tag{3.4}$$

Replacing expectations with the corresponding observations, we obtain an unbiased power series estimator of the average discovery rate:
$$\Phi(t) = \sum_{i=0}^{\infty} (-1)^i (t-1)^i\, S_{i+1}. \tag{3.5}$$
This power series estimator $\Phi(t)$ serves as a bridge between the observed data $S_i$ and the Padé approximant for $E[S_1(t)]/t$, which cannot be obtained directly. The Padé approximant for $E[S_1(t)]/t$ is defined by its behavior around $t = 1$, which is the region where $E[S_1(t)]/t$ is close to $\Phi(t)$. Note that in principle we could directly substitute the estimated power series $\Phi(t)$ for the average discovery rate to obtain an unbiased power-series estimator of $E[S_r(t)]$. Unfortunately, this estimator practically diverges for $t > 2$, due to the small radius of convergence of the power series and the use of a truncated power series to approximate it (see the discussion in Appendix B).

Although Padé approximants to a given function can have any combination of degrees for the numerator and denominator polynomials, we consider only the subset for which the degree of the denominator exceeds that of the numerator by 1. This choice permits these rational functions to mimic the long-term behavior of the average discovery rate, which should approach $L/t$ for large $t$.

Let $P_{m-1}(t)/Q_m(t)$ denote the Padé approximant to the power series $\Phi(t)$ with numerator degree $m-1$ and denominator degree $m$. According to the formal determinant representation [3],
$$\frac{P_{m-1}(t)}{Q_m(t)} = \frac{a_0 + a_1(t-1) + \cdots + a_{m-1}(t-1)^{m-1}}{b_0 + b_1(t-1) + \cdots + b_m(t-1)^m}
= \frac{\begin{vmatrix}
(-1)^0 S_1 & (-1)^1 S_2 & \cdots & (-1)^{m-1} S_m & (-1)^m S_{m+1} \\
(-1)^1 S_2 & (-1)^2 S_3 & \cdots & (-1)^m S_{m+1} & (-1)^{m+1} S_{m+2} \\
\vdots & \vdots & & \vdots & \vdots \\
(-1)^{m-1} S_m & (-1)^m S_{m+1} & \cdots & (-1)^{2m-2} S_{2m-1} & (-1)^{2m-1} S_{2m} \\
0 & (-1)^0 S_1 (t-1)^{m-1} & \cdots & \displaystyle\sum_{i=0}^{m-2} (-1)^i S_{i+1} (t-1)^{i+1} & \displaystyle\sum_{i=0}^{m-1} (-1)^i S_{i+1} (t-1)^i
\end{vmatrix}}{\begin{vmatrix}
(-1)^0 S_1 & (-1)^1 S_2 & \cdots & (-1)^{m-1} S_m & (-1)^m S_{m+1} \\
(-1)^1 S_2 & (-1)^2 S_3 & \cdots & (-1)^m S_{m+1} & (-1)^{m+1} S_{m+2} \\
\vdots & \vdots & & \vdots & \vdots \\
(-1)^{m-1} S_m & (-1)^m S_{m+1} & \cdots & (-1)^{2m-2} S_{2m-1} & (-1)^{2m-1} S_{2m} \\
(t-1)^m & (t-1)^{m-1} & \cdots & (t-1) & 1
\end{vmatrix}}. \tag{3.6}$$
The above representation allows us to reason algebraically about the existence of the desired Padé approximant to $\Phi(t)$ for a given initial sample. Define the Hankel determinants
$$\Delta_{i,j} = \begin{vmatrix}
S_{i-j+2} & S_{i-j+3} & \cdots & S_{i+1} \\
S_{i-j+3} & S_{i-j+4} & \cdots & S_{i+2} \\
\vdots & \vdots & & \vdots \\
S_{i+1} & S_{i+2} & \cdots & S_{i+j}
\end{vmatrix}_{j \times j}, \tag{3.7}$$
with $S_k = 0$ for $k < 1$. A proof of the next lemma is given in the Appendix (Section A.2).

LEMMA 3.2. If the determinants $\Delta_{m-1,m}$ and $\Delta_{m,m}$ are nonzero, there exist real numbers $a_i$ and $b_j$, for $i = 0, \ldots, m-1$ and $j = 1, 2, \ldots, m$, with $b_m \neq 0$, such that the rational function
$$\frac{P_{m-1}(t)}{Q_m(t)} = \frac{a_0 + a_1(t-1) + \cdots + a_{m-1}(t-1)^{m-1}}{1 + b_1(t-1) + \cdots + b_m(t-1)^m}$$
satisfies
$$\Phi(t) - \frac{P_{m-1}(t)}{Q_m(t)} = O\left((t-1)^{2m}\right), \tag{3.8}$$
and all $a_i$ and $b_j$ are uniquely determined by $S_1, S_2, \ldots, S_{2m}$.

In what follows we assume that the denominators of all rational functions of interest have simple roots. In practice we do not encounter $Q_m(t)$ with repeated roots, and in the appendix we show how this assumption can be removed (Section A.2).

THEOREM 3.2. Let $m$ be a positive integer. If both determinants $\Delta_{m-1,m}$ and $\Delta_{m,m}$ are nonzero, then there exist complex numbers $c_i$ and $x_i$, uniquely determined by $S_1, \ldots, S_{2m}$, such that for all $1 \le r \le 2m$,
$$\Psi_{r,m}(t) = \sum_{i=1}^{m} c_i \left(\frac{t}{t - x_i}\right)^r \tag{3.9}$$
satisfies $\Psi_{r,m}(1) = S_r$.

Proof. The assumptions that $\Delta_{m-1,m} \neq 0$ and $\Delta_{m,m} \neq 0$ imply that the Padé approximant $P_{m-1}(t)/Q_m(t)$ exists in correspondence with $\Phi(t)$. Substituting $P_{m-1}(t)/Q_m(t)$ in place of the average discovery rate in equation (3.3), we define
$$\Psi_{r,m}(t) = \frac{(-1)^{r-1} t^r}{(r-1)!}\, \frac{d^{r-1}}{dt^{r-1}} \left(\frac{P_{m-1}(t)}{Q_m(t)}\right). \tag{3.10}$$
By the definition of the Padé approximant, we have
$$\Phi(t) - \frac{P_{m-1}(t)}{Q_m(t)} = O\left((t-1)^{2m}\right). \tag{3.11}$$
Taking derivatives of $\Phi(t)$ at $t = 1$, for $j = 0, 1, \ldots, 2m-1$,
$$\frac{d^j}{dt^j} \frac{P_{m-1}(1)}{Q_m(1)} = \frac{d^j}{dt^j} \Phi(1) = (-1)^j j!\, S_{j+1}.$$
Therefore, for any $r = 1, 2, \ldots, 2m$,
$$\Psi_{r,m}(1) = \frac{(-1)^{r-1}}{(r-1)!}\, \frac{d^{r-1}}{dt^{r-1}} \frac{P_{m-1}(1)}{Q_m(1)} = \frac{(-1)^{r-1}}{(r-1)!}\, (-1)^{r-1} (r-1)!\, S_r = S_r. \tag{3.12}$$
Now we show that $\Psi_{r,m}(t)$ defined in (3.10) can be expressed in the desired form (3.9). Let $x_1, \ldots, x_m$ be the distinct roots of $Q_m(t) = 0$. We can write $P_{m-1}(t)/Q_m(t)$ as
$$\frac{P_{m-1}(t)}{Q_m(t)} = \sum_{i=1}^{m} \frac{c_i}{t - x_i}, \tag{3.13}$$
where the $c_i$ are the coefficients of the partial fraction decomposition. The required derivatives take a convenient form:
$$\frac{d^{r-1}}{dt^{r-1}} \frac{c_i}{t - x_i} = \frac{(-1)^{r-1} (r-1)!\, c_i}{(t - x_i)^r}.$$
By substituting these derivatives into (3.10) we arrive at
$$\Psi_{r,m}(t) = \frac{(-1)^{r-1} t^r}{(r-1)!}\, \frac{d^{r-1}}{dt^{r-1}} \left(\sum_{i=1}^{m} \frac{c_i}{t - x_i}\right) = \sum_{i=1}^{m} c_i \left(\frac{t}{t - x_i}\right)^r. \tag{3.14}$$
Finally, the uniqueness of the coefficients $c_i$ and the roots $x_i$ follows from the uniqueness of the Padé approximant $P_{m-1}(t)/Q_m(t)$, which is a function of $S_i$, $i = 1, \ldots, 2m$.

The function $\Psi_{r,m}(t)$ in Theorem 3.2 is a nonparametric estimator for the $r$-SAC. Of note, the coefficients $c_i$ and poles $x_i$ are independent of $r$: once determined, they can be used to directly evaluate $\Psi_{r,m}(t)$ for any $r$. The estimator $\Psi_{r,m}(t)$ has some favorable properties, summarized in the following proposition, with proofs given in the appendix (Section A.3).

PROPOSITION 3.1. (i) The estimator $\Psi_{r,m}(t)$ is unbiased for $E[S_r(t)]$ at $t = 1$ for $r \le 2m$. (ii) The estimator $\Psi_{r,m}(t)$ converges as $t$ approaches infinity. In particular,
$$\lim_{t \to \infty} \Psi_{r,m}(t) = \frac{\Delta_{m-1,m+1}}{\Delta_{m,m}}.$$
(iii) Assume that an initial sample is obtained after trapping for $t_0$ units of time, where $t_0$ belongs to $\{1, 2, \ldots\}$. The estimator $\Psi_{r,m}(t)$ based on this sample is strongly consistent as $t_0$ goes to infinity.
Both determinants m1;m and m;m become 0 whenS j =L forj 2m andm> 1, so the determinant representation of the Pad´ e approximant (3.6) is ill-defined in such cases. However, the Pad´ e approximant itself remains valid and reduces to L=t for t > 0 (see Section A.3 in the appendix). 3.3 An algorithm for estimator construction The definition of r;m (t) in the statement of Theorem 3.2 gives the form of the estimator, and connects it with the observed data from the initial experiment. In practice one faces several choices when constructing this estimator – from deciding on an appropriate value of m to selecting the robust numerical procedures. 3.3.1 Conditions for well-behaved rational functions The choice ofm controls the degree of both the numerator and the denominator in the Pad´ e approx- imant, and determines the amount of information from the initial sample that is used by r;m (t). In principlem should be selected sufficiently large so that the estimator r;m (t) can explain the complexity of the latent distributionG(). However, a larger value ofm leads to more poles in the estimator r;m (t) and makes instability more likely. In practice, the stability of the estimators depends on the locations of poles. For example, if any pole x i resides on the positive real axis, then r;m (t) is unbounded in the neighborhood ofx i and becomes ill-defined att = x i . Here we give a sufficient condition to stabilize the estimator so that it is well-defined and bounded fort 0 andr 1. Moreover, this condition ensures that asr approaches infinity, the estimator r;m (t) approaches zero for fixedt. NoteRe(x) is the real part ofx. 27 PROPOSITION 3.2. If Re(x i ) < 0 for 1 i m, then r;m (t) is bounded for anyt 0 and r 1. Further, r;m (t)! 0 asr!1 for any 0t<1. Proof. IfRe(x i )< 0 then for any givent2 [0;1) and anyx i , ktx i k 2 =t 2 (x i + x i )t +x i x i >t 2 : Thereforekt=(tx i )k< 1, and k r;m (t)k = m X i=1 c i t tx i r m X i=1 kc i k t tx i r m X i=1 kc i k: So r;m (t) is bounded by the sum of absolutec i for anyr. Asr approaches infinity, we obtain lim r!1 m X i=1 c i t tx i r m X i=1 kc i k lim r!1 t tx i r = 0: Remark. It is not unusual to constrain roots in such a way to ensure stability. For example, the Hurwitz polynomials, which has all zeros located in the left half-plane of the complex plane, are used as a defining criterion for a system of differential equations to have stable solutions. 3.3.2 The construction algorithm Algorithm 1 provides a complete procedure for constructing our estimator beginning with the observed countsN j , and satisfying the conditions outlined above. This procedure requires speci- fying a maximum value ofm, but also leaves room for using more effective numerical procedures at each step. The following points address a few numeric details in the algorithm. Settingm max : if the largest non-zero count in the initial experiment isN j , the value ofm can be chosen up tobj=2c. In our applications of interest this value can be extremely large (see Section 3.5), so we begin with a smaller value (we default tom max = 10). 28 Algorithm 1 Given a set of observed countsfN j g, withN 1 ;N 2 > 0, and a maximal value ofm max , produce the stable and increasing estimator r;m (t) for maximalmm max . 1: Compute sumsS i = P ji N j , fori = 1;:::; 2m max . These are coefficients that define(t). 2: Compute the coefficients of the degree 2m max continued fraction approximation to (t) by applying the quotient-difference algorithm. 
3: form m max to 1 do 4: Obtain the Pad´ e approximantP m1 (t)=Q m (t) by evaluating the 2m-th convergent (trunca- tion) of the continued fraction. 5: Obtain the rootsx i , fori = 1;:::;m, of the denominatorQ m (t). 6: Calculate coefficientsc i by partial fraction decomposition ofP m1 (t)=Q m (t). 7: ifRe(x i )< 0 for allx i and 1;m (t) is increasing then 8: return coefficients (c 1 ;:::;c m ) and roots (x 1 ;:::;x m ). 9: end if 10: end for Continued fraction: all the coefficients in the constructed continued fraction must be nonzero. If thel-th coefficient happens to be zero, only the firstl 1 nonzero coefficients are used when constructing Pad´ e approximants. Meanwhile, the value ofm max is adjusted tob(l 1)=2c corresponding to the number of nonzero terms in the continued fraction. The time complexity for computing Pad´ e approximants is O(m 2 max ). Using the quotient- difference algorithm [91], computing the coefficients of the degree 2m max associated con- tinued fraction requiresO(m 2 max ) time. After this is done, all desired Pad´ e approximants can be generated iteratively by evaluating the 2m-th convergent of the continued fraction from m = 1 tom max [3, pp. 131]. This procedure takesO(m 2 max ) time in total. Defects in the Pad´ e approximant: a defect in the Pad´ e approximantP m1 (t)=Q m (t) defines a pair, a polez i and a nearby zeroy j . Defects are common in Pad´ e approximants [3, p. 48]. For example [44] shows that adding random errors to a geometric series could generate defects. The defect causes the Pad´ e approximant to become unbounded in a neighborhood of its pole, but has little effect on values of the Pad´ e approximant outside the neighborhood. In order to construct robust estimators, we remove any defect found inP m1 (t)=Q m (t). 29 For implementation, we first use R packagepolynom (v1.3-8) [105] to find all the polesy i and rootsz j in the rational functionP m1 (t)=Q m (t). We identify the pair (z i ,y j ) as a defect if their absolute difference is less than a threshold (0:001 by default). If a defect (z i ;y j ) is detected, we cancel out the factor (ty j ) from numerator and (tz i ) from denominator. The simplified rational function is then passed to the next step (partial fractions) and the obtained estimator is evaluated using the criteria outlined in the main text. Partial fraction decomposition: once all the rootsy i of the denominator in the rational func- tion P m1 (t)=Q m (t) are determined, the coefficient c i associated with the root y i in the partial fraction can be easily obtained by the standard approach [38, pp. 276]. We use two criteria, the locations of poles and the monotonic shape of the SAC, to diagnose our estimator. In a strict sense, we desire that the constructedr-SAC be an increasing func- tion, for everyr. However, in our experience the SAC is sufficient: when it is monotone, so are the otherr-SAC. The condition on the polesx i is used to stabilize the estimator r;m (t), as explained in Proposition 3.2. Estimators r;m (t) with largerm tend to violate conditions in Proposition 3.2 simply because they have more poles. Monotonicity: the derivative of 1;m (t) can be expressed as d dt 1;m (t) = m X i=1 c i x i (tx i ) 2 : We evaluate the above function from t = 1 up to 100 with step size 0:05 to ensure that 1;m (t) is an increasing function. 
The procedure terminates successfully: the estimator r;m (t) form = 1 can be expressed as r;1 (t) = S 2 1 S 2 t t + (S 1 S 2 )=S 2 r : (3.15) 30 If there exist at least one species represented once and one species represented more than once in the initial sample, then we observeS 1 S 2 > 0 andS 2 > 0. This ensures r;1 (t) satisfiesRe(x i )< 0 and r;1 (t) is increasing for everyr 1. So Algorithm 1 returns before the main loop exits. 3.3.3 Variance and confidence interval Deriving a closed-form expression for the variance of the estimator r;m (t) is challenging. On one hand, whenm 5 we have no general algebraic solution to the polynomial equations that identify x i in r;m (t), so a closed-form may not exist. On the other hand, even form = 1 the variance of r;1 (t) involves a nonlinear combination of random variablesS 1 andS 2 (equation (3.15)). We approximate the variance of our estimates by bootstrap [35]. Each bootstrap sample is a vector of counts (N 1 ;N 2 ;:::;N jmax ) that satisfies P jmax i=1 N i = S 1 ; where j max is the largest observed frequency for a species in the initial sample and S 1 is the number of species observed in the initial sample. The (N 1 ;N 2 ;:::;N jmax ) is sampled from a multinomial distribution with probability in proportion to (N 1 ;N 2 ;:::;N jmax ). For each bootstrap, we construct an estimator r;m (t) for ther-SAC. All estimators r;m (t) are then used to calculate the variance of the estima- tor r;m (t). To estimating confidence intervals as percentiles of the bootstrap distribution requires too many samples (e.g. Efron & Tibshirani [35, Chapter 13] suggest 1000) for large-scale appli- cations. Instead we adopt the lognormal approach, where the mean and variance can be accurately estimated using far fewer bootstrap samples. Use of the lognormal is justified by an observed natural skew for quartiles of estimates in our simulation results (Figure 3.2a). All the functionality has been implemented into an R package called preseqR [31]. It is avail- able through CRAN at: https://CRAN.R-project.org/package=preseqR 31 3.4 Simulation studies We carried out a simulation study to assess the performance of the estimator r;m (t). The simu- lation scheme is partly inspired by Chao and Shen [15] but involves populations and samples of larger scale. Following our statistical assumptions, the number of individuals for speciesi in the initial sample follows a Poisson distribution with the rate i , for i = 1; 2;:::;L. The rates i are generated from distributions we have chosen to model populations with different degrees and types of heterogeneity. We measure the degree of heterogeneity in a population by the coefficient of variation (CV) for i : 1 (L 1) 1 P L i=1 ( i ) 2 1=2 ; where = P L i=1 i =L: (3.16) The coefficient of variation quantifies difference in relative abundances among species and is inde- pendent of sample sizes. For the type of heterogeneity, we focus on the shapes of distributions, for example distinguishing those with exponentially decreasing tail versus heavy-tailed distributions. We selected seven models for our simulations. The first is a homogenous model, the Poisson distribution (P), included as a basis for comparison with the other models. Models 2-4 are negative binomial (NB), where the i follow gamma distributions (NB1-NB3). The NB models are widely used in analyses of counts data with overdispersion [52]. We use three gamma parameterization (Table 3.1) to simulate populations with different degrees of heterogeneity (CV = 0:1; 1; 10). 
For CV approaching 0, the values of i approach each other and the corresponding NB model approaches homogeneity. The fifth model is a lognormal (LN) model [10], which has been applied, for example in ecology [87]. Models 6 and 7 are a Zipf distribution (Z) [116] and a Zipf- Mandelbrot distribution (ZM) [71], respectively, which are applied in a variety of fields [1, 79]. The parameter i in the Zipf model is proportional to 1=(i +a) with constanta> 0. For the Zipf model P i i does not converge asL goes to infinity. For the Zipf-Mandelbrot model, the rate i 32 also follows a power law, but the corresponding sum does converge asL approaches infinity. Mod- els 5–7 represent so called heavy-tailed populations [79]. The parameter settings for these models are summarized in Table 3.1. In our simulations we fixed the total numberL of species at 1 million (M) to represent large- scale applications. For the results described below, the initial sample sizeN was also set to 1M individuals so that P L i=1 i L = 1: For each of the 7 models, we simulated 1000 datasets, with each dataset serving as an initial sample from which to construct estimators. Our simulations covered (t;r) representing the region [1; 100]f1;:::; 100g, which more than covers the (t;r) we have seen in practical applications. We use relative error to measure the performance of an estimator. For a fixed value ofr, the relative error is calculated as theL 2 distance between the expected curve E[S r (t)] and the estimate, divided by theL 2 norm of E[S r (t)] over the regiont2 [1; 100], which we approximate at discrete values t = 1; 2;:::; 100. We add 1 on the numerator to avoid problems that may arise (e.g. for larger) when the norm of the expected curve is too close to zero. The errors we report are the means of relative errors over the curves forr = 1; 2;:::; 100. We compared the estimator r;m (t) with several other estimators. The zero-truncated Pois- son (ZTP) [20] and zero-truncated negative binomial (ZTNB) [92] are obvious and expected to perform well when the underlying statistical assumptions of the estimator matches the model of the simulation. The logseries (LS) approach, popularized in ecology, was introduced as a special case of the ZTNB method when the shape parameter in the negative binomial distribution was close to 0 [39]. To our knowledge, there is no nonparametric estimator designed for E[S r (t)] when r > 1. To evaluate other plausible approaches, we made use of two nonparametric estimators for SACs, specifically those due to Boneh et al. [6] and Chao et al. [15], which we refer to as BBC and CS, respectively. We leveraged equation (3.3) in Theorem 3.1 to derive general estimators of 33 Table 3.1: Models used in simulation studies. Parameter settings and the value of CV for each model are listed. For models Z and ZM, the values of CV are calculated based on equation (3.16). For other models, listed values for CV are expectations directly calculated based on the underlying distributions and parameters. Model Name Distribution on rates CV P Homogeneous i / 1 0 NB1 Negative binomial i Gamma(shape=100, scale=1) 0:1 NB2 Negative binomial i Gamma(shape=1, scale=1) 1 NB3 Negative binomial i Gamma(shape=0.01, scale=1) 10 LN Poisson-lognormal log( i ) Gaussian(0; 1) 1:31 Z Poisson-Zipf i / 1=(i + 100) 10:79 ZM Poisson-Zipf-Mandelbrot i / 1=(i + 100) 1:1 15:10 E[S r (t)], forr 1, based on these two estimators of E[S 1 (t)]. These derivations can be found in the appendix (Section C). 
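The sampling scheme described above is straightforward to reproduce. The sketch below generates one replicate under an assumed NB2-like setting (rates drawn from a Gamma distribution), then tabulates the observed counts N_j, the statistics S_r, and the coefficient of variation of equation (3.16). The seed, population size, and parameter values are illustrative only, and the counts produced here are exactly the input consumed by Algorithm 1.

```r
## A minimal sketch of one simulation replicate, assuming the NB2 setting
## (rates ~ Gamma(shape = 1, scale = 1)), L = 1e6 species, and an initial
## sample of roughly N = L individuals (t = 1); all values are illustrative.
set.seed(1)
L <- 1e6
lambda <- rgamma(L, shape = 1, scale = 1)   # latent species rates

## coefficient of variation of the rates, equation (3.16)
cv <- sd(lambda) / mean(lambda)

## number of individuals captured per species in the initial sample
counts <- rpois(L, lambda)
counts <- counts[counts > 0]                # unseen species leave no record

N_j <- tabulate(counts)                     # N_j: species seen exactly j times
S_r <- rev(cumsum(rev(N_j)))                # S_r: species seen at least r times

cv
head(S_r)
```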
3.4.1 Simulation results As can be seen from Figure 3.2a, the estimator r;m (t) performs well under models NB2, NB3, LN, Z and ZM. We consider these to represent heterogeneous populations due to their large CV compared with the homogeneous model (Table 3.1). The relative errors are 0:002 (0:003) and 0:027 (0:011) for NB2 and NB3. The errors for the Z and ZM models are slightly higher: 0:057 (0:042) and 0:057 (0:040), respectively (Table 3.2). Both the relative error and the standard error of r;m (t) are much higher when applied to the homogenous models (Figure 3.2a). Although the NB1 model is a negative binomial distribution, it has the CV close to 0 and we are not surprised that it shows behavior similar to that of the Poisson model. We compared the estimator r;m (t) with the five other estimators. The estimator r;m (t) has the least mean relative error compared with other approaches under the LN, Z and ZM models (Figure 3.2b), which are the heavy-tailed models. The relative errors under these three models are 0:020, 0:057 and 0:057 (Table 3.2). In particular, under the Z and ZM models, the second most accurate approach, our generalization of CS estimator, has relative error 0:525 and 0:558, around 34 P NB1 NB2 NB3 LN Z ZM NB2 NB3 LN Z ZM 0 0.2 0.4 0.6 0.8 1.0 relative error (a) (b) relative error 0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 ZTP ZTNB BBC CS LS Figure 3.2: Relative errors in simulation studies. (a) relative error of the estimator r;m (t) for the seven simulation models. Box plots are based on 1000 replicate simulations. The horizontal bar displays the median, boxes display quartiles and whiskers depict minima and maxima. (b) Mean relative error of all tested estimators for simulated datasets under heterogeneous models (all but P and NB1) based on 1000 replicates for each estimator. The error bars show the 95% confidence interval of relative errors. 10 the error of r;m (t). The estimator r;m (t) has higher standard error compared with the other methods (Figure 3.2b), which we attribute broadly to its use of procedures (e.g. to fit the Pad´ e approximant) that can introduce numerical error. Even considering this variation, when r;m (t) is at its least accurate it remains substantially more accurate than the other methods across models LN, Z and ZM. As expected, for model NB1-3, the ZTNB approach is the most accurate because it matches the precise statistical assumptions of those simulations. Importantly, without any assump- tion about the latent distribution of i , the estimator r;m (t) also yields excellent accuracy in the NB2 and NB3 models, with relative errors less than 5%. The LS approach performs similar to the ZTNB approach when the shape parameter in the NB model is close to zero, as occurs for NB3 (Figure 3.2). Similarly, for the homogeneous population model the ZTP approach is the most accurate. Note that as a nonparametric estimator, our generalization of CS estimator is also very accurate under homogeneity (Table 3.2). We found the estimator r;m (t) to be more accurate when the population samples correspond to heavy-tailed distributions. In general, these are the most challenging scenarios for accurately 35 predicting E[S r (t)] (Figure 3.2b). The accuracy of all estimators deteriorates as the distribution of i change from exponentially-decaying to heavy-tailed. However, the performance of the estimator r;m (t) is much less affected compared with other approaches, and r;m (t) remains accurate in both the Z and ZM models. 
The NB3 and Z models have a similar degree of heterogeneity in terms of CV (Table 3.1), but for all estimators except r;m (t), relative error for Z is clearly larger than the error for NB3. This difference is associated with the change from exponentially decreasing (NB3) compared with the power law distribution. For r;m (t), the relative error remains small in both these scenarios. The above results correspond to an initial sample size ofN = L, but for initial samples of 0:5L to 2L the mean relative error changed very little for the heterogeneous models (Figure 3.3). The error only noticeably increased when the sample size was below 0:4L. 3.4.2 Examining performance on homogeneous populations Clearly our estimator suffers when the samples are generated from a homogeneous population (Fig- ure 3.2a). Initially our intuition was that this should be the “simplest” case for prediction because the i are constrained by a single parameter. Although we found the behavior of r;m (t) on the homogeneous population to be peculiar in multiple ways, we will examine two aspects here: Why does our construction select m differently for homogeneous versus heterogeneous populations? And how can we understand the specific consequences of this difference? Compared with the heterogeneous populations, our construction for r;m (t) selects m in a very narrow range for the homogeneous populations (Figure 3.4). For the LN, Z and ZM models the mode of the selectedm is 10, which is our default starting point–the largestm we allow. In contrast, the values ofm in r;m (t) never exceed 5 for the P and NB1 models, and usually take m = 1. In the 1000 simulations for P and NB1, there are 439 and 489 cases in which r;m (t) reduces to r;1 (t). Although r;1 (t) is sometimes chosen for NB2 (Figure 3.4), it does not appear for any of the other heterogeneous population models. There is a gap betweenm = 1 andm = 3 under the P and NB1 models (Figure 3.4), because the estimator r;m (t) always fails to be an 36 0 0.05 0.1 0.15 0.2 0.25 0.3 relative error 0 0.05 0.1 0.15 0.2 0.25 0.3 0 0.05 0.1 0.15 0.2 0.25 0.3 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0 0.05 0.1 0.15 0.2 0.25 0.3 0 0.05 0.1 0.15 0.2 0.25 0.3 initial sample size (M) NB2 LN ZM NB3 Z (a) (b) (c) (d) (e) Figure 3.3: The relative error of r;m (t) as the size of an initial sample increases. The size an initial sample varies from 0:2L to 2L. The x-axis is the initial sample size. The red line is the mean of relative error over 1000 replicates. The error bars show the 95% confidence interval of relative errors. increasing function atm = 2 under these models (see step 7 in Algorithm 1). On the other hand, for heterogenous models, the estimator atm = 2 always succeeds when the estimator r;m (t) fails required criteria atm = 3. 37 Table 3.2: Relative error and standard error for the seven simulation models. For a fixed value of r, the relative error is calculated as the L 2 distance between expected curve E[S r (t)] and the estimate, divided by theL 2 norm of E[S r (t)] over the regiont2 [1; 100]. The errors we report are the means of relative errors over the curves forr = 1; 2;:::; 100. The numbers in the parentheses are standard errors based on 1000 replicates. 
P NB1 NB2 NB3 LN Z ZM r;m (t) .320 (.112) .292 (.113) .002 (.003) .027 (.011) .020 (.014) .057 (.042) .057 (.040) ZTNB .008 (.008) .009 (.007) .001 (.001) .004 (.002) .090 (.001) .683 (.001) .798 (.001) LS .569 (.000) .546 (.000) .211 (.000) .006 (.002) .137 (.000) .682 (.001) .797 (.001) CS .003 (.002) .050 (.000) .375 (.001) .204 (.002) .410 (.001) .525 (.001) .558 (.001) BBC .500 (.000) .482 (.000) .319 (.000) .126 (.003) .439 (.001) 1.090 (.002) 1.126 (.002) ZTP .002 (.002) .052 (.000) .484 (.000) .299 (.002) .637 (.001) 1.456 (.002) 1.505 (.003) The consequence of resorting tom = 1 is informative. Recall that r;1 (t) can be written as r;1 (t) = S 2 1 S 2 t t + (S 1 S 2 )=S 2 r : Let denote the ratioN=L. In the P model, i =N=L = fori = 1; 2;:::;L. The expectation of S 1 andS 2 is thenL(1 exp()) andL(1 (1 +) exp()), leading to a rough approximation for the coefficient in r;1 (t): E S 2 1 S 2 E[S 2 1 ] E[S 2 ] =L (1 exp()) 2 1 (1 +) exp() : In our simulation study, we have =N=L = 1. The coefficient thus is approximated by E S 2 1 S 2 L (1 exp(1)) 2 1 2 exp(1) 1:5L: Under the P model, the estimator r;1 (t) corresponds to an overestimate of the population sizeL by 50%. The relative error of E[ r;1 (t)] is around 0:42, which is consistent with simulation results in Figure 3.2a. Note that the calculations above depend onN = L. As the sample size increases, the coefficient decreases toward 1, and the relative error of E[ r;1 (t)] decreases. For instance, when N = 2L this coefficient reaches 1:025L and the relative error for E[ r;1 (t)] is 0:235. We 38 Table 3.3: The expected coefficient and pole for r;1 (t) are calculated under the P, NB1 and NB2 models. The closed-form of r;1 (t) is expressed in equation (3.15) in the main text. The second column is the estimate of the expectation of the coefficient in the estimator r;1 (t). The value is calculated by E[S 1 ] 2 =E[S 2 ]. The third column the estimate of the expectation of the pole in the estimator r;1 (t). The value is calculated by (E[S 2 ] E[S 1 ])=E[S 2 ]. model coefficient pole relative error P 1.512L -1.392 0.420 NB1 1.503L -1.385 0.390 NB2 1L -1 0 also calculated the E[ r;1 (t)] under model NB1 (Figure 3.4). The relative error of E[ r;1 (t)] in the NB1 model is 0:39 (Table 3.3). We compared relative errors betweenm = 1 andm > 1. Using one-tailed t-test, we found the relative error for r;1 (t) is significantly greater than the relative error for r;m (t) withm > 1 under both the P and NB1 model (p < 2:2 10 16 ). Note that there are cases where the estimator r;m (t) reduces to r;1 (t) for model NB2. However, the relative errors of r;m (t) is small because the NB2 model is an exception. The expectation of estimator r;1 (t) happens to be unbiased under the model. This explains the small relative errors in the NB2 model even though there are 68 out of 1000 cases wherem = 1. In addition, we found the sample coverage is less under the homogeneous than under hetero- geneous models. Here the sample coverage is the proportion of species in the population that are covered by the sample. In fact, we proved that among all possible populations following a mixture of Poisson distributions, the sample generated by the homogeneous population has the least sample coverage, provided N + 1 < L. The proof can be found in Appendix D. The result means that samples based on the homogeneous model is the least saturated among all populations. We believe all these facts could affect the performance of r;m (t) under the homogeneous model. 
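The rough approximation above is easy to check numerically. The sketch below computes the expected coefficient E[S_1]^2 / E[S_2] and pole (E[S_2] - E[S_1]) / E[S_2] of the m = 1 estimator under the Poisson model, reproducing the roughly 1.5L overestimate of the population size at N = L reported in Table 3.3; the second call simply illustrates that the overestimate shrinks for a larger initial sample.

```r
## A worked check of the m = 1 coefficient and pole under the Poisson model,
## where every species has rate kappa = N / L.  Here E[S_1] = L(1 - e^-kappa)
## and E[S_2] = L(1 - (1 + kappa) e^-kappa); L cancels in the pole and only
## scales the coefficient.
poisson_m1 <- function(kappa, L = 1) {
  ES1 <- L * (1 - exp(-kappa))
  ES2 <- L * (1 - (1 + kappa) * exp(-kappa))
  c(coefficient = ES1^2 / ES2,        # t -> infinity limit of the estimator
    pole        = (ES2 - ES1) / ES2)
}

poisson_m1(kappa = 1)   # coefficient ~ 1.512 * L, pole ~ -1.392 (Table 3.3)
poisson_m1(kappa = 2)   # smaller overestimate for a larger initial sample
```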
3.4.3 Best practices Based on simulations, we found that the estimator r;m (t) is accurate when populations are het- erogeneous (Figure 3.2a). It suffers large relative errors and variance when populations are close to 39 1 3 5 7 9 2 4 6 8 10 1 3 5 7 9 2 4 6 8 10 1 3 5 7 9 2 4 6 8 10 1 3 5 7 9 2 4 6 8 10 1 3 5 7 9 2 4 6 8 10 1 3 5 7 9 2 4 6 8 10 1 3 5 7 9 2 4 6 8 10 0 300 600 P ZM NB1 NB2 NB3 LN Z 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 Figure 3.4: The histogram of the parameterm in constructing r;m (t) based on 1000 replicates. There are 10 bins in each histogram. The 10 bins correspond to the number of estimators with m = 1;; 2;:::; 10. They-axis is the frequency. being homogeneous, a context where the ZTNB works well (Table 3.2). Our best-practice advice is to combine both our estimator r;m (t) and the ZTNB. Whenever samples are generated from a heterogeneous population, we should use the estimator r;m (t); otherwise, we switch to the ZTNB estimator to handle the homogeneous cases. We use coefficient of variation (CV) to measure the degree of the heterogeneity in a population. In practice, whenever the estimated CV is greater than 1, we use our estimator r;m (t); otherwise, we switch to the ZTNB estimator. The problem with this strategy is that the value of i is unobservable. We can not directly assess the heterogeneity of a population. We propose a heuristic approach to estimate the CV. First, the initial sample is fitted to the ZTNB model. Let k denote the estimated shape parameter in the model. The value 1= p k is then used as an estimate for the CV. In particular, If the population follows a negative binomial distribution with the shape parameterk, then the expected value of the CV is exactly 1= p k. From the simulation results, we found that the shape parameter in the ZTNB approach is highly sensitive to the heterogeneity of a population (Figure 3.5a). When applying the ZTNB approach to a sample from a heavy-tailed population, the estimated shape parameter is close to 0. In contrast, for a sample from the homogeneous population, the estimated shape parameter is large by orders of magnitude. All samples from the P and NB1 models have estimated CVs less than 1 and all 40 samples from the NB3, LN, Z and ZM model have estimated CVs greater than 1 (Figure 3.5a). Estimated CVs are around 1 for the NB2 model, where both r;m (t) and the ZTNB approach give accurate estimates (Table 3.2). In practice, whenever the estimated shape parameter is less than 1, or equivalently the estimated CV is greater than 1, we use our estimator r;m (t); otherwise, we switch to the ZTNB estimator. To examine this cutoff, we simulated samples from populations with CVs from 0:1 to 10 using NB models. The mean of relative errors for r;m (t) drops quickly at CV = 0:6 (Figure 3.5b). At the point CV = 1 (k = 1), where we draw the cutoff, the mean of relative errors decreases to the minimum value 0.002. The model with CV = 1 corresponds to model NB2, which is an exception from the previous discussion. In general, we should expect a higher relative error than the relative error in the NB2 model when the CV is not equal to 1. As the CV continues increasing from 1, we observed that the prediction errors are less than 0:05 and become stable (Figure 3.5c). The result suggests that this cutoff CV = 1 can select those cases that are in favor of our estimator and control the relative error in practice. The histogram of the parameterm shows the same trend as well (Figure 3.6). 
As the CV increases from 0.6 to 0.7, one can see the quick drop of the frequency ofm = 1, which corresponds to the quick drop of the mean of relative error. When the CV is no less than 2, the frequency ofm = 1 becomes 0. 3.5 Large-scale applications We applied our estimator to data from three different domains: linguistics, a social network, and a DNA sequencing application. In each case the data may be considered “big.” We adopt a strategy of subsampling from the full available data to generate a ground truth reference for evaluation. We include the ZTNB for comparison due to its popularity for overdispersed counts data [52]. The estimated CV for each dataset is in Table 3.4. 41 −4 −2 0 2 4 shape parameter (log10) (a) P NB1 NB2 NB3 LN Z ZM model 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 CV relative error (b) ZTNB 2 4 6 8 10 CV 0 0.1 0.2 0.3 0.4 relative error (c) Figure 3.5: Degrees of the heterogeneity and effects on relative errors of r;m (t). (a) Estimated shape parameterk. The estimated shape parameterk was obtained by fitting a zero-truncated neg- ative binomial (ZTNB) distribution to the data. Results are obtained by simulation 1000 samples for each model. Thex-axis represents model id. They-axis is the logarithm of estimated shape parameters, with base 10. The red dash line is the cutoff we use to distinguish heterogeneous pop- ulations from homogeneous populations. (b, c) The mean relative error under NB models. We vary the shape parameter in NB models to simulate populations with different values of CV. The red line is the mean relative error of the estimator r;m (t) as a function of the value of CV. The error bar is the 95% confidence interval of relative errors. (b) The range of CV is from 0:1 to 1. (c) The range of CV is from 1 to 10. Table 3.4: CVs for datasets in large-scale applications Shakespeare Dickens Twitter SRX202787 84.82 83.89 85.60 1.31 3.5.1 The vocabulary of Shakespeare and Dickens We first re-examined the Shakespearean vocabulary problem due to Efron and Thisted [34]. The data is 884,647 words written, corresponding to a set of 31,534 distinct words. There are 14,376 distinct words that appear exactly once in the collection, 4343 that appear exactly twice, and so on. The full word appearance frequencies are listed by [34] (Table 3.5). Our task is to predict the number of distinct words that would appear at least r times if some additional quantity of Shakespeare’s work is discovered. Although the ground truth is unknown, our prediction results are surprisingly consistent with previous studies [34]. 42 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 12 3456 78910 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 0 300 600 CV = 0 CV = 0.1 CV = 0.2 CV = 0.3 CV = 0.4 CV = 0.5 CV = 0.6 CV = 0.7 CV = 0.8 CV = 0.9 CV = 1 CV = 2 CV = 3 CV = 4 CV = 5 CV = 6 CV = 7 CV = 8 CV = 9 CV = 10 Figure 3.6: The histogram of the parameterm for r;m (t) as CV increases. The shape parameter in the NB model is selected to simulate populations with different CVs. In particular the model with CV = 0 represents the homogeneous model. The histogram is based on 1000 replicates from each model. There are 10 bins in the histogram. 
Thej th correspond to the number of estimators withm =j. They-axis is the frequency. Table 3.5: Shakespeare’s word use frequencies. Entryr isS r , the number of words that appear at leastr times in Shakespeare’s known work. r 1 2 3 4 5 6 7 8 9 10 +0 31534 17158 12815 10523 9060 8017 7180 6542 6023 5593 +10 5229 4924 4665 4423 4200 4013 3832 3653 3523 3396 The numbers S j of distinct words that appear in the collection at least j times, for j = 1; 2;:::; 20, are given in Table 3.5. We applied Algorithm 1 and obtained: r;m (t) = 120357:66 t t + 14:91 r + 24934:99 t t + 1:13 r + 13453:12 t t + 0:10 r : 43 Table 3.6: Predictions by r;m (t) for unique words in Shakespeare’s work. The expected number of word types when a total of 884647t words are discovered. The first and second columns are estimates and standard error by r;m (t). The fourth and fifth columns are lowerbound and upperbound estimated by Efron’s estimator [34, Table 5]. lower bound by upper bound by t r;m (t) SE Efron-Thisted estimator Efron’s estimator 2 11,459 586 11,205 11,732 4 26,494 1,171 23,828 29,411 6 37,215 2,582 29,898 45,865 11 55,501 7,675 34,640 86,600 21 75,894 18,765 35,530 167,454 The estimator 1;m (t) predicts 42993 (586:17) distinct words whent = 2 (i.e. the unlikely event that “the other half” of Shakespeare’s were to be discovered). The additional work is expected to contain 11459 new word types. The corresponding prediction by Good and Toul- min’s estimator is 11430, and the prediction by Fisher’s negative binomial model is 11483 [34]. Prediction results of 1;m (t) fort = 4; 6; 11; 21 are in Table 3.6. All these estimates are consistent with the estimation by [34] (Table 3.6). Among Shakespeare’s known works, 17,158 words appear at least twice. When t = 2 and r = 2, r;m (t) predicts a total of 24101 distinct words are expected to be observed at least twice. So there are 6943 new word types observed at least twice when doubling the amount of text. These new word types could be either from Shakespeare’s known work that are observed exactly once, or from words only observed in the additional work. A total of 14376 word types appeared exactly once in Shakespeare’s known work. Therefore at least 7433 (14376 6943) word types appeared once in Shakespeare’s known work are likely to be missed if another same amount of work by Shakespeare are discovered. We also applied the estimator r;m (t) to infer word frequencies in a sample of Charles Dickens’ work. We used data from Project Gutenberg [50] that is included in the R packagezipfR (v0.6- 6) [37]. This data set contains roughly 2.8M written words, of which just over 41k are distinct. We sampled 300k words from the dataset as an initial sample and applied r;m (t) for values of 44 0 1 2 3 4 total words (M) 0 10k 20k 30k 40k 50k distinct words (a) 0 1 2 3 4 total words (M) 0 10k 20k 30k 40k 50k (b) ZTNB r = 1 r = 2 r = 5 r = 10 r = 20 initial predicted distinct words Figure 3.7: distinct words represented at leastr times in samples of Dickens’ work as a function of words written. Estimates are based on an initial sample of 300k words (vertical dashed lines), extrapolated to 13 the initial sample size. Expected values were obtained by subsampling the full data set (about 9 the initial sample size) without replacement. (a) Estimator r;m (t). (b) ZTNB. r between 1 and 20. Figure 3.7 shows the estimated curves along with actual curves from the entire Dickens data set. The estimated curves track the corresponding actual curves very closely. 
In contrast, ZTNB is inaccurate for bothr = 1 andr > 1. Table 3.7 shows estimated values and their standard errors (SD) for extrapolations of 5 and 9, the latter being the maximum possible given the size of the data set. Even atr = 20 the relative error never exceeds 5%. Here the relative error is the absolute deviation between the predicted value and the observed value, dividing by the observed value. We also examined the behavior of r;m (t) as a function of r. As can be seen from Figure 3.10a, r;m (t) remains accurate for large values ofr. In comparison, ZTNB tends to overestimate the observed values (Figure 3.10a). Previously our initial samples were generated from downsampling the entire dataset. In this example, we would like to use one of Charles Dickens’ works as the initial sample to predict the number of unique words in all his work. We chose the novel “Our Mutual Friend” as the initial sample. It is written in the years 1864-1865, which is the last novel completed by Charles Dickens 45 (a) (b) r = 1 r = 2 r = 3 r = 5 r = 10 r = 20 predicted initial 0 1 2 3 4 total words (M) 0 10k 20k 30k 40k 50k distinct words 0 10k 20k 30k 40k 0 0.5 1 1.5 2 2.5 3 total words (M) expected empirical distinct words Figure 3.8: Empirical versus expected number of distinct words. (a) Distinct words represented at leastr times in samples of Dickens’ work as a function of words written. Estimates are based on the novel “Our Mutual Friend” (vertical dashed lines), extrapolated to 13 the size of the novel. Expected values were obtained by subsampling the full data set (about 9 the initial sample size) without replacement. (b) Empirical versus expected number of distinct words in samples of Dickens’ work as a function of words written. The solid line is the expected accumulation curve by subsampling the entire dataset without replacement. The dashed line is the actually observed accumulation curve. according to Wikipedia. The novel contains roughly 300k words, of which 15,947 are distinct. Figure 3.8 shows the estimatedr-SAC along with actual curves by subsampling the entire Dickens data set. Comparing with prediction results based on a random sample of size 300k (Figure 3.7), estimates based on the novel were less close to the subsampled curve. This results is expected because our approach assumes that the initial sample and the future sample are homogeneous. And the golden-standard curves for comparison was made by subsam- pling the entire Charles Dickens’ work without replacement, which also implied the homogeneity between samples. However, the category of words from one novel to another can be very different depending on the background, the story, the age of the author and many other factors, so the homo- geneity may not hold between the novel “Our Mutual Friend” and other Charles Dickens’ work. To justify the assumption, we plotted the expected accumulation curve based on subsampling and the actually observed accumulation curve (Figure 3.8b). The largest difference between two curves is around 4K words. This explains the inaccuracy of using one novel to predict the r-SAC. But overall the homogeneous statistical assumption is reasonable (Figure 3.8b). 46 Table 3.7: Accuracy in predicting Dickens’ frequency of word use. Observed values and estimates from r;m (t) for numbers of words used at leastr times in a text along with scaled error. Estimates are shown for extrapolating to a text that is 5 and 9 larger than the initial 300k word sample. 
5 9 r observed predicted error observed predicted error 1 32926 32283 0.02 41116 37889 0.08 2 21063 22521 0.07 26896 29505 0.10 3 16781 17199 0.02 21894 23899 0.09 5 12532 12345 0.01 16953 17345 0.02 10 8170 8205 0.00 11650 11321 0.03 20 5048 5181 0.03 7673 7794 0.02 3.5.2 Followers in a social network We also applied the estimator r;m (t) to predict the number of Twitter users that will have r or more followers, based on a small sample of “(follower, followed)” relations. We obtained a data set from the Social Computing Data Repository [114]. This data set contains 11.3M users and over 85.3M following relationships, which form edges in this social network. We randomly sampled 5M edges as an initial sample and used this to estimate the number of users with at leastr followers in larger sets of following relationships. Estimates using r;m (t) show high accuracy when we extrapolated to 5 the size of the initial sample, as can be seen in Figure 3.9. For example, when the total number of following relations is 25M, we should see roughly 1.2M individuals with at least 2 followers (Table 3.8). Our prediction of just around 1.3M is off by 3%. Accuracy decreases for larger extrapolations. The entire dataset contains 2.8M users, each of which has at leastr = 2 followers. Our estimator predicts 3.1M, an overestimate of around 10%. Interestingly, accuracy does not rapidly worsen withr and seems to remain high for values ofr up to 100 (Figure 3.10b), consistent with our results on the linguistic data set. For both the Dickens and the Twitter applications, the error from ZTNB is substantially higher than from our estimator (Figure 3.7, 3.9). At the same time, the estimates from ZTNB are less sensitive tor than those of r;m (t) (Figure 3.10). 47 0 20 40 60 80 100 following relationship (M) (a) 0 1 2 3 4 5 6 7 users (M) 0 20 40 60 80 100 following relationship (M) 0 1 2 3 4 5 6 7 users (M) ZTNB (b) r = 1 r = 2 r = 5 r = 10 r = 20 initial predicted Figure 3.9: Twitter users with at least r followers in a sample of following relations. Estimates are based on an initial sample of 5M relations (vertical dashed lines), extrapolated to 20 initial sample size. Expected values were obtained by subsampling without replacement the full data set, which is about 17 the size of the initial sample. (a) The estimator r;m (t) and (b) ZTNB. Table 3.8: Accuracy in predicting number of Twitter users with multiple followers. Observed values and estimates from r;m (t) for numbers of users withr or more followers along with scaled error. Predictions are shown for extrapolating to a social network that is 5, 10 and 17 larger than the initial 5M following relationship sample. 5 10 17 r observed predicted error observed predicted error observed predicted error 1 3205418 3164608 0.01 4861765 4611910 0.05 6626985 5864172 0.12 2 1240448 1320501 0.06 1998589 2339063 0.17 2850180 3513562 0.23 3 804297 785919 0.02 1325177 1389673 0.05 1917058 2270164 0.18 5 490018 481062 0.02 831020 779502 0.06 1224301 1205101 0.02 10 251777 254193 0.01 454867 454825 0.00 688106 643717 0.06 20 122336 122847 0.00 241468 243418 0.01 386003 392038 0.02 3.5.3 Coverage in DNA sequencing experiments To evaluate our approach on a larger scale, we applied our estimator to predict the number of nucleotides in the human genome that will be represented at leastr times in a sequencing data set. 
In genomics terminology, these are the positions in the genome covered by at least r sequenced 48 (a) (b) users followed Dickens 10 3 10 3.5 10 4 10 4.5 10 5 10 4.5 10 5 10 5.5 10 6 10 6.5 10 7 0 20 40 60 80 100 0 20 40 60 80 100 value of r value of r observed ZTNB Twitter distinct words Figure 3.10: Species observed at leastr times along with estimates forr up to 100. (a) Distinct words represented at leastr times along with estimated values fort = 9. (b) Users with at leastr followers along with estimated values fort = 17. The estimators r;m (t) and ZTNB are compared. reads, or the positions with coverage of at leastr. Coverage is critical in genetics studies, for exam- ple in detecting SNPs, where candidate SNPs with low coverage are often discarded. Knowing the distribution of coverage can help researchers in experimental design, informing the total amount of required sequencing in order to attain sufficient coverage over an acceptable number of genomic sites [119]. We downloaded a single-cell DNA sequencing experiment from NCBI (accession numbers SRX202787) to evaluate the performance of r;m (t) [118]. We subsampled 5M reads as a “shal- low” sequencing experiment. The sample was mapped to human genome reference GRCh38/hg38 by BWA using default parameters [64]. We used Picard tools [9] to sort mapped reads and Sam- tools [65] to obtain the number of nucleotides covered by j sequenced reads, for j = 1; 2;::: These formed the counts N j . We then estimate the number of sites that would attain minimum required depth as sequencing continues based on counts N j . Figure 3.11a shows the curves for estimated values r;m (t) for multiple values ofr, along with the actual expected values obtained by repeated subsampling from the full data set. Figure 3.11b presents the same information for 49 nucleotides sequenced (G) 0 10 20 30 40 50 60 70 base-pairs covered (G) 0.0 1.0 2.0 3.0 0.5 1.5 2.5 3.5 (a) nucleotides sequenced (G) 0 10 20 30 40 50 60 70 base-pairs covered (G) (b) ZTNB r = 1 r = 2 r = 5 r = 10 r = 20 predicted 0.0 1.0 2.0 3.0 0.5 1.5 2.5 3.5 Figure 3.11: Base pairs covered at least r times in a DNA sequencing data set. Estimates are based on an initial sample of 5M reads (500M nucleotides), as indicated by vertical dash lines, which were extrapolated to more than 160 the size of the initial sample. Expected values were obtained by subsampling without replacement the full dataset, which is around 107 the size of the initial sample. Initial sample size is not indicated as the small size would not be visible. Estimates made using (a) the estimator r;m (t) and (b) ZTNB. estimates based on the ZTNB. This data set is sufficiently large to reveal the inflection points in the curves whenr > 1. Estimates from r;m (t) closely track the true values. Even extrapolating up to 100 times, the relative error is less than 5% for variousr (Table 3.9). The ZTNB, on the other hand, overestimates E[S 1 (t)] and then underestimates for other values ofr. The value of r;m (t) atr = 1 andt = 100 provides a lower-bound estimate of the population size. This may be interpreted as the number of “known” base pairs (assembled and mappable) in the human reference genome, which in this application is about 3.05 billion. The relative error of the estimation is 0:13 (0:01). We should further take into account the actual population size of the sequencing library, i.e. the observable fraction of the population. 
We may assume at least 5% of the reference genome cannot be identified in the sequencing library, no matter the sequencing depth, which is usually the case for whole genome sequencing. The relative error for estimatingL in this case would be less than 0:09. 50 Table 3.9: Accuracy in predicting number of base pairs covered byr or more reads. Estimates from r;m (t) for the number of base pairs covered byr or more reads along with standard errors (SD) and relative errors. Estimates are shown for extrapolating to a DNA sequencing experiment that is 5 and 100 larger than the initial 5M reads. The SD is estimated by 100 bootstrap samples. The relative error is the absolute deviation between the predicted value and the observed value, dividing by the observed value. 5 100 r predicted SD relative error predicted SD relative error 1 1,129,927,035 329,156 0.010 2,652,400,412 17,217,958 0.022 2 479,320,554 638,870 0.017 2,446,839,012 11,662,586 0.015 5 57,629,462 174,216 0.041 1,930,375,358 1,472,315 0.001 10 4,345,056 67,038 0.110 1,323,241,725 5,343,486 0.003 20 717,334 70,268 0.226 667,534,255 3,020,688 0.021 3.6 Conclusion and discussion We introduced a new approach to estimate the number of species that will be represented at least r times in a sample. The nonparametric estimators obtained by our approach are universal in the sense that they apply across values ofr for a given population. We have shown that these estimators have favorable properties in theory, and also give highly accurate estimates in practice. Accuracy remains high for large values of r and for long-range extrapolations. This approach builds on the theoretical nonparametric empirical Bayes foundation of Good and Toulmin [45], providing a practical way to compute estimates that are both accurate and stable. The foundation for our approach is a relation between the r-species accumulation curve E[S r (t)] and the (r 1) th derivative of the average discovery rate E[S 1 (t)]=t. This relation char- acterizes E[S r (t)] directly, avoiding the summation of E[N j (t)] estimates. Clearly any estimator for either E[S r (t)] or E[N r (t)] provides a means of estimating both quantities. By definitionS r (t) is the sum ofN j (t) forj r. SimilarlyN r (t) can be written asS r (t)S r+1 (t). We prefer to work withS r (t) because E[S r (t)] is an increasing function—an property that is extremely useful for identifying problems during estimator construction (Section 3.3.2). We use rational functions to approximate the r-SAC. The advantages of RFA stem from increased freedom to describe functions or to constrain how those functions are estimated. The 51 coefficients of an approximating rational function are usually determined in a way that allows them to best fit observed data. The choice of forms for rational functions, on the other hand, can be independent of the data and can be determined by prior knowledge of the target function we seek to approximate. In this work we use a class of rational functionsP m1 (t)=Q m (t) with numerator degreem1 and denominator degreem to mimic the behavior of the average discovery rate, which is close toL=t for larget. Other forms of rational functions, in particular exampleP m (t)=Q m (t) andP m+1 (t)=Q m (t), were used when those forms made sense [27, 31]. Our empirical accuracy evaluations were based on applications in which the underlying data sets can be considered large compared to traditional applications from ecology. 
In particular, for modern biological sequencing applications, samples are frequently in the millions, and the scale of the data could be different by orders of magnitude. These large-scale applications present new challenges to traditional capture-recapture statistics, and call for methods which can integrate high-order moments to accurately characterize the underlying population. We generalized the classical study of estimating a species accumulation curve and propose a nonparametric estimator that can theoretically leverage any number of moments. We believe both this generalization and the associated methodology suggest possible avenues for practical advances in related estimation problems. 52 Chapter 4 Coverage estimation for DNA sequencing experiments The development of DNA sequencing technologies is progressing at its fastest pace in 20 years. Since the first draft of the human genome was made available in 2001 [25, 106], DNA sequenc- ing technology has evolved at a stellar pace. The original project took approximately 10 years to complete the first human genome, with a cost of around $2.7 billion. In comparison, nowa- days whole-genome sequencing (WGS) of a person takes less than a week and costs under $2,000. The sample is no longer restricted to bulk cells. Single-cell sequencing [78, 118, 42, 68, 16] has also become available, thanks to the advance of isolation and amplification methods, and the decreasing cost of DNA sequencing. Each individual can be characterized at the cellular level through genomic variations, combining DNA methylation landscapes (epigenomics), gene expres- sions (transcriptomics), and protein expressions (proteomics) [113]. As a result, disease prevention and treatment strategies can take individual variations into account, which is known as precision medicine or personalized medicine [21]. Whole-genome sequencing (WGS) and whole-exome sequencing (WES), are two types of DNA sequencing widely used in healthcare and research for identifying genomic variations that are responsible for diseases. For instance, Ng et al. [81] showed the association between the MYH3 gene and Freeman-Shaldon syndrome (FSS), based on WES of healthy human samples and unre- lated individuals with FSS; Choi et al. [19] made genetic diagnosis of congenital chloride diarrhea in a patient through mutations in SLC26A3 using WES; Puente et al. [88] used WGS to identify four genes that are recurrently mutated in Chronic lymphocytic leukaemia (CLL) patients. Ding et al. [33] revealed two clonal evolution patterns during acute myeloid leukaemia (AML) relapse by 53 WGS in AML patients. Moreover, many large-scale DNA sequencing projects have been launched to study genomic heterogeneity among populations. For example, The 1000 Genomes Project, which began in 2008, has analyzed 2,504 genomes from 26 populations using both WGS and WES, providing the most comprehensive human genetic variants [5]. WGS can detect single nucleotide variants in non-coding regions which include regulatory regions such as promoters and enhancers. Other types of genomic variants, such as structural variants (SVs), are more identifiable through WGS than WES because the later only provides a gapped view for exon regions [73]. Amplification is usually not necessary for whole-genome sequencing but is frequently used in WES in order to obtain a sufficient amount of genetic materials for capturing exons. 
Even for protein-coding regions, the coverage and the evenness of reads favor WGS over WES, because capture baits in WES and amplification could introduce extra bias for the sequencing library and result in sequences dropout and nonuniform of coverage. In addition, WES is designed for a few common species, such as humans, mice, and corn. For most species, the choice of WES is very limited. On the other hand, WES is a technology that specifically targets to sequence the protein-coding region of the genome. The method uses biotinylated oligonucleotide probes (baits) to capture exons. These captured regions are then pulled down, enriched and sequenced by high-throughput DNA sequencing. Compared with whole-genome sequencing (WGS), WES can archive higher coverage of protein-coding regions at a lower cost because the protein-coding regions usually constitute only a small fraction of the genome. For instance, in human genome the protein-coding regions is about 1% to 2% of the whole genome [80]. The cost is the main issue for WGS. According to Genohub [86], the price for performing a human WGS with a sequencing depth of around 35 is a little over $1,100. In comparison, the price for human WES using 62 million base pairs (Mb) targeted regions with around 100 coverage is half the price of WGS. Although the cost of DNA sequencing per base continues to decrease, the overall expense is still high for big sequencing projects. First, the number of samples subject to sequencing could be large due to the heterogeneity at both population and cellular levels. 54 For large population projects, the number of individuals subject to sequencing is large, e.g., 1000 Genomes Project reconstructed 2,504 individual genomes. At cellular level, studying tumor evolu- tion requires sequencing many individual cells to reveal cell-to-cell heterogeneity [78, 110, 118]. Second, the cost of sequencing is proportional to the sequencing depth. For examples, studies that aim to identify rare mutations rely on deep sequencing [103], which increases the cost. In addi- tion, storing, processing and analyzing sequences rely on many computational resources, including hardware and software, along with the man-hours and intelligence of well-trained data scientists [96]. The cost for data processing has become a larger fraction of the cost than the expense for generating the data [77]. For both WGS and WES, experimental designers must determine the sequencing depth in order to optimize benefit-cost ratio of the experiments under limited resources. The sequencing depth is defined as the number of sequenced nucleotides divided by the length of the genome. While this statistic is commonly used to measure the coverage of a sequencing experiment, it contains little information about the evenness of the sequencing library. For a very biased sequencing library, it is possible that the majority of reads are aligned to only a small fraction of genome, leaving the other regions of the genome uncovered. In this case, the sequencing depth does not reflect the actual coverage of the genome. An accurate measurement is per-base coverage, or simply termed coverage, which is the number of times that a nucleotide is covered in the genome [98]. For a sequencing experiment, one can express the genomic coverage as a vector (N 1 ;N 2 ;:::), whereN j is the number of nucleotides in the genome covered by exactlyj reads. Sequencing centers usually use the percentage of genome covered by at leastr reads to measure the quality of the experiment. 
Herer is chosen to be greater than 1 in order to confidently distinguish the true single nucleotide variant from the sequencing error. The exact r in each project may differ due to the goal of the study, the budget, etc.. It is no trivial matter to determine the number of reads that should be sequenced in order to achieve a given criterion, i.e., the fraction of genome with coverage at least r. If reads are ran- domly and uniformly distributed along the genome, Lander-Waterman model can accurately infer 55 the coverage [61]. However, due to amplification bias and other experimental factors, unifor- mity of genome coverage is difficult to achieve. In particular, single-cell sequencing requires pre- amplification, which causes a large variation in coverage compared with bulk sequencing [118]. This calls for methods that can take the heterogeneity into account. In the previous chapter, we proposed a non-parametric estimator r;m (t) to predict the number of species represented at least r times in a random sample, based on an initial sample. Extensive simulations and a few real data applications showed the superior of our method among all current methods when the abundance of species is heterogeneous. While the method has a wide range of applications in many different domains, in the context of genomic sequence, the method can be used to predict the number of nucleotides covered at leastr times in the genome as a function of the number of sequenced bases. In this chapter, we apply r;m (t) to accurately determine the optimal number of bases to sequence in order to achieve a certain pre-defined criterion for genome coverage, based on the sample from a pilot shallow sequencing experiment. Such criterion can be, for example, 85% of the nucleotides in human genome are covered by at least 4 reads. The study mainly focused on the two sequencing technologies, single-cell whole-genome sequencing (scWGS) and bulk whole- exome sequencing (WES). The former is the major tool to study the heterogeneity of genomic variations in a cell population such as tumor cells. The latter is the commonly used approach to detect rare mutations in protein-coding regions associated with diseases. For both scWGS and WES, we evaluated the accuracy of our method for predicting genome coverage, and then deter- mined the optimal sequencing depth to achieve sufficient genome coverage. Various commonly used sequencing protocols in both scWGS and WES were assessed, and their optimal sequencing depths were determined individually and compared. We also proposed two criteria for sequencing experiment design, one from the perspective of sufficient coverage, and the other from the per- spective of the benefit-cost ratio. All the results demonstrated the high accuracy of our method for predicting the genome coverage, and the great reliability of the method for optimizing the sequencing depth. With the exponential growth in the number of sequencing experiments imple- mented in both research studies for scientific questions and clinical studies for genetic diagnostics, 56 our method will be in great demand in those studies, to efficiently save the cost and optimally allocate the sequencing resources. 4.1 Overview of the method for predicting genome coverage We investigate the accuracy of our method for estimating genome coverage for a large sequencing experiment based on the sample from a shallow sequencing experiment. Here genome coverage is measured using the percentage of bases in the genome that is covered by at leastr reads. 
We studied the performance for r = 4; 8 and 10 since they are the commonly used thresholds in literature. For example, Sarin et al. [95] usedr = 4 for mutant allele identification in the genome of Caenorhabditis elegans. Wang et al. [109] appliedr = 8 to call new driver mutations in gastric cancer [109]. Puente et al. and Fujimoto et al. [88, 41] chose r = 10 to identify mutations in lymphocytes. To evaluate the prediction accuracy of our method, we downsampled a subset of reads from the full experiment, and used it as the initial sample to predict genome coverage in a large sample, which is 2, 5, 10, and 20 times the size of the initial sample. We then compared the predicted genome coverage with the ground truth, which is approximated by subsampling the full data, to assess the prediction accuracy of our method. We used the relative error, which is defined as the difference between the predicted and the true coverages divided by the true coverage, to quantita- tively measure the prediction error. Since the relative error can be either positive or negative, we also used the absolute value of the relative error to compare between different results. See Methods for the details of the procedures in data processing. Once the genome coverage is accurately predicted using a shallow sequencing experiment, researchers can easily determine the desired sequencing depth based on some pre-defined criterion. The criterion can be from the perspective of sufficient coverage, such as the minimal sequencing depth such that the majority of nucleotides in the genome will be covered by sufficient number of 57 reads. We can also choose the criterion from the perspective of benefit-cost ratio, which is a com- monly used measurement in the business model to determine the best amount of investment that achieves benefits while preserving savings. In the context of sequencing experiment, the number of new nucleotides with sufficient coverage increase rapidly at the beginning of the sequencing, and becomes stable when the library is saturated. If sequencing additional reads hardly increases the number of nucleotides with sufficient coverage, it is better to stop sequencing to save the resources. 4.2 Results 4.2.1 Predicting genome coverage for WES samples There are four commonly used exome capture systems: NimbleGen’s SeqCap EZ Exome Library, Agilent’s SureSelect Human All Exon, Illumina’s TruSeq Exome Enrichment and Illumina’s Nex- tera Exome Enrichment [18]. The differences among these systems are mainly from the targeted regions and the baits [18]. We are interested in evaluating our method for predicting genome coverage of WES samples using a shallow sequencing experiment, and comparing the prediction accuracy among these four types of exome capture systems. We collected WES samples from four independent studies, and each study used one of the four different exome capture systems [93, 29, 56]. For each study we chose five deeply sequenced datasets as representatives (Table 4.1). To generate the shallow sequencing data, we randomly sampled a subset of 1 million (M) reads from the original sample as the initial sample for our prediction. Our estimator is constructed based on the statistics such as the number of nucleotides in the genome covered byj reads,j = 1; 2;:::; in the initial sample. Thus we mapped the reads in the initial sample to the human genome to obtain the statistics. We removed the duplicate reads that are mapped to the exact same position in the genome. 
In other words, the coverage of each nucleotide in the targeted exome regions was counted based on uniquely mapped reads, not the 58 total aligned reads. This is a special and common step in WES data processing in order to mitigate the PCR amplification bias during the library construction [4]. Table 4.1: Whole-exome sequencing datasets. Whole-exome sequencing datasets from four inde- pendent projects with four different human exome capture systems, namely Agilent’s SureSelect Human All Exon, NimbleGen’s SeqCap EZ Exome Library, Illumina’s Nextera Exome Enrichment and Illumina’s TruSeq Exome Enrichment. accession id bait resource SRR1301329 NimbleGen’s SeqCap EZ Exome v2 Sanders et al. [93] SRR1301343 SRR1301345 SRR1301347 SRR1301348 ERR315871 SureSelect v4 Mb All Exon kit Damm et al. [29] ERR315877 ERR315885 ERR315904 ERR315908 NA12878-12p-5 TruSeq Exome Library Preparation kit Project: HiSeq 4000: TruSeq Exome from Illumina basespace NA12878-12p-16 NA12878-12p-17 NA12878-12p-29 NA12878-12p-41 NA12878-N701-S2 Nextera Rapid Capture Exome v1.2 Project: HiSeq 4000: NRC Exome from Illumina basespace NA12878-N701-S3 NA12878-N707-S2 NA12878-N707-S3 NA12878-N707-S4 We previously proposed a non-parametric estimator r;m (t) to estimate the number of nucleotides in the reference genome that is cover by at least r reads as a function of sequenc- ing effort t [32]. Here t is the relative size of the large sample compared with the size of the initial sample. For example, ift = 2, the large sample is twice the size of the initial sample. The parameterm is a hyper-parameter which can be determined automatically by data. The estimator is constructed all based on the observations in the initial sample. The statistic r;m (t) was developed previously as a general method for predicting the number of species represented at leastr times in 59 a random sample as the sample size increases. In the context of genomic sequencing, the species refers to the nucleotide in the genome and the sample size refers to the number of sequenced bases in the sequencing data. Since duplicated reads in WES data are not considered in computing genome coverage, the sequencing effort in terms of the total reads does not reflect the true sequencing effort in terms of the uniquely mapped reads. In addition, the duplicate rate is not a constant but grows as sequencing effort increases. Directly applying r;m (t) based on genome coverage of the initial sample will underestimate the sequencing effort. Thus, we developed a new statistic u r;m (t) to estimate the number of nucleotides covered by at leastr unique reads as a function of sequenced bases. The statistic integrates two steps of estimation, one for estimating the number of unique reads as a function of total number of reads, and the other one for estimating the number of nucleotides covered by at least r unique reads as a function of the number of unique reads sequenced. See Methods for details of the new statistic u r;m (t). We applied u r;m (t) for the shallow WES experiments of 1 M reads to predict the number of nucleotides in the targeted exome regions with coverage at least r as a function of sequencing effortt. In general, the prediction by u r;m (t) is accurate for all combinations ofr andt, and for all exome capture systems, with the mean of the absolute relative errors among multiple samples in each exome capture system smaller than 21:6% (Figure 4.1). 
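The accuracy summaries reported here and in Figure 4.1 are simple functions of the predicted and true coverage values; a schematic sketch follows. The predictor passed in is a hypothetical stand-in for the actual estimator, and resampling the counts histogram is only one way to approximate the bootstrap confidence intervals described below:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_error(predicted, truth):
    """Signed relative error: (predicted - truth) / truth."""
    return (predicted - truth) / truth

def bootstrap_ci(counts, predict_fn, n_boot=100, level=0.95):
    """Percentile interval for a prediction computed from a counts histogram.

    counts[j] = number of nucleotides covered by exactly j+1 reads in the
    initial sample (hypothetical input format); predict_fn maps a counts
    vector to a predicted value.
    """
    total = counts.sum()
    probs = counts / total
    estimates = [predict_fn(rng.multinomial(total, probs)) for _ in range(n_boot)]
    alpha = (1.0 - level) / 2.0
    return np.percentile(estimates, [100 * alpha, 100 * (1 - alpha)])

# toy usage with a trivial stand-in predictor (here: the count of singletons)
toy_counts = np.array([5000, 2000, 800, 300, 100])
print(relative_error(predicted=2.05e6, truth=2.0e6))
print(bootstrap_ci(toy_counts, predict_fn=lambda c: c[0]))
```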
In particular, when r = 4, the mean absolute relative errors were 1.9%, 3.5%, 5% and 3.5% at t = 2, 5, 10 and 20, respectively, over all 20 datasets. The mean absolute relative error did not change much when r increased to 8 and 10, except for t = 2, where the relative errors increased for all four exome capture systems. The complete data for the relative errors and their confidence intervals for all samples and all combinations of t and r can be found in Table E.1 in Supplementary Materials. The confidence intervals were estimated based on 100 bootstrap samples of the initial sample. The size of the confidence intervals on average increased as r increased. Estimation for large r is in general more difficult than for small r, because only a few nucleotides in the initial sample are observed with coverage greater than r, resulting in a large variation of the estimates. Enlarging the initial sample can reduce both the relative error and the variation of the estimates. For example, the mean absolute relative error drops from 0.158 to 0.022 at t = 2 and r = 10 when using 10 M reads as the initial sample, compared with using 1 M reads as the initial sample. The size of the confidence interval also decreases from 0.219 to 0.020 on average.

[Figure 4.1: Relative errors of ν^u_{r,m}(t) for WES datasets. The mean of the absolute relative errors of ν^u_{r,m}(t) over WES samples from the same exome capture system, for t = 2, 5, 10 and 20 and r = 4, 8 and 10. The predictions were based on ν^u_{r,m}(t) using initial samples of 1 M reads. The relative error is calculated as the difference between the estimated and the true numbers of nucleotides covered by at least r reads, divided by the true value. (a) r = 4; (b) r = 8; (c) r = 10.]

4.2.2 Minimally sufficient number of sequences for WES experiments

Based on the evaluation of the prediction accuracy of our method, we are confident in using it to estimate the amount of sequencing needed in a WES experiment to attain sufficient coverage of the genome. In particular, the objective is to estimate the minimum number of bases to sequence such that α percent of the targeted exome regions is covered by at least r unique reads. Though different criteria for α and r may be chosen according to researchers' own needs, our method has shown accurate estimates of genome coverage for r ≤ 10 when using 1 M reads as the initial sample.

For the 20 WES samples from the four exome capture systems, we estimated ν^u_{r,m}(t) as a function of sequencing effort t for each sample. We set α = 85%, the criterion used by the Broad Institute, as the goal for designing the sequencing experiment. Table 4.2 shows the predicted minimally sufficient amount of sequencing in each WES experiment for r = 4, 8 and 10, based on initial experiments of 1 M reads. An infinite value in the prediction means that the criterion cannot be satisfied no matter how deeply the library is sequenced. The minimally sufficient amount required to satisfy the given criterion varied among exome capture systems and was consistent within the datasets from the same exome capture system. WES datasets from SureSelect Human All Exon required the fewest sequenced bases to achieve the criterion compared with the other exome capture systems. Further, the targeted regions for SureSelect Human All Exon are actually longer than those of the other exome capture systems.
The results above indicates SureSelect Human All Exon has the best uniformity of coverage among four exome capture systems. In other words, SureSelect Human All Exon is the most efficient exome capture system for WES among the four exome capture systems in this study. On the other hand, the prediction result suggests that sequencing experiments based on Nextera Rapid Capture Exome may not cover 85% of targeted regions with coverage depth of at least 10 no matter how many reads are sequenced. To confirm this result, we counted the percentage of targeted regions with coverage at least 10 for full datasets on Nextera Rapid Capture Exome, 62 Table 4.2: Minimally sufficient number of sequences for WES experiments. The estimated min- imally sufficient number of sequencing that can satisfy 85% of targeted regions with coverage at leastr = 4; 8 and 10 in a WES experiment, based on an initial sample of 1M reads. The values in the parentheses are 95% confidence interval constructed by bootstrapping the inital samples. bait accession id minimally sufficient amount of sequenced bases (G) r = 4 r = 8 r = 10 NimbleGen SRR1301329 2.1 (1.9, 2.3) 5.1 (4.2, 5.9) 7.1 (5.4, 8.9) SRR1301343 1.8 (1.7, 1.8) 4.1 (3.7, 4.6) 5.6 (4.9, 6.6) SRR1301345 2.1 (1.5, 3.4) 5.0 (3 , 12.3) 6.8 (3.3,1 ) SRR1301347 1.3 (1.3, 1.3) 3.6 (3.3, 4.0) 5.3 (4.6, 6.7) SRR1301348 1.8 (1.8, 1.9) 4.3 (3.8, 4.8) 5.9 (4.7, 7.5) SureSelect ERR315871 1.1 (1.0, 1.3) 2.5 (2.1, 3.0) 3.9 (3.2, 4.4) ERR315877 1.2 (1.0, 7.5) 2.7 (2 , 41.2) 3.6 (2.6, 14.6) ERR315885 1.3 (1.2, 1.3) 2.9 (2.7, 3.1) 4 (3.7, 4.2) ERR315904 1.2 (1.1, 1.6) 2.6 (2.3, 3.2) 3.5 (3.2, 4.1) ERR315908 1.6 (1.5, 1.6) 3.8 (3.4, 4.2) 5.3 (4.7, 5.9) TruSeq NA12878-5 1.3 (1.2, 1.3) 3.6 (3.1, 4.3) 5.6 (4.6, 7.3) NA12878-12 1.3 (1.1, 1.5) 3.7 (3.0, 5.1) 6 (4.1, 12.9) NA12878-16 1.4 (1.3, 1.4) 4.2 (3.5, 5.4) 7.2 (5.5, 11.1) NA12878-29 1.3 (1.3, 1.4) 4.1 (3.5, 5.0) 6.5 (4.8, 12.6) NA12878-41 1.3 (1.2, 1.3) 3.6 (3.1, 4.2) 5.5 (4.5, 7.5) Nextera Rapid Capture N701-S3 2.5 (2.1, 3.0) 11.6 (5.7,1) 37.8 ( 8.0,1) N701-S4 2.5 (2.0, 3.3) 11.8 (6.3,1) 46.2 ( 7.5,1) N707-S2 2.2 (1.9, 2.7) 15.7 (5.9,1) 1 ( 9.4,1) N707-S3 2.4 (2.1, 2.7) 27.4 (8.9,1) 1 (34.2,1) N707-S4 2.2 (1.8, 2.9) 13.7 (5.7,1) 1 ( 9.8,1) none of them reach 85%. The results illustrated that the samples from different exome capture systems had different library complexity and eventually required different sequencing efforts to achieve the same criterion on genome coverage. Our method can accurate estimate the genome coverage as the number of sequenced bases increases based on the inital sample from a shallow sequencing experiment, and the optimal sequencing effort can be determined accordingly based on our estimation. 63 4.2.3 Predicting genome coverage for scWGS samples Single-cell sequencing is another popular approach which is specifically designed to study the genetic mutations with resolution at the cell level. Because of the limited DNA materials in a single cell, scWGS requires pre-amplification step to amplify the original DNA and obtain enough DNA materials for the sequencing step. Multiple strategies for pre-amplification have been proposed, ranging from the early exponential amplification method like DOP-PCR [13] and MDA [30, 63], quasi-linear amplification used in MALBAC [118], and the newest linear amplification method LIANTI [16]. 
Due to the pre-amplification, scWGS samples exhibit a large sequencing bias in terms of genome coverage and a high dropout rate for targeted regions, compared with samples from the bulk sequencing, which makes the prediction of genome coverage more challenging. We investigated the prediction of our method for predicting genome coverage for samples generated from scWGS. As before, the genome coverage is measured using the percentage of nucleotides in the genome that is covered by at leastr reads. We usedr = 4; 8; 10 because they are the commonly used thresholds in literature. Twelve scWGS samples, generated using the four protocols in the study of Chen et al. [16], were used for the analysis. The single cell samples are from the BJ cell line, a human diploid cell line in skin fibroblasts. This dataset enabled us to provide a comprehensive evaluation of the prediction accuracy for scWGS experiments. Table 4.5 lists the accession numbers of the samples used in this study. To evaluate the prediction accuracy of our method, we downsampled a subset of 5 M reads from the full data for each of the sample, and used the dataset as the initial sample to predict genome coverage in a large sample, which is 2, 5, 10, and 20 times the size of the initial sample. Here we used a larger initial sample of 5 M reads compared with the size of the initial sample used in the WES studies, because scWGS targets at the whole genome instead of the restricted exome regions, and thus it usually requires much more reads than WES for predictions. In addition, not like the WES samples, we did not remove duplicated reads in scWGS samples and analyzed data using the original statistic r;m (t) developed for the general DNA-sequencing sample. We 64 Table 4.3: Single-cell whole-genome sequencing datasets. Datasets are scWGS of the BJ cell, a human diploid cell line from skin fibroblasts chosen for no aneuploidy. accession id protocol kit SRX2660674 DOP-PCR Sigma SRX2660675 SRX2660676 SRX2660677 MALBAC Yikon SRX2660678 SRX2660679 SRX2660680 MDA Qiagen SRX2660681 SRX2660682 SRX2660683 LIANTI NEBNext Ultra SRX2660684 SRX2660685 then compared the predicted genome coverage and the ground truth, and used the relative error to assess the prediction accuracy of our method. See Methods for the details for processing scWGS sequencing data. The predicted genome coverage in the large sample is accurate in general based on the obser- vations in the initial samples (Figure 4.2). Since each protocol have multiple samples, we used the mean of the absolute relative errors to compare between different protocols. Though the protocols used different pre-amplication methods, the predicted coverage for the data from the four protocols were all generally accurate, with relative errors smaller than 20%. In particular, whenr = 4, the absolute relative errors were smaller than 5% for all four single-cell protocols. The relative error only changed slightly as the relative sample sizet increased. Though the relative errors forr = 8 andr = 10 were overall larger than that forr = 4, the relative errors forr = 8 were all 11% and that forr = 10 were all 20%. Among all four protocols, LIANTI had the largest mean absolute relative errors at all different combinations ofr andt. One possible reason for this is that, since LIANTI uses the linear amplification method, the reads are more likely to uniformly distributed along the genome. 
Our previous results showed that the method had high prediction accuracy when the coverage was heterogeneous. Though the reads from LIANTI cannot be perfectly homogeneously distributed, this may explain why the prediction by the model was less accurate for LIANTI than for the other protocols. The complete table listing the relative errors for all the samples and protocols can be found in Table E.2 in Supplementary Materials.

[Figure 4.2: Relative errors of ν_{r,m}(t) for scWGS datasets. The mean of the absolute relative errors over scWGS samples from the same pre-amplification protocol, for t = 2, 5, 10 and 20 and r = 4, 8 and 10. The predictions were based on ν_{r,m}(t) using initial samples of 5 M reads. The relative error is calculated as the difference between the estimated and the true numbers of nucleotides covered by at least r reads, divided by the true value. (a) r = 4; (b) r = 8; (c) r = 10.]

4.2.4 Optimal number of sequenced bases for scWGS experiments

Recall that one quantity of interest for sequencing experiments is the number of nucleotides in the genome with sufficient coverage. In the ideal case, one could define the optimal sequencing effort as the minimum sequencing depth such that every position in the genome is sufficiently covered. However, this criterion is not useful in practice. Due to frequent dropout events and amplification bias in single-cell libraries, it is impossible to achieve even a relaxed version of this criterion no matter how deeply the library is sequenced. In fact, 9 out of 12 single-cell sequencing datasets in our study did not even have 85% of the genome covered by at least 4 reads. Thus, setting a criterion on α as the goal for scWGS is sometimes not feasible. As the sequencing effort keeps increasing, the number of nucleotides with sufficient coverage increases and then becomes stable. Eventually, the library becomes saturated, and increasing the sequencing depth can barely obtain new nucleotides with sufficient coverage. From the perspective of sequencing efficiency, we therefore considered using the benefit-cost ratio as the criterion for determining how deeply to sequence. Here the benefit-cost ratio is defined as the expected increase in the number of nucleotides with coverage at least r per sequenced base. See Methods for the details of the definition. Intuitively, the ratio is essentially 0 for r > 1 at the very beginning of the sequencing experiment, because it is unlikely for a nucleotide to be covered by r reads given the small number of reads and the long genome. The ratio increases as the sequencing experiment continues, and then starts to decrease once most of the nucleotides have been covered by more than r reads and the nucleotides without sufficient coverage come from regions that are rarely sequenced due to dropout events or amplification bias. Eventually, the ratio approaches 0. To make the best use of sequencing resources, one should stop sequencing when the benefit-cost ratio becomes markedly low. Therefore, we define the optimal number of sequenced bases as the maximum value such that the benefit-cost ratio is greater than a given threshold. Given the initial sample, we used ν_{r,m}(t) to estimate the number of nucleotides covered by at least r reads as a function of sequencing effort t. The number of bases to sequence was then computed as tN, where N is the number of sequenced bases in the initial sample.
We chose a threshold of the benefit-cost ratio, and obtained the point on the curve of r;m (t) where the benefit-cost ratio (the derivative) is equal to the threshold. Then the obtained value is the optimal number of bases to sequence in order to avoid wasting resources when obtaining new nucleotides becomes inefficient. Table 4.4 shows the estimated optimal numbers of bases to sequence when the benefit-cost ratio was set as 0.01, based on a shallow sequencing experiment of 5 M reads. In general, the optimal numbers of sequenced bases were similar for samples from the same protocols and were different for samples from different protocols. In addition, the optimal numbers of sequenced bases increased as the value ofr increased. For example, in the dataset of SRX2660685, the number of bases that should be sequenced was 61.0 billion, which had an average sequencing depth around 61:0=3 20 atr = 4. Asr increased to 10, the average depth increased to 89:5=3 30. Among all the four protocols, LIANTI had the largest optimal sequencing size, and followed by MDA, MALBAC, and DOP-PCR. This is because LIANTI had the least dropout rate, which will be discussed in the following section. In other words, the benefit-ratio for LIANTI decreased slowly compared with that for other protocols, suggesting the high complexity and low saturation of the library in LIANTI. On ther other hand, small values of the optimal sequencing size indicated the early stop in sequencing experiment to save resources because the library was highly biased and further sequencing hardly gained new nucleotides. Note that all these estimates were based on aligned reads. In order to transform the number of bases from the aligned reads to the number of bases in the raw reads, one should adjust these values by dividing the mapping rate. For example, the mapping rate of the dataset SRX2660685 is approximately 82.1% based on the initial 5M reads. Thus, the optimal number of sequences for the raw reads should be 74.3 billion (24.8) forr = 4 and 109.0 billion (36.3) forr = 10. 68 Table 4.4: Optimal number of sequenced bases for scWGS experiments. The estimated optimal number of sequenced bases in the scWGS experiment based on an initial sample of 5 M reads. Here the optimal number was determined as the maximum number of sequenced bases with its benefit-cost ratio greater than 0.01. The minimum coverages arer = 4 andr = 10. The values in the parentheses are 95% confidence interval estimated from the Bootstrap. 
protocol | accession id | optimal sequenced bases (G), r = 4 | r = 8 | r = 10
DOP-PCR | SRX2660674 | 19.3 (14.6, 25.3) | 18.912 (16.058, 23.437) | 20.6 (17.6, 24.3)
DOP-PCR | SRX2660675 | 23.1 (22.9, 23.4) | 20.541 (17.186, 25.498) | 20.3 (19.8, 19.8)
DOP-PCR | SRX2660676 | 21.7 (21.0, 22.3) | 28.054 (28.267, 27.868) | 30.1 (29.8, 30.3)
MALBAC | SRX2660677 | 42.7 (41.5, 43.8) | 54.677 (55.536, 53.36) | 57.7 (56.2, 59.3)
MALBAC | SRX2660678 | 44.2 (43.6, 44.7) | 55.426 (54.956, 55.824) | 58.1 (56.7, 59.6)
MALBAC | SRX2660679 | 46.0 (43.1, 46.8) | 62.383 (53.83, 67.899) | 60.0 (47.8, 83.6)
MDA (Qiagen) | SRX2660680 | 52.3 (51.6, 53.0) | 68.746 (66.898, 70.268) | 74.4 (73.4, 75.7)
MDA (Qiagen) | SRX2660681 | 51.7 (51.6, 57.7) | 63.293 (54.997, 69.693) | 56.9 (46.0, 83.2)
MDA (Qiagen) | SRX2660682 | 49.8 (47.7, 55.5) | 65.79 (59.245, 73.767) | 70.8 (69.9, 72.3)
LIANTI | SRX2660683 | 57.3 (54.5, 60.9) | 77.072 (76.38, 77.767) | 84.4 (81.8, 87.2)
LIANTI | SRX2660684 | 52.1 (51.2, 65.2) | 77.904 (82.622, 74.708) | 83.1 (78.8, 87.0)
LIANTI | SRX2660685 | 61.0 (56.5, 65.8) | 83.409 (82.311, 83.869) | 89.5 (84.3, 92.5)

4.2.5 Dropout for scWGS and uniformity of coverage

The dropout rate and the uniformity of genome coverage in scWGS are important quantities for evaluating the quality of a sequencing library. Since the DNA material in a single cell is very limited, scWGS requires pre-amplification in order to obtain enough DNA material for sequencing. Due to the non-uniformity of the pre-amplification procedure, single-cell sequencing often exhibits a high dropout rate and a large variation in genome coverage. We are especially interested in assessing the dropout rates and the uniformity of genome coverage of the four scWGS protocols, based on the predicted curves for the estimated number of nucleotides with sufficient coverage as a function of the number of sequenced bases. As the sequencing effort keeps increasing, the number of nucleotides with sufficient coverage increases and then becomes stable. A higher number of nucleotides with sufficient coverage reflects a more complete library and a smaller dropout rate. Ideally, if the number of nucleotides covered by reads equals the total number of nucleotides in the genome, the dropout rate is zero. Thus, we predicted the number of nucleotides with sufficient coverage as the number of sequenced bases increased, and then compared the four scWGS protocols in terms of their dropout rates. As a control, we obtained the curve for the bulk sequencing data, generated by subsampling the full dataset. We also constructed a curve for the ideal case in which reads are randomly generated from the genome according to a homogeneous Poisson process. Although it is very difficult to approximate the dropout rate accurately, it can be visually assessed by the closeness between the predicted curves for the scWGS protocols and that of the ideal case.

Figure 4.3(a) shows the curves for the estimated number of nucleotides with coverage at least r = 4 as a function of the number of sequenced bases. Since there were multiple samples from each of the four protocols, we averaged the curves over the samples from the same protocol. The bulk sequencing showed superior uniformity of coverage compared with the single-cell sequencing in this case: the curve from the bulk sequencing closely tracked that of the ideal case. The curves from single-cell sequencing, on the other hand, showed discrepancies among the different single-cell protocols.
When sequencing 50 billion bases, the number of nucleotides covered by at least 4 reads for the LIANTI protocol was around 2.61 billion, the highest among all four protocols (Figure 4.3(a)). Since the human genome is around 3 billion base pairs, LIANTI was able to cover over 87% of the positions in the human genome. The curve of LIANTI was the closest among the four protocols to that of the bulk sequencing, suggesting the lowest dropout rate. Meanwhile, the number of nucleotides with sufficient coverage differed significantly among the scWGS protocols. The second best scWGS protocol, MDA, covered about 1.93 billion nucleotides when the same number of bases was sequenced. The difference between LIANTI and MDA was 0.67 billion bases, which accounts for around 22.3% of the genome length. MALBAC and DOP-PCR had only about 1.5 and 1 billion nucleotides covered by at least 4 reads, respectively, with the same number of sequenced bases. Ranking all protocols by the number of nucleotides with sufficient coverage, it was clear that LIANTI had the lowest dropout rate, followed by MDA, MALBAC and DOP-PCR.

[Figure 4.3: The number of nucleotides in the human genome covered by at least r = 4 reads as a function of sequenced bases for scWGS datasets. For each scWGS sample, the curves were estimated based on an initial sample of 5 M reads; the mean was taken over samples from the same protocol with r = 4. For the bulk sequencing, the curve was generated by subsampling the full dataset. The dashed line was based on simulated data from a homogeneous Poisson process, assuming reads are uniformly distributed along the genome; this curve represents the ideal case of a uniform sequencing experiment. (a) The curves for extrapolated values of genome coverage from 0 to 50 billion sequenced bases. (b) A zoom-in of the curves from 0 to 10 billion sequenced bases.]

The low dropout rate of LIANTI is a consequence of the linear pre-amplification method used in the scWGS experiment. The linear amplification largely reduces the amplification bias and the variation of coverage along the genome, resulting in a more uniform library than the non-linear amplification protocols. Thus, LIANTI covered the largest number of nucleotides in the genome compared with the other three protocols. Besides r = 4, we also examined the results for r = 8 and 10. The trend did not change, and LIANTI always had the highest number of nucleotides among all the protocols. Our observation of the dropout rates of the four protocols was consistent with that in the original study [16]. Therefore, our method accurately assessed the four protocols in terms of their dropout rates, using predictions based on an initial sample of 5 M reads, which was less than 5% of the original data.

When the number of sequenced bases was less than 4 billion, the curve for single-cell sequencing was higher than that for the bulk sequencing (Figure 4.3(b)), which indicates the higher amplification bias of single-cell libraries compared with that of the bulk sequencing.
This is because, if there is no bias in generating reads in a library, at the beginning of the sequencing experiment, the read should uniformly distributed along the long genome, resulting in few nucleotides covered by at leastr = 4 times. Within single-cell protocols, the curve for LIANTI was lower than that for other protocols when the number of sequenced bases was less than 4 billion, suggesting that LIANTI exhibited better uniformity of coverage than the other single-cell protocols. Note that the order of the curves from the highest to the lowest was opposite between the small and the large sequencing samples. The crossover between the curves of single-cell sequencing and bulk sequencing happened when the number of bases sequenced were around 4-8 billion. After that, the curve for the bulk sequencing was always above the curves for single-cell sequencing. This observation of crossover suggests that the protocol which has the largest number of nucleotides with sufficient coverager > 1 is not simply and necessarily the one with the best uniformity of the library. Only when sequencing depth is large enough, the above statement can be true. The researchers need to be cautious when they infer the library uniformity by simply mapping the reads to the genome and count the number of nucleotides with sufficient coverage. If the sample size is small, one can draw an opposite wrong conclusion. Therefore, our method provides a useful tool to extrapolate the curves for a large sequencing sample, and then the correct inference on the library uniformity can be easily drawn. 4.3 Methods 4.3.1 Whole-exome sequencing datasets and data processing Whole-exome sequencing datasets were selected from four independent projects. For each project we chose 5 deeply sequenced datasets as a representative (Table 4.1). In the first project, DNA 72 from whole-blood in the Simons Simplex Collection (SSC) was subjected to the exome capture using the Nimblegen Custom array (SeqCap EZ Exome v2) and sequenced with 74 bp paired- end reads using Illumina GAIIx [93]. For the second project, DNA from the flow-sorted cell populations from patients with chronic lymphocytic leukemia (CLL) were subjected to the exome capture using the SureSelect Human All Exon (SureSelect V4 Mb All Exon kit) [29]. The library was sequenced with 100 bp paired-end reads on the Illumina HiSeq 2000. In the third and fourth projects, DNA from Coriell sample NA12878 was subjected to the exome capture using Illumina’s Nextera Exome Enrichment (Nextera Rapid Capture Exome v1.2) and TruSeq Exome Library Preparation kit respectively, and was then sequenced using a HiSeq 4000 instrument. To extract the number of nucleotides covered by j reads in the targeted regions, for j = 1; 2;::: , we first mapped each dataset to the human genome reference GRCh38/hg38 using BWA with default parameters [64]. We then used Picard tools [9] to sort mapped reads and removed potentially duplicated reads, which were defined as reads mapped to the same locations. Samtools [65] was used to obtain the numberN j of nucleotides in the targeted regions covered byj uniquely aligned reads. For WES capture kits designed for hg19, we used the tool Liftover [59] to lift the genome annotations to hg38. 4.3.2 Single-cell whole-genome sequencing datasets and data processing The datasets were from the study by Chen et el. [16] and were downloaded through NCBI with accession numbers and sequencing protocols in Table 4.5. 
The dataset includes samples from four major single-cell sequencing protocols, namely DOP-PCR, MDA, MALBAC, LIANTI. DOP- PCR (Degenerate Oligonucleotide-Primed Polymerase Chain Reaction)[13] and MDA [30, 63] (Multiple-Displacement Amplification) use the exponential pre-amplification approach. MAL- BAC (Multiple Annealing and Looping-Based Amplification Cycles) uses a quasi-linear pre- amplification approach [118]. LIANTI (Linear Amplification via Transposon Insertion), which is the state of the art, uses a linear pre-amplification approach [16]. 73 Table 4.5: Single-cell whole-genome sequencing datasets. Datasets are scWGS of the BJ cell, a human diploid cell line from skin fibroblasts chosen for no aneuploidy. accession id protocol kit SRX2660674 DOP-PCR Sigma SRX2660675 SRX2660676 SRX2660677 MALBAC Yikon SRX2660678 SRX2660679 SRX2660680 MDA Qiagen SRX2660681 SRX2660682 SRX2660683 LIANTI NEBNext Ultra SRX2660684 SRX2660685 The data was preprocessed as follows. In order to extract the number of nucleotides in the genome covered by j reads, for j = 1; 2;:::, we first used lianti trim [16] to trim off Illumina sequencing adapters, inline barcodes, binding sequences and T7 promoter sequences. Then reads were mapped to human genome reference GRCh38/hg38 using BWA with default parameters [64]. We used Picard tools [9] to sort the mapped reads. The programmpileup in Samtools was used to obtain the numberN j of base pairs in the genome coveredj reads [65]. 4.3.3 Predicting nucleotides covered at leastr unique reads in WES experi- ments Estimating the number of nucleotides covered by least r uniquely mapped reads in the targeted exome regions as a function of sequencing effort requires two steps. The first step is to estimate the function of U(t) which is the number of uniquely mapped reads as a function of sequencing effortt. The second step is to predict the number of nucleotides in the targeted regions covered by at leastr uniquely mapped reads as a function of sequencing effort. Estimators in both steps can 74 be obtained through the same initial experiment using the estimator proposed by [32]. However, the statistics summarized from the initial experiment are different in the two steps. In the first step, we treat reads as individuals and unique reads as species. Based on the initial experiment, the numberN j of unique reads represented exactlyj times is counted forj = 1; 2;:::. Using the counts statisticsN j , we apply r;m (t) withr = 1 to estimate the number of unique reads as a function of sequencing effort. We denote the estimator as U(t). LetN u = P 1 j=1 N j denote the number of unique reads in the initial experiment. The effective sequencing effort is calculated as U(t)=N u . In the second step, we count the numberN 0 j of nucleotides in the targeted regions covered by j unique reads in the initial experiment forj = 1; 2;:::. Here each sequenced base from unique reads is treated as an individual and each nucleotide in the targeted regions is treated as a species. Based onN 0 j , we apply the RFA approach to obtain the number of base pairs in the targeted regions with coverage at leastr as a function of effective sequencing effort. We denote this estimator as r;m (t 0 ), where the variablet 0 is the effective sequencing effort. 
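The two sets of counts just described can be tabulated directly from the mapped reads. Below is a toy, single-end sketch of both counting steps; the positions, read length, and the identification of duplicates by identical mapping position are simplifications of the actual Picard/Samtools pipeline described above:

```python
from collections import Counter

# hypothetical mapped read start positions from the initial sample
read_starts = [100, 100, 100, 250, 250, 400, 700, 700, 900]
read_len = 50

# Step 1: unique reads as "species"; N[j] = number of distinct mapped
# positions observed exactly j times (duplicates share a position here).
dup_counts = Counter(read_starts)            # position -> multiplicity
N = Counter(dup_counts.values())             # j -> count of unique reads seen j times
N_u = sum(N.values())                        # number of unique reads

# Step 2: nucleotides as "species"; N_prime[j] = number of positions
# covered by exactly j unique (deduplicated) reads.
depth = Counter()
for start in dup_counts:                     # keep one copy of each unique read
    for pos in range(start, start + read_len):
        depth[pos] += 1
N_prime = Counter(depth.values())

print(dict(N), N_u, dict(N_prime))
```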
Combining the results from the above two steps, the number of base pairs in the targeted regions with coverage at least r, as a function of sequencing effort, can be estimated as

\nu^{u}_{r,m}(t) = \nu_{r,m}\big(U(t)/N_u\big). \qquad (4.1)

4.3.4 Estimating the benefit-cost ratio

The benefit-cost ratio measures the increase in the number of nucleotides that are sufficiently covered if one additional base is sequenced. Assume the estimator \nu_{r,m}(t) can accurately predict the number of nucleotides covered by at least r reads. The benefit-cost ratio can then be approximated as

\nu_{r,m}(t + 1/N) - \nu_{r,m}(t),

where N is the number of sequenced bases in the initial experiment. Since N is in general large, accurate calculation of this difference depends on the specific implementation of the numerical procedure. To improve numerical stability, we instead use the average increase in the number of nucleotides with coverage at least r per sequenced base to approximate the ratio. The benefit-cost ratio is estimated as

\frac{\nu_{r,m}(t + N_b/N) - \nu_{r,m}(t)}{N_b},

where N_b is a large positive constant referring to a number of sequenced bases.

4.4 Conclusion and Discussion

In this chapter, we developed a procedure to predict the optimal amount of sequencing based on a shallow sequencing experiment. The goal is to estimate how much to sequence in order to guarantee that the genome is covered by a sufficient number of reads. Given the shallow sequencing data, we predicted the number of nucleotides with sufficient coverage as a function of the sequencing effort. We compared our predictions with the true values and evaluated the relative errors of our method. The relative errors were in general low for all samples from the different WES capture systems and scWGS protocols, when sufficient coverage was defined as a nucleotide being covered by at least r ≤ 10 reads. Based on the predicted function, the method was then used to determine the optimal number of bases to sequence under two criteria: one requiring that the majority of the genome be sufficiently covered, and the other ensuring sequencing efficiency based on the benefit-cost ratio.

Our method can play important roles in many different kinds of sequencing experiments. For large sequencing projects, such as the international 1000 Genomes Project and the UK 10,000 Genomes Project, one crucial question is how to balance the number of individuals in the study against the sequencing depth for each individual. Under a fixed budget, scientists would like to sequence as many individuals as possible to enhance the statistical power of the study, while still making sure that the sequencing of each sample is deep enough to guarantee accurate mutation inferences.
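As a schematic illustration of the benefit-cost criterion defined in Section 4.3.4 above, the sketch below approximates the ratio with the finite difference given there and scans a grid for the largest effort at which the ratio still exceeds the 0.01 threshold; the saturating curve and the read length used to set N are hypothetical stand-ins for an actual predicted coverage curve, not the implementation used in this chapter:

```python
import numpy as np

def benefit_cost_ratio(curve, t, n_initial_bases, n_b=1e6):
    """Average gain in sufficiently covered nucleotides per sequenced base:
    (curve(t + N_b/N) - curve(t)) / N_b, with N the bases in the initial sample."""
    dt = n_b / n_initial_bases
    return (curve(t + dt) - curve(t)) / n_b

def optimal_effort(curve, n_initial_bases, threshold=0.01,
                   t_grid=np.linspace(1, 200, 2000)):
    """Largest relative effort t on the grid whose benefit-cost ratio exceeds the threshold."""
    ratios = np.array([benefit_cost_ratio(curve, t, n_initial_bases) for t in t_grid])
    above = t_grid[ratios > threshold]
    return float(above.max()) if above.size else None

# toy saturating curve standing in for a predicted coverage curve,
# with an initial sample of 5 M reads of length 150 bp
toy_curve = lambda t: 2.5e9 * (1.0 - np.exp(-0.05 * t))
print(optimal_effort(toy_curve, n_initial_bases=5e6 * 150))
```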
Our study showed that evaluating the number of nucleotides with sufficient coverage in a shallow sequencing sample can lead to a wrong conclusion for the library quality. Only by deep sequencing, the library quality can be accurately assessed. However, deep sequencing is time-consuming and resource intensive, and the following data processing steps also takes much long time. Our method provides an efficient tool for accurately predict the number of nucleotides with sufficient coverage in a deep sequencing sample, based on the data from a shallow sequencing experiment. The results showed that the assessment of the uniformity of the four scWGS protocols using our prediction based on shallow sequencing was consistent with the conclusion in the original paper using the deep sequencing. With the help of our method, researchers can evaluate the newly proposed method at a early stage without wasting time and money for a deeper sequencing. The method will largely saves the sequencing resources and speed up the development process. Determining the optimal depth of scWGS libraries is challenging because of the variation of coverage and the high dropout rate. It is possible that no matter how many reads are sequenced, the library may not achieve a predefined coverage, e.g., 85% of genomic positions with coverage at least 10. We proposed a criterion to obtain the optimal sequencing depth based on the benefit- cost ratio, the increase in the number of nucleotides with sufficient coverage per additional base to sequence. For a fixed value ofr, the optimal number of sequences depend on the complexity of a 77 sequencing library. There is no universal rule that can be applied to all libraries. For instance, if the library has high complexity, such as LIANTI and MDA in this case, the benefit-cost ratio decreases slowly and sequencing deeper help to discover more positions in the genome. In contrast, if the library has low complexity, e.g., DOP-PCR, one should stop sequencing earlier to save resources. The choice ofr depends on the purpose of the study and it relies on the experience of experimental designers. Once the value ofr is determined, our method can be easily applied to determine the desirable depth for the sequencing experiment. To account for the step of removing duplicated reads in WES data processing, we developed a novel estimator u r;m (t) which estimates the number of uniquely mapped reads as a function of sequencing effort. The new estimator first estimated a function to transform the total number of mapped reads to the number of uniquely mapped reads. Then the number of uniquely mapped reads was predicted as the sample size increased. Given this new statistic, one can easily predict as the number of sequenced bases increases, the number of nucleotides covered by at leastr uniquely reads. The predictions for genome coverage were in general accurate with means of the absolute relative errors smaller than 21.6% for both WES and scWGS samples, wheret was evaluated at 2, 5, 10, and 20, andr was evaluated at 4, 8, and 10. Though we cannot evaluate the prediction accuracy for a larger t using the current samples, previous evaluations using simulated data and other real data have shown the accurate prediction of our method fort as large as 100 [32]. For r > 10, we evaluated the relative errors using the current dataset, and we found that the relative errors increased asr increased. 
This phenomenon is possibly due to the fact that the estimator for the number of nucleotides covered by at leastr reads was derived from the (r 1)-th derivative of the estimator for the number of nucleotides covered by at least 1 read. Whenr is large, there would be a large variation in the estimation due to the noise accumulated from (r 1) times of derivative. Although many experiments use paired reads, our experience suggests to treat them as single- end reads when applying our method. The SNV calling is based on the actual reads that align to the 78 nucleotides in the genome. If the position is in the gap between two paired reads, this alignment does not provide the information to identify the nucleotide in this position. If the position is in the overlap of two paired reads, the coverage of this position is counted as 2. In either way, it is equivalent to treating them as single-end reads. As a summary, this study proposed an reliable procedure for optimizing sequencing depth in a DNA sequencing experiment. Based on an initial sample from a shallow sequencing experiment, our method can accurately predict the number of nucleotides in the genome with sufficient cov- erage as the sequencing effort increases. The optimal sequencing depth can be then determined based on the prediction. In addition, the library uniformity can also be inferred from the predicted genome coverage for a deep sequencing sample. The proposed procedure will play important roles in optimizing sequencing experiments and allocating recources in the big genome projects, as well as speeding up the development process of new sequencing methods. 79 Reference List [1] Baayen RH (2001) Word Frequency Distributions Kluwer, Dordrecht. [2] Baker GA (1975) Essentials of Pad´ e Approximants Academic, New York. [3] Baker GA, Graves-Morris PR (1996) Pad´ e Approximants Cambridge University Press, Cam- bridge. [4] Bao R et al. (2014) Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer informatics 13s2. [5] Birney E, Soranzo N (2015) The end of the start for population sequencing. Nature 526:52–53. [6] Boneh S, Boneh A, Caron RJ (1998) Estimating the prediction function and the number of unseen species in sampling with replacement. Journal of the American Statistical Associa- tion 93:372–379. [7] Bradnam KR et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2:10. [8] Britanova OV et al. (2014) Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. The Journal of Immunology 192:2689–2698. [9] Broad Institute (2013) Picard Tools http://broadinstitute.github.io/picard. [10] Bulmer MG (1974) On fitting the Poisson lognormal distribution to species-abundance data. Biometrics 30:101–110. [11] Bunge J, Fitzpatrick M (1993) Estimating the number of species: A review. Journal of the American Statistical Association 88:364–373. [12] Burrell QL, Fenton MR (1993) Yes, the gigp really does workand is workable! Journal of the American Society for Information Science 44:61–69. [13] Carter NP et al. (1992) Degenerate oligonucleotide-primed pcr: general amplification of target dna by a single degenerate primer. Genomics 13:718–725. [14] Chao A, Lee SM (1992) Estimating the number of classes via sample coverage. Journal of the American statistical Association 87:210–217. 80 [15] Chao A, Shen TJ (2004) Nonparametric prediction in species sampling. 
Journal of Agricul- tural, Biological, and Environmental Statistics 9:253–269. [16] Chen C et al. (2017) Single-cell whole-genome analyses by linear amplification via transpo- son insertion (lianti). Science 356:189–194. [17] Chen K, Pachter L (2005) Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS computational biology 1:e24. [18] Chilamakuri CSR et al. (2014) Performance comparison of four exome capture systems for deep sequencing. BMC Genomics 15:1–14. [19] Choi M et al. (2009) Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences 106:19096–19101. [20] Cohen AC (1960) Estimating the parameters of a modified Poisson distribution. Journal of the American Statistical Association 55:139–143. [21] Collins FS, Varmus H (2015) A new initiative on precision medicine. New England Journal of Medicine 372:793–795 PMID: 25635347. [22] Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Philosophical Transactions of the Royal Society of London B: Biological Sciences 345:101–118. [23] Colwell RK, Mao CX, Chang J (2004) Interpolating, extrapolating, and comparing incidence- based species accumulation curves. Ecology 85:2717–2727. [24] Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29:987–991. [25] Consortium IHGS (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921. [26] Daley T (2014) Non-parametric models for large capture-recapture experiments with appli- cations to DNA sequencing Ph.D. diss., University of Southern California. [27] Daley T, Smith AD (2013) Predicting the molecular complexity of sequencing libraries. Nature Methods 10:325–327. [28] Daley T, Smith AD (2014) Modeling genome coverage in single-cell sequencing. Bioinfor- matics 30:3159–3165. [29] Damm F et al. (2014) Acquired initiating mutations in early hematopoietic cells of CLL patients. Cancer Discovery 4:1088–1101. [30] Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y , Du J et al. (2002) Comprehensive human genome amplification using multiple displacement amplification. Proceedings of the National Academy of Sciences 99:5261–5266. 81 [31] Deng C, Daley T, Smith AD (2015) Applications of species accumulation curves in large- scale biological data analysis. Quantitative Biology 3:135–144. [32] Deng C, Smith AD (2016) Estimating the number of species to attain sufficient representation in a random sample. arXiv preprint arXiv:1607.02804 . [33] Ding L et al. (2012) Clonal evolution in relapsed acute myeloid leukaemia revealed by whole- genome sequencing. Nature 481:506–510. [34] Efron B, Thisted R (1976) Estimating the number of unsen species: How many words did shakespeare know? Biometrika 63:435–447. [35] Efron B, Tibshirani RJ (1994) An Introduction to the Bootstrap Chapman & Hall, London. [36] Engen S (1978) Stochastic Abundance Models Chapman and Hall, London. [37] Evert S, Baroni M (2007) zipfR: Word frequency distributions in R In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pp. 29–32, Prague, Czech Republic. [38] Feller W (1968) An Introduction to Probability Theory and Its Applications, V ol. 1, 3rd ed John Wiley & Sons. [39] Fisher RA, Corbet AS, Williams CB (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. 
Journal of Animal Ecology 12:42–58. [40] Freeman JD et al. (2009) Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome research 19:1817–1824. [41] Fujimoto A et al. (2012) Whole-genome sequencing of liver cancers identifies etiological influences on mutation patterns and recurrent mutations in chromatin regulators. Nature Genet- ics 44:760–764. [42] Gawad C, Koh W, Quake SR (2014) Dissecting the clonal origins of childhood acute lym- phoblastic leukemia by single-cell genomics. Proceedings of the National Academy of Sci- ences 111:17947–17952. [43] Gawad C, Koh W, Quake SR (2016) Single-cell genome sequencing: current state of the science. Nature Reviews Genetics 17:175–188. [44] Gilewicz J, Pindor M (1997) Pad´ e approximants and noise: A case of geometric series. Journal of Computational and Applied Mathematics 87:199–214. [45] Good IJ, Toulmin GH (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43:45–63. 82 [46] Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40:237–264. [47] Gotelli NJ, Colwell RK (2001) Quantifying biodiversity: Procedures and pitfalls in the mea- surement and comparison of species richness. Ecology Letters 4:379–391. [48] Gragg WB (1972) The Pad´ e table and its relation to certain algorithms of numerical analysis. SIAM Rev. 14:1–62. [49] Greenwood M, Yule GU (1920) An inquiry into the nature of frequency distributions repre- sentative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. Journal of the Royal Statistical Society 83:255–279. [50] Hart, Michael (1971) Project Gutenberg https://www.gutenberg.org. [51] Harwit M, Hildebrand R (1986) How many more discoveries in the universe? Nature 320:724–726. [52] Hilbe JM (2011) Negative Binomial Regression Cambridge University Press, Cambridge. [53] Holdridge LR et al. (1971) Forest environments in tropical life zones: A pilot study . [54] Huberman B, Romero D, Wu F (2008) Social networks that matter: Twitter under the micro- scope. First Monday 14. [55] Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJM (2001) Counting the uncountable: Statistical approaches to estimating microbial diversity. Applied and Environmental Microbiol- ogy 67:4399–4406. [56] Inc. I BaseSpace basespace.illumina.com. [57] Ionita-Laza I, Lange C, Laird NM (2009) Estimating the number of unseen variants in the human genome. Proceedings of the National Academy of Sciences 106:5008–5013. [58] Kalinin VM (1965) Functionals related to the Poisson distribution, and statistical structure of a text. Proceedings of the Steklov Institute of Mathematics 79:6–19. [59] Kent WJ et al. (2010) Bigwig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26:2204–2207. [60] Kroes I, Lepp PW, Relman DA (1999) Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences 96:14547–14552. [61] Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2:231–239. [62] Laydon DJ et al. (2014) Quantification of HTLV-1 clonality and TCR diversity. PLoS Com- putational Biology 10:e1003646. 83 [63] Leung K, Klaus A, Lin BK, Laks E, Biele J, Lai D, Bashashati A, Huang YF, Aniba R, Moksa M et al. (2016) Robust high-performance nanoliter-volume single-cell multiple dis- placement amplification on planar substrates. 
Proceedings of the National Academy of Sci- ences 113:8484–8489. [64] Li H, Durbin R (2009) Fast and accurate short read alignment with BurrowsWheeler trans- form. Bioinformatics 25:1754–1760. [65] Li H et al. (2009) The sequence alignment/map format and SAMtools. Bioinformat- ics 25:2078–2079. [66] Lindsay BG (1983) The geometry of mixture likelihoods: A general theory. Ann. Statist. 11:86–94. [67] Lindsay BG (1995) Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics 5:i–163. [68] Macosko E et al. (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161:1202–1214. [69] Magurran AE (1988) Ecological Diversity and Its Measurement Princeton University Press, Princeton, U.S.A. [70] Magurran AE (2013) Measuring Biological Diversity John Wiley & Sons. [71] Mandelbrot BB (1977) Fractals: Forms, Chance and Dimension Freeman, San Francisco. [72] Marc ¸ais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences ofk-mers. Bioinformatics 27:764–770. [73] Meienberg J et al. (2016) Clinical sequencing: is wgs the better wes? Human genet- ics 135:359–362. [74] Messaoudi I et al. (2002) Direct link between mhc polymorphism, T cell avidity, and diversity in immune defense. Science 298:1797–1800. [75] Meyer F et al. (2008) The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386. [76] Mora C et al. (2011) How many species are there on earth and in the ocean? PLoS Biol 9:e1001127. [77] Muir P et al. (2016) The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology 17:53. [78] Navin N et al. (2011) Tumour evolution inferred by single-cell sequencing. Nature 472:90–94. 84 [79] Newman ME (2005) Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46:323–351. [80] Ng SB et al. (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272–276. [81] Ng SB et al. (2010) Exome sequencing identifies the cause of a mendelian disorder. Nature Genetics 42:30–35. [82] Norris JL, Pollock KH (1998) Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environmental and Ecological Statis- tics 5:391–402. [83] Palmer MW (1990) The estimation of species richness by extrapolation. Ecol- ogy 71:1195–1198. [84] Pearman PB, Weber D (2007) Common species determine richness patterns in biodiversity indicator taxa. Biological Conservation 138:109–119. [85] Pevzner PA, Tang H, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences 98:9748–9753. [86] Pouya Razavi (2013) Genohub https://genohub.com. [87] Preston FW (1948) The commonness, and rarity, of species. Ecology 29:254–283. [88] Puente XS et al. (2011) Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475:101–105. [89] Ren J et al. (2016) Inference of markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics 32:993–1000. [90] Robins HS et al. (2009) Comprehensive assessment of T-cell receptor-chain diversity in T cells. Blood 114:4099–4107. [91] Rutishauser H (1954) Der quotienten-differenzen-algorithmus. Z. angew. Math. Physik 5:233–251. [92] Sampford MR (1955) The truncated negative binomial distribution. Biometrika 42:58–69. 
[93] Sanders SJ et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485:237–241.
[94] Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74:5463–5467.
[95] Sarin S et al. (2008) Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nature Methods 5:865–867.
[96] Sboner A et al. (2011) The real cost of sequencing: higher than you think! Genome Biology 12:125.
[97] Sichel HS (1975) On a distribution law for word frequencies. Journal of the American Statistical Association 70:542–547.
[98] Sims D et al. (2014) Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics 15:121–132.
[99] Soberón MJ, Llorente BJ (1993) The use of species accumulation functions for the prediction of species richness. Conservation Biology 7:480–488.
[100] Solow AR, Polasky S (1999) A quick estimator for taxonomic surveys. Ecology 80:2799–2803.
[101] Soon WW, Hariharan M, Snyder MP (2013) High-throughput sequencing for biology and medicine. Molecular Systems Biology 9.
[102] Tarazona S et al. (2011) Differential expression in RNA-seq: A matter of depth. Genome Research 21:2213–2223.
[103] Tennessen JA et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337:64–69.
[104] Van Der Heijden PG, Cruyff M, Van Houwelingen HC (2003) Estimating the size of a criminal population from police records using the truncated Poisson regression model. Statistica Neerlandica 57:289–304.
[105] Venables B, Hornik K, Maechler M (2016) polynom: A collection of functions to implement a class for univariate polynomial manipulations.
[106] Venter JC et al. (2001) The sequence of the human genome. Science 291:1304–1351.
[107] Wang JP (2010) Estimating species richness by a Poisson-compound gamma model. Biometrika 97:727–740.
[108] Wang JPZ, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. Journal of the American Statistical Association 100:942–959.
[109] Wang K et al. (2014a) Whole-genome sequencing and comprehensive molecular profiling identify new driver mutations in gastric cancer. Nature Genetics 46:573–582.
[110] Wang Y et al. (2014b) Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512:155–160.
[111] Wedderburn LR et al. (2001) The developing human immune system: T-cell receptor repertoire of children and young adults shows a wide discrepancy in the frequency of persistent oligoclonal T-cell expansions. Immunology 102:301–309.
[112] Yatsunenko T et al. (2012) Human gut microbiome viewed across age and geography. Nature 486:222–227.
[113] Yu KH, Snyder M (2016) Omics profiling in precision oncology. Molecular & Cellular Proteomics 15:2525–2536.
[114] Zafarani R, Liu H (2009) Social computing data repository at ASU. http://socialcomputing.asu.edu.
[115] Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821–829.
[116] Zipf GK (1935) The Psycho-biology of Language. Houghton Mifflin, Boston.
[117] Zipf GK (1949) Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge.
[118] Zong C, Lu S, Chapman AR, Xie XS (2012) Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338:1622–1626.
[119] Zou J et al. (2016) Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications 7:13293.
Appendix A

Proofs

A.1 Proof of Theorem 3.1

THEOREM 3.1. Let S_r(t) denote the number of species represented at least r times for t units of time. For any positive integer r,

E[S_r(t)] = (-1)^{r-1} \frac{t^r}{(r-1)!} \frac{d^{r-1}}{dt^{r-1}} \left( \frac{E[S_1(t)]}{t} \right).

Proof. The theorem is trivially satisfied for r = 1, so we assume r >= 2. By equation (3.1) and linearity of expectation,

E[S_r(t)] = E[S_1(t)] - \sum_{j=1}^{r-1} E[N_j(t)].

Equation (3.2) shows how to eliminate the E[N_j(t)] and write E[S_r(t)] in terms of derivatives of E[S_1(t)]:

E[S_r(t)] = E[S_1(t)] - \sum_{j=1}^{r-1} \frac{(-1)^{j-1} t^j}{j!} \frac{d^j}{dt^j} E[S_1(t)] = \sum_{j=0}^{r-1} \frac{(-t)^j}{j!} \frac{d^j}{dt^j} E[S_1(t)].   (A.1)

Extracting a factor of t^r from the summands results in

E[S_r(t)] = t^r \sum_{j=0}^{r-1} \frac{(-1)^j}{j!} \frac{1}{t^{r-j}} \frac{d^j}{dt^j} E[S_1(t)].

Note that 1/t^{r-j} can be expanded as

\frac{1}{t^{r-j}} = \frac{(-1)^{r-1-j}}{(r-1-j)!} \frac{d^{r-1-j}}{dt^{r-1-j}} \frac{1}{t},

which suggests a substitution in equation (A.1). Eliminating the factor 1/t^{r-j} and introducing the (r-1-j)th derivative of 1/t, we arrive at

t^r \sum_{j=0}^{r-1} \frac{(-1)^j}{j!} \frac{1}{t^{r-j}} \frac{d^j}{dt^j} E[S_1(t)]
= t^r \sum_{j=0}^{r-1} \frac{(-1)^j}{j!} \frac{(-1)^{r-1-j}}{(r-1-j)!} \frac{d^j}{dt^j} E[S_1(t)] \, \frac{d^{r-1-j}}{dt^{r-1-j}} \frac{1}{t}
= \frac{(-1)^{r-1} t^r}{(r-1)!} \sum_{j=0}^{r-1} \binom{r-1}{j} \frac{d^j}{dt^j} E[S_1(t)] \, \frac{d^{r-1-j}}{dt^{r-1-j}} \frac{1}{t}.

Finally, by noticing that the above summation has the form of the general Leibniz rule,

E[S_r(t)] = \frac{(-1)^{r-1} t^r}{(r-1)!} \sum_{j=0}^{r-1} \binom{r-1}{j} \frac{d^j}{dt^j} E[S_1(t)] \, \frac{d^{r-1-j}}{dt^{r-1-j}} \frac{1}{t} = (-1)^{r-1} \frac{t^r}{(r-1)!} \frac{d^{r-1}}{dt^{r-1}} \left( \frac{E[S_1(t)]}{t} \right).
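As a quick sanity check of Theorem 3.1 (not part of the original proof), the identity can be verified symbolically for any finite Poisson mixture. The sketch below assumes a two-component mixture with made-up rates and weights and uses sympy to compare E[S_r(t)] computed directly from the mixture with the value obtained by differentiating E[S_1(t)]/t as the theorem prescribes.

    # Symbolic check of Theorem 3.1 on an assumed two-component Poisson mixture.
    import sympy as sp

    t = sp.symbols('t', positive=True)
    L = 100                                       # assumed number of species
    weights = [sp.Rational(1, 2), sp.Rational(1, 2)]
    rates = [sp.Rational(1, 2), sp.Integer(3)]    # assumed rates: one rare group, one abundant group
    r = 3

    # Direct form: E[S_r(t)] = L * sum_g w_g * Pr(Poisson(rate_g * t) >= r).
    def expected_S(r):
        return L * sum(w * (1 - sum((lam * t) ** k / sp.factorial(k) * sp.exp(-lam * t)
                                    for k in range(r)))
                       for w, lam in zip(weights, rates))

    lhs = expected_S(r)
    rhs = (-1) ** (r - 1) * t ** r / sp.factorial(r - 1) * sp.diff(expected_S(1) / t, t, r - 1)
    print(sp.simplify(sp.expand(lhs - rhs)))      # prints 0, matching the theorem

The difference simplifies to zero for any r and any finite mixture one plugs in, which is the content of the theorem under the mixed-Poisson model.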
A.2 Proofs related to Theorem 3.2

In order to prove Theorem 3.2, we require two lemmas. In Lemma 3.1, we establish the power series representation for the average discovery rate. In Lemma 3.2 we give a sufficient condition for the existence of the Padé approximant to Φ(t), the power series estimator of the average discovery rate defined in equation (3.5).

LEMMA 3.1. If 0 < t < 2, then

\frac{E[S_1(t)]}{t} = \sum_{i=0}^{\infty} (-1)^i (t-1)^i E[S_{i+1}].

Proof. The expected number of species for t units of time is

E[S_1(t)] = L \int (1 - e^{-\lambda t}) \, dG(\lambda) = L \int (1 - e^{-\lambda(t-1)} e^{-\lambda}) \, dG(\lambda).

Replacing exp(-\lambda(t-1)) with its power series yields

E[S_1(t)] = L \int \left( 1 - \sum_{i=0}^{\infty} \frac{(-\lambda(t-1))^i}{i!} e^{-\lambda} \right) dG(\lambda)
= L \int (1 - e^{-\lambda}) \, dG(\lambda) - \sum_{i=1}^{\infty} (-1)^i (t-1)^i L \int \frac{\lambda^i}{i!} e^{-\lambda} \, dG(\lambda).

Note that the expected number of species in the initial sample is

E[S_1] = L \int (1 - e^{-\lambda}) \, dG(\lambda)

and the expected number of species observed j times in the initial sample is

E[N_j] = L \int \frac{\lambda^j e^{-\lambda}}{j!} \, dG(\lambda).

Therefore, the expected value of S_1(t) can be expressed as

E[S_1(t)] = E[S_1] + \sum_{i=1}^{\infty} (-1)^{i-1} (t-1)^i E[N_i].

The expansion of 1/t, at t = 1, is

\frac{1}{t} = \frac{1}{1 + (t-1)} = 1 - (t-1) + (t-1)^2 - \cdots

for 0 < t < 2. Replacing both E[S_1(t)] and 1/t with their power series yields

\frac{E[S_1(t)]}{t} = \left( E[S_1] + \sum_{i=1}^{\infty} (-1)^{i-1} (t-1)^i E[N_i] \right) \left( \sum_{i=0}^{\infty} (-1)^i (t-1)^i \right)
= E[S_1] + \sum_{i=1}^{\infty} (-1)^i (t-1)^i \left( E[S_1] - \sum_{j=1}^{i} E[N_j] \right)
= \sum_{i=0}^{\infty} (-1)^i (t-1)^i E[S_{i+1}].

The following lemma shows the existence of the Padé approximant to Φ(t). The conditions used in the lemma deviate from the classic result [48, Theorem 3.3] in two ways. First, the condition Δ_{m,m} ≠ 0 is added to ensure that the leading coefficient in the denominator of the Padé approximant is nonzero. This condition is not required for proving the existence of the Padé approximant, which is shown in the first part of the proof; it is used to prove b_m ≠ 0, which is needed in Theorem 3.2. Second, the entries of Δ_{i,j} are not exactly the coefficients of Φ(t), but their absolute values. In the second part of the proof, we show that the two forms of the determinant are equal.

LEMMA 3.2. If the determinants Δ_{m-1,m} and Δ_{m,m} are nonzero, there exist real numbers a_i and b_j for i = 0, ..., m-1 and j = 1, 2, ..., m, with b_m ≠ 0, such that the rational function

\frac{P_{m-1}(t)}{Q_m(t)} = \frac{a_0 + a_1 (t-1) + \cdots + a_{m-1} (t-1)^{m-1}}{1 + b_1 (t-1) + \cdots + b_m (t-1)^m}

satisfies

\Phi(t) - \frac{P_{m-1}(t)}{Q_m(t)} = O\big( (t-1)^{2m} \big),

where the determinant Δ_{i,j} is defined in (3.7) and the truncated power series Φ(t) is defined in equation (3.5). Further, all a_i and b_j are uniquely determined by S_1, S_2, ..., S_{2m}.

Proof. For the sake of convenience, we use b_0 to denote 1. Since the denominator Q_m(t) is nonzero, multiplying both sides of expression (3.8) by Q_m(t) turns the expression into

P_{m-1}(t) - \Phi(t) Q_m(t) = O\big( (t-1)^{2m} \big).

Substituting expression (3.5) for Φ(t), the above condition is equivalent to two systems of equations:

b_m (-1)^0 S_1 + b_{m-1} (-1)^1 S_2 + \cdots + b_0 (-1)^m S_{m+1} = 0
b_m (-1)^1 S_2 + b_{m-1} (-1)^2 S_3 + \cdots + b_0 (-1)^{m+1} S_{m+2} = 0
\vdots
b_m (-1)^{m-1} S_m + b_{m-1} (-1)^m S_{m+1} + \cdots + b_0 (-1)^{2m-1} S_{2m} = 0,   (A.2)

and

a_0 = (-1)^0 S_1
a_1 = (-1)^1 S_2 + b_1 (-1)^0 S_1
\vdots
a_{m-1} = (-1)^{m-1} S_m + \sum_{i=1}^{m-1} b_i (-1)^{m-i-1} S_{m-i}.   (A.3)

Clearly, once the b_j are determined for j = 1, 2, ..., m, the values of a_j for j = 0, 1, ..., m-1 can be calculated through the linear system (A.3). Thus the existence of the rational function P_{m-1}(t)/Q_m(t) depends solely on the solution of the linear system (A.2). By Cramer's rule, the linear system (A.2) has a unique solution if and only if the following determinant is nonzero:

\begin{vmatrix}
(-1)^0 S_1 & (-1)^1 S_2 & \cdots & (-1)^{m-1} S_m \\
(-1)^1 S_2 & (-1)^2 S_3 & \cdots & (-1)^m S_{m+1} \\
\vdots & \vdots & \ddots & \vdots \\
(-1)^{m-1} S_m & (-1)^m S_{m+1} & \cdots & (-1)^{2m-2} S_{2m-1}
\end{vmatrix}_{m \times m}.   (A.4)

We can simplify the form of this determinant. Using the permutation definition of a determinant, (A.4) can be expressed as

\sum_{\sigma} \mathrm{sgn}(\sigma) \, a_{1,\sigma(1)} a_{2,\sigma(2)} \cdots a_{m,\sigma(m)},

where σ is a permutation of the set {1, 2, ..., m} and a_{i,j} denotes the (i,j) entry of (A.4), namely (-1)^{i+j} S_{i+j-1}. We can write the sum as

\sum_{\sigma} \mathrm{sgn}(\sigma) \, a_{1,\sigma(1)} \cdots a_{m,\sigma(m)} = \sum_{\sigma} \mathrm{sgn}(\sigma) \, (-1)^{\sum_{i=1}^m (i + \sigma(i))} \, |a_{1,\sigma(1)}| \cdots |a_{m,\sigma(m)}|.

Since the sum of the σ(i) is equal to the sum of the i, we have (-1)^{\sum_{i=1}^m (i + \sigma(i))} = 1. As a result, the determinant (A.4) is equal to

\Delta_{m-1,m} =
\begin{vmatrix}
S_1 & S_2 & \cdots & S_m \\
S_2 & S_3 & \cdots & S_{m+1} \\
\vdots & \vdots & \ddots & \vdots \\
S_m & S_{m+1} & \cdots & S_{2m-1}
\end{vmatrix}_{m \times m}.

Therefore, the Padé approximant to the power series Φ(t) exists provided Δ_{m-1,m} ≠ 0. From the systems (A.2) and (A.3), it is clear that the a_i and b_j are uniquely determined by S_r for 1 ≤ r ≤ 2m. Note that b_m can be solved as (-1)^m Δ_{m,m} / Δ_{m-1,m}, which is nonzero by assumption.

Based on Lemmas 3.1 and 3.2, we propose our estimator for the r-SAC in Theorem 3.2 of the main text, together with its proof. Recall that in that proof we assume the denominators of all rational functions of interest have simple roots. Here we relax this constraint, so that a root of the denominator of a rational function may have multiplicity greater than 1. In general, the Padé approximant P_{m-1}(t)/Q_m(t) can be represented as

\frac{P_{m-1}(t)}{Q_m(t)} = \sum_i \left( \frac{c_{i1}}{t - x_i} + \frac{c_{i2}}{(t - x_i)^2} + \cdots + \frac{c_{i k_i}}{(t - x_i)^{k_i}} \right),

where x_i is a root of Q_m(t) = 0 with multiplicity k_i and the k_i sum to m. The estimator Ψ_{r,m}(t) can then be expressed as

\Psi_{r,m}(t) = \sum_i \left[ \binom{r-1}{r-1} \frac{c_{i1} t^r}{(t - x_i)^r} + \binom{r}{r-1} \frac{c_{i2} t^r}{(t - x_i)^{r+1}} + \cdots + \binom{r-2+k_i}{r-1} \frac{c_{i k_i} t^r}{(t - x_i)^{r-1+k_i}} \right].

In particular, if k_i = 1 for all i, that is, if all roots of Q_m(t) = 0 are simple, the estimator Ψ_{r,m}(t) can be expressed as (3.9). In real applications we rarely encounter a Q_m(t) with a repeated root. Therefore, unless we say otherwise, we assume the denominators of the rational function approximations have only simple roots, and we use Ψ_{r,m}(t) to denote the estimator of the form

\Psi_{r,m}(t) = \sum_{i=1}^m c_i \left( \frac{t}{t - x_i} \right)^r.
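To make the construction above concrete, the following sketch solves the linear systems (A.2) and (A.3) with numpy for given counts S_1, ..., S_2m, locates the roots of Q_m, and evaluates the simple-root form of Ψ_{r,m}(t). It is an illustration only: the function name, the example counts, and the assumption that the computed roots are simple are placeholders, not the software described in this work.

    # Illustrative construction of the rational-function estimator from counts S_1..S_2m.
    import numpy as np

    def rsac_estimate(S, m, r, t):
        """S[j-1] holds S_j for j = 1, ..., 2m; returns the estimate of E[S_r(t)]."""
        S = np.asarray(S, dtype=float)
        # System (A.2): rows k = 0..m-1, unknowns b_1..b_m, with the b_0 = 1 term moved right.
        A = np.zeros((m, m))
        rhs = np.zeros(m)
        for k in range(m):
            for i in range(1, m + 1):
                A[k, i - 1] = (-1.0) ** (k + m - i) * S[k + m - i]   # coefficient of b_i
            rhs[k] = -((-1.0) ** (k + m)) * S[k + m]
        b = np.linalg.solve(A, rhs)                                  # b[i-1] = b_i
        # System (A.3): numerator coefficients a_0..a_{m-1}.
        a = np.array([(-1.0) ** l * S[l]
                      + sum(b[i - 1] * (-1.0) ** (l - i) * S[l - i] for i in range(1, l + 1))
                      for l in range(m)])
        # P and Q are polynomials in u = t - 1; the poles in t are x_i = u_i + 1 and the
        # residues at simple poles are c_i = P(u_i) / Q'(u_i).
        P = np.polynomial.Polynomial(a)
        Q = np.polynomial.Polynomial(np.concatenate(([1.0], b)))
        u = Q.roots()
        c = P(u) / Q.deriv()(u)
        x = u + 1.0
        return float(np.real(np.sum(c * (t / (t - x)) ** r)))

    # Assumed counts S_1..S_4 (so m = 2): predict the two-fold representation at t = 5.
    print(rsac_estimate([1000, 400, 180, 90], m=2, r=2, t=5.0))

With m = 1 the same code reduces to the closed form S_1^2/S_2 (t/(t + (S_1 - S_2)/S_2))^r that appears later in the proof of Proposition 3.1, and at t = 1 the output reproduces the observed S_r, consistent with Proposition 3.1(i).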
A.3 Proof of Proposition 3.1

PROPOSITION 3.1. The following hold for the estimator Ψ_{r,m}(t) of Theorem 3.2:
(i) The estimator Ψ_{r,m}(t) is unbiased for E[S_r(t)] at t = 1 for r ≤ 2m.
(ii) The estimator Ψ_{r,m}(t) converges as t approaches infinity. In particular,

\lim_{t \to \infty} \Psi_{r,m}(t) = \Delta_{m-1,m+1} / \Delta_{m,m}.

(iii) Assume that an initial sample is obtained after trapping for t_0 units of time, where t_0 belongs to {1, 2, ...}. The estimator Ψ_{r,m}(t) based on this sample is strongly consistent as t_0 goes to infinity.

Proof. (i) The result follows directly from Theorem 3.2.

(ii) As t goes to infinity, the ratio t/(t - x_i) goes to 1, so Ψ_{r,m}(t) converges to \sum_i c_i for all r. Recall that the Padé approximant to the power series Φ(t) can be expressed as

\frac{P_{m-1}(t)}{Q_m(t)} = \sum_{i=1}^m \frac{c_i}{t - x_i}

by equation (3.13). Thus \sum_i c_i equals the limit of t P_{m-1}(t)/Q_m(t) as t goes to infinity. Using the determinant representation of P_{m-1}(t)/Q_m(t) in equation (3.6), we immediately obtain that the sum of the c_i is equal to Δ_{m-1,m+1}/Δ_{m,m}.

(iii) Let A_{t_0} denote the event that S_k < L after trapping for t_0 units of time, and let

A(i.o.) = { A_{t_0} occurs for infinitely many t_0 }.

In the following we show Pr(A(i.o.)) = 0; with probability 1 only a finite number of the events A_{t_0} occur. According to our modeling assumptions, the probability that species i is observed fewer than k times in the initial sample is

q_{i,k,t_0} = \sum_{l=0}^{k-1} \frac{(\lambda_i t_0)^l}{l!} \exp(-\lambda_i t_0).   (A.5)

Since

Pr(A_{t_0}) = Pr\left( \bigcup_{i=1}^L \{ \text{species } i \text{ is observed fewer than } k \text{ times} \} \right) \le \sum_{i=1}^L Pr(\text{species } i \text{ is observed fewer than } k \text{ times}) = \sum_{i=1}^L q_{i,k,t_0},

the sum of the Pr(A_{t_0}) satisfies

\sum_{t_0=1}^{\infty} Pr(A_{t_0}) \le \sum_{t_0=1}^{\infty} \sum_{i=1}^L q_{i,k,t_0} = \sum_{i=1}^L \sum_{t_0=1}^{\infty} q_{i,k,t_0}.

For any given i, one can verify that \sum_{t_0=1}^{\infty} q_{i,k,t_0} < \infty. As the sum of the Pr(A_{t_0}) is finite, the Borel-Cantelli lemma gives Pr(A(i.o.)) = 0. By definition S_k ≤ S_{k-1} ≤ \cdots ≤ S_1 ≤ L. So if we take k = 2m, with probability 1 there exists an N_0 such that S_j = L for all j ≤ 2m whenever t_0 > N_0.

Next we show that the estimator Ψ_{r,m}(t) equals L whenever S_j = L for all j ≤ 2m and t > 0. In the special case m = 1, the estimator can be checked directly:

\Psi_{r,1}(t) = \frac{S_1^2}{S_2} \left( \frac{t}{t + (S_1 - S_2)/S_2} \right)^r = L.

For m > 1, we must establish that the Padé approximant P_{m-1}(t)/Q_m(t) reduces to L/t, so that

\Psi_{r,m}(t) = (-1)^{r-1} \frac{t^r}{(r-1)!} \frac{d^{r-1}}{dt^{r-1}} \frac{L}{t} = L.

Let P_{m-1}(t)/Q_m(t) denote the Padé approximant to Φ(t) with numerator degree m-1 and denominator degree m,

\frac{P_{m-1}(t)}{Q_m(t)} = \frac{a_0 + a_1 (t-1) + \cdots + a_{m-1} (t-1)^{m-1}}{1 + b_1 (t-1) + \cdots + b_m (t-1)^m},

where the a_l and b_k are real numbers. Recall that the Padé approximant exists exactly when both systems of equations (A.2) and (A.3) have solutions with b_0 = 1. When S_j = L for j ≤ 2m, the system (A.2) degenerates to the single equation

b_m - b_{m-1} + \cdots + b_0 (-1)^m = 0.   (A.6)

Given any values for b_1, b_2, ..., b_{m-1}, the values of a_0, a_1, ..., a_{m-1} and b_m are uniquely determined by (A.3) and (A.6). Based on the system of equations (A.3), the polynomial P_{m-1}(t) can be expressed as

a_0 + a_1 (t-1) + \cdots + a_{m-1} (t-1)^{m-1}
= b_0 L + (-b_0 L + b_1 L)(t-1) + \cdots + \left( b_0 (-1)^{m-1} L + \sum_{k=1}^{m-1} b_k (-1)^{m-1-k} L \right) (t-1)^{m-1}.

Terms that include the factor b_0 can be collected as

b_0 \left( L - L(t-1) + \cdots + (-1)^{m-1} L (t-1)^{m-1} \right) = b_0 L \, \frac{1 - (-1)^m (t-1)^m}{1 - (-(t-1))} = \frac{L}{t} \left( b_0 - b_0 (-1)^m (t-1)^m \right).

Similarly, for terms including a factor b_k, with k = 1, 2, ..., m-1,

b_k \left( L (t-1)^k - L (t-1)^{k+1} + \cdots + (-1)^{m-k-1} L (t-1)^{m-1} \right) = b_k (t-1)^k L \, \frac{1 - (-1)^{m-k} (t-1)^{m-k}}{t} = \frac{L}{t} \left( b_k (t-1)^k - (-1)^{m-k} b_k (t-1)^m \right).

Thus,

a_0 + a_1 (t-1) + \cdots + a_{m-1} (t-1)^{m-1} = \frac{L}{t} \left( b_0 + b_1 (t-1) + \cdots + b_{m-1} (t-1)^{m-1} - \sum_{k=0}^{m-1} b_k (-1)^{m-k} (t-1)^m \right).

Equation (A.6) provides the following simplification:

- \sum_{k=0}^{m-1} b_k (-1)^{m-k} (t-1)^m = b_m (t-1)^m.

Therefore, the numerator polynomial P_{m-1}(t) can be written as

a_0 + a_1 (t-1) + \cdots + a_{m-1} (t-1)^{m-1} = \frac{L}{t} \left( b_0 + b_1 (t-1) + \cdots + b_m (t-1)^m \right).

The rational function P_{m-1}(t)/Q_m(t) is then

\frac{P_{m-1}(t)}{Q_m(t)} = \frac{L}{t},

which holds for any choice of b_1, b_2, ..., b_{m-1}. As a result, the estimator becomes

\Psi_{r,m}(t) = (-1)^{r-1} \frac{t^r}{(r-1)!} \frac{d^{r-1}}{dt^{r-1}} \frac{L}{t} = L

whenever S_1 = S_2 = \cdots = S_{2m} = L and t > 0.

Let B_{t_0} be the event that Ψ_{r,m}(t) ≠ L using random samples generated by trapping for t_0 units of time, and let B(i.o.) be the event that B_{t_0} occurs for infinitely many t_0. Since B(i.o.) ⊆ A(i.o.) and Pr(A(i.o.)) = 0, it follows that Pr(B(i.o.)) = 0. Therefore Ψ_{r,m}(t) converges almost surely to L as t_0 goes to infinity, for any t > 0.
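Proposition 3.1(ii) is easy to see numerically with the sketch given after Lemma 3.2 (again with assumed counts): evaluating the estimator at a very large t gives the same value for every r, namely the sum of the c_i, which plays the role of the implied species-richness estimate.

    # Assuming the rsac_estimate sketch above and the same made-up counts:
    for r in (1, 2, 5):
        print(r, rsac_estimate([1000, 400, 180, 90], m=2, r=r, t=1e6))
    # All three values agree (about 2778 here), the common limit of Proposition 3.1(ii).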
Appendix B

The power series estimator of the r-species accumulation curve

In the main text we obtained the power series estimator Φ(t), defined by equation (3.5), for the average discovery rate. Instead of forming the Padé approximant to the average discovery rate, we could apply Φ(t) directly to obtain a power series estimator for the r-SAC. In this section we derive the form of this power series estimator and discuss the issues associated with it.

We use Φ(t) and formula (3.3) to derive a power series estimator for the r-SAC. Substituting the power series Φ(t) for E[S_1(t)]/t in equation (3.3) and taking the (r-1)th derivative leads to the desired form:

\Psi_r(t) = t^r \sum_{i=0}^{\infty} (-1)^i (t-1)^i \binom{r-1+i}{r-1} S_{r+i}.   (B.1)

In particular, when r = 1,

\Psi_1(t) = t \sum_{i=0}^{\infty} (-1)^i S_{i+1} (t-1)^i = \sum_{i=0}^{\infty} (-1)^i S_{i+1} (t-1)^i + \sum_{i=0}^{\infty} (-1)^i S_{i+1} (t-1)^{i+1} = S_1 + \sum_{i=1}^{\infty} (-1)^{i-1} (t-1)^i N_i,

which is the Good-Toulmin estimator [45]. Since the observed S_j are unbiased estimates of E[S_j], the following proposition is straightforward.

PROPOSITION B.1. For any r ≥ 1,

\Psi_r(t) = t^r \sum_{i=0}^{\infty} (-1)^i (t-1)^i \binom{r-1+i}{r-1} S_{r+i}   (B.2)

is an unbiased estimator for E[S_r(t)].

In practice Ψ_r(t) is a truncated power series. Even if we assume that the error in Ψ_r(t) is acceptable when t is close to 1, we have no basis for anticipating acceptable behavior when t > 2. The radius of convergence for the expected value of Ψ_r(t) is 1, even though the quantity E[S_r(t)] itself is well defined. For any fixed r, the factor \binom{r-1+i}{r-1} in (B.2) grows polynomially with i and can add extreme weight to each successive term S_{i+r}. This can cause large errors in the predictions when the observations S_{i+r} in the initial sample are used to estimate their expectations E[S_{i+r}]. Let j_max be the largest j such that the observed S_j > 0, so that S_{j'} = 0 for all j' > j_max. For any r ≤ j_max the estimator Ψ_r(t) is a polynomial of degree j_max. This function will tend to oscillate up to a point where the term of largest degree starts to dominate. Depending on whether the number j_max - r is even or odd, the estimate from Ψ_r(t) may become very large, or may even take negative values.

The problems with the truncated power series motivate us to use a rational function approximation, in particular the Padé approximant.
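The behavior described above is easy to reproduce. The sketch below, which uses assumed counts rather than any dataset from this work, evaluates the truncated series (B.2) and shows the estimate behaving sensibly near t = 1 and then blowing up once t moves past 2.

    # Truncated power-series estimator of (B.2) on assumed counts S_1..S_6.
    from math import comb

    def power_series_rsac(S, r, t):
        """S[j-1] holds the observed S_j; the series truncates at the largest observed j."""
        return t ** r * sum((-1) ** i * (t - 1) ** i * comb(r - 1 + i, r - 1) * S[r + i - 1]
                            for i in range(len(S) - r + 1))

    S = [1000, 400, 180, 90, 50, 30]
    for t in (1.0, 1.5, 2.0, 3.0, 5.0):
        print(t, power_series_rsac(S, r=2, t=t))

With these made-up counts the r = 2 estimate grows modestly up to about t = 2 and then jumps by orders of magnitude, the oscillation-then-domination pattern described above.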
The rational function approximation often gives better estimates than the truncated power series when estimating an alternating power series; it may even converge when the Taylor series fails to converge. The following example is taken from [2]:

f(t) = \left( \frac{1 + 2t}{1 + t} \right)^{1/2} = 1 + \frac{1}{2} t - \frac{5}{8} t^2 + \frac{13}{16} t^3 - \frac{141}{128} t^4 + \cdots.

This Taylor series diverges for t > 1/2, even though the function f(t) is well defined over [0, ∞). In particular, f(t) → \sqrt{2} as t → ∞. Through a change of variable equating t = w/(1 - 2w), we can rewrite

f(t) = (1 - w)^{-1/2} = 1 + \frac{1}{2} w + \frac{3}{8} w^2 + \frac{5}{16} w^3 + \frac{35}{128} w^4 + \cdots.   (B.3)

In terms of the original variable t, the expression (B.3) induces a sequence of rational function approximations to f(t):

1, \quad \frac{1 + (5/2) t}{1 + 2t}, \quad \frac{1 + (9/2) t + (43/8) t^2}{(1 + 2t)^2}, \quad \ldots

As t approaches infinity, these successive approximations of f(t) converge to \sqrt{2}.

Appendix C

Generalizing estimators of the SAC to the r-SAC

Formula (3.3) in Theorem 3.1 provides a powerful tool for generalizing existing estimators of the SAC into estimators of the r-SAC. All estimators below are derived by substituting an estimator of the SAC for E[S_1(t)] in formula (3.3), except for the logseries estimator, which is implied by [39]. The first two parametric estimators have been discussed in Section 3.1; we give the results directly, without derivation.

Zero-truncated Poisson estimator. The number of species E[S_r(t)] represented at least r times can be estimated as

\hat{E}[S_r(t)] = \frac{S_1}{1 - e^{-\lambda}} \left( 1 - \sum_{i=0}^{r-1} \frac{(\lambda t)^i}{i!} e^{-\lambda t} \right),   (C.1)

where λ satisfies

\frac{\lambda}{1 - e^{-\lambda}} = \frac{\sum_{i=1}^{\infty} i N_i}{S_1}

and is the MLE of the zero-truncated Poisson distribution.

Zero-truncated negative binomial estimator. The number of species E[S_r(t)] represented at least r times can be estimated as

\hat{E}[S_r(t)] = \frac{S_1}{1 - (1 + \beta)^{-\alpha}} \left( 1 - \sum_{i=0}^{r-1} \frac{\Gamma(i + \alpha)}{\Gamma(i + 1)\,\Gamma(\alpha)} \left( \frac{\beta t}{1 + \beta t} \right)^i \left( \frac{1}{1 + \beta t} \right)^{\alpha} \right),   (C.2)

where α and β are fit by an expectation-maximization algorithm, with unobserved counts as the missing data.

The estimator of Boneh, Boneh and Caron (1998). We refer to this estimator as BBC. The expected number of species E[S_1(t)] represented at least once is estimated as

\hat{E}[S_1(t)] = S_1 + \sum_{i=1}^{\infty} N_i e^{-i} \left( 1 - e^{-i(t-1)} \right) + U \left( e^{-N_1/U} - e^{-N_1 t / U} \right),

where t ≥ 1 and U is the solution of the equation

U \left( 1 - e^{-N_1/U} \right) = \sum_{i=1}^{\infty} N_i e^{-i},

provided the condition N_1 > \sum_{i=1}^{\infty} N_i e^{-i} is satisfied [6]. The number of species E[S_r(t)] represented at least r times can be estimated as

\hat{E}[S_r(t)] = (-1)^{r-1} \frac{t^r}{(r-1)!} \frac{d^{r-1}}{dt^{r-1}} \left( \frac{S_1 + \sum_{i=1}^{\infty} N_i e^{-i} (1 - e^{-i(t-1)}) + U (e^{-N_1/U} - e^{-N_1 t/U})}{t} \right)
= S_1 + \sum_{i=1}^{\infty} N_i \left( e^{-i} - \sum_{k=0}^{r-1} \frac{(it)^k}{k!} e^{-it} \right) + U \left( e^{-N_1/U} - \sum_{k=0}^{r-1} \frac{(N_1 t / U)^k}{k!} e^{-N_1 t / U} \right).

The estimator of Chao and Shen (2004). We refer to this estimator as CS. The expected number of species E[S_1(t)] represented at least once is estimated as

\hat{E}[S_1(t)] = S_1 + N_0 \left( 1 - e^{-N_1 (t-1)/N_0} \right),

where t ≥ 1 [15]. Details for estimating N_0 can be found in [14]. The number of species E[S_r(t)] represented at least r times can be estimated as

\hat{E}[S_r(t)] = (-1)^{r-1} \frac{t^r}{(r-1)!} \frac{d^{r-1}}{dt^{r-1}} \left( \frac{S_1 + N_0 (1 - e^{-N_1 (t-1)/N_0})}{t} \right) = S_1 + N_0 \left( 1 - \sum_{i=0}^{r-1} \frac{(N_1 t / N_0)^i}{i!} e^{-N_1 (t-1)/N_0} \right).
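As one concrete instance of these substitutions, the zero-truncated Poisson form (C.1) takes only a few lines to compute. The sketch below is an illustration with assumed frequency counts; it uses scipy.optimize.brentq to solve the maximum-likelihood equation for the rate.

    # Zero-truncated Poisson version of the r-SAC, following (C.1); assumed inputs.
    from math import exp, factorial
    from scipy.optimize import brentq

    def ztp_rsac(N, r, t):
        """N[j-1] holds N_j, the number of species seen exactly j times in the initial sample."""
        S1 = sum(N)                                   # species observed at least once
        mean_count = sum((j + 1) * n for j, n in enumerate(N)) / S1
        # MLE of the zero-truncated Poisson rate: lam / (1 - exp(-lam)) = mean_count.
        lam = brentq(lambda x: x / (1.0 - exp(-x)) - mean_count, 1e-8, 100.0)
        tail = 1.0 - sum((lam * t) ** i / factorial(i) * exp(-lam * t) for i in range(r))
        return S1 / (1.0 - exp(-lam)) * tail

    print(ztp_rsac([600, 250, 90, 40, 20], r=2, t=5.0))   # assumed N_1..N_5

The negative binomial version (C.2) follows the same pattern, with the Poisson tail replaced by the negative binomial tail and the single rate replaced by parameters fit with an EM algorithm.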
The logseries estimator. The logseries estimator for E[S_r(t)] is based on results by [39]. According to that article, the expected number of species represented j times in t units of sampling effort can be estimated as

\frac{\alpha x_t^j}{j}, \qquad \text{where} \qquad x_t = \frac{Nt}{\alpha + Nt}.

The parameter α is estimated by solving the equation

S_1 = \alpha \log\left( 1 + \frac{N}{\alpha} \right).

The number of species represented at least r times in t units of sampling effort is then approximated by

\hat{E}[S_r(t)] = \sum_{i=r}^{\infty} \frac{\alpha x_t^i}{i} = \alpha \int_0^{x_t} \frac{x^{r-1}}{1 - x} \, dx.

Appendix D

Sample coverage

Assume a sample of N individuals is drawn from a population. Under our statistical assumptions, the numbers of individuals observed for each species in the sample follow a multinomial distribution

\mathrm{Mult}(N; p_1, p_2, \ldots, p_L).

The probability that an individual belongs to species i is

p_i = \frac{\lambda_i}{\sum_{i=1}^L \lambda_i}.

It is clear that the p_i sum to 1. The sample coverage is defined as

C(p) = \sum_{i=1}^L p_i \, \delta_i,   (D.1)

where δ_i indicates whether or not species i belongs to the sample: if species i belongs to the sample, δ_i = 1; otherwise δ_i = 0. For a random sample of size N, the probability that species i is not in the sample is (1 - p_i)^N. Therefore the expectation of the sample coverage can be expressed as

E[C(p)] = \sum_{i=1}^L p_i \left( 1 - (1 - p_i)^N \right) = 1 - \sum_{i=1}^L p_i (1 - p_i)^N.
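The closed form just derived can be checked by simulation. The sketch below, with an assumed abundance vector, draws multinomial samples with numpy and compares the Monte Carlo average of C(p) with 1 - sum_i p_i (1 - p_i)^N.

    # Monte Carlo check of the expected sample coverage formula; assumed abundances.
    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.4, 0.3, 0.2, 0.05, 0.03, 0.02])   # assumed relative abundances (sum to 1)
    N = 3                                             # sample size, chosen so that N + 1 < L

    closed_form = 1.0 - np.sum(p * (1.0 - p) ** N)

    draws = rng.multinomial(N, p, size=200_000)       # each row: species counts for one sample
    coverage = ((draws > 0) * p).sum(axis=1)          # realized C(p) for each sample
    print(closed_form, coverage.mean())               # the two values agree closely

Replacing p with the uniform vector, while keeping N + 1 < L, yields a smaller expected coverage than the skewed vector above, which is exactly what Proposition D.1 below establishes.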
PROPOSITION D.1. For a random sample of fixed size N, the expectation of the sample coverage C(p) attains its minimum value when all p_i are equal to 1/L, provided N + 1 < L.

Proof. Our proof contains two parts. First we show that there exists only one extreme point of E[C(p)], namely p_1 = p_2 = \cdots = p_L = 1/L. Second, we prove that the minimum value cannot be attained on the boundary. Therefore the extreme point must correspond to the minimum value of E[C(p)].

Recall that the expectation of the sample coverage can be expressed as

E[C(p)] = 1 - \sum_{i=1}^L p_i (1 - p_i)^N,

where 0 ≤ p_i ≤ 1 for i = 1, 2, ..., L and \sum_{i=1}^L p_i = 1. To obtain the extreme values of E[C(p)], we apply the method of Lagrange multipliers:

f_1(p_1, p_2, \ldots, p_L, \mu) = 1 - \sum_{i=1}^L p_i (1 - p_i)^N + \mu \left( \sum_{i=1}^L p_i - 1 \right).

Setting the partial derivative of f_1 with respect to p_i to zero gives

\frac{\partial f_1}{\partial p_i} = -(1 - p_i)^N + N p_i (1 - p_i)^{N-1} + \mu = 0

for i = 1, 2, ..., L. For convenience of analysis, we write the above equation as

(1 - p_i)^N - N p_i (1 - p_i)^{N-1} = \mu.   (D.2)

Now we analyze the roots of equation (D.2). Let g_1 be the function

g_1(x) = (1 - x)^N - N x (1 - x)^{N-1}.

The derivative of g_1 is

\frac{d}{dx} g_1(x) = -N (1 - x)^{N-1} - N (1 - x)^{N-1} + N(N-1) x (1 - x)^{N-2} = N (1 - x)^{N-2} \left( (N+1) x - 2 \right).

From the derivative, the function g_1 is decreasing for x in [0, 2/(N+1)] and then increasing for x in [2/(N+1), 1]. At the endpoints, g_1(0) > 0, g_1(1) = 0 and g_1(2/(N+1)) < 0. Therefore equation (D.2) has at most one root if μ > 0 and at most two roots if μ ≤ 0. We consider these two cases separately.

For the case μ > 0, since equation (D.2) has at most one root, all p_i must equal that solution. Because the p_i sum to 1, we immediately obtain

p_1 = p_2 = \cdots = p_L = 1/L   (D.3)

and

\mu = \left( 1 - \frac{1}{L} \right)^N - \frac{N}{L} \left( 1 - \frac{1}{L} \right)^{N-1},   (D.4)

if the extreme point exists. To confirm that this is a valid solution, we need to show μ > 0. Using the condition N + 1 < L, we have

\mu = \left( 1 - \frac{1}{L} \right)^N - \frac{N}{L} \left( 1 - \frac{1}{L} \right)^{N-1} = \left( 1 - \frac{1}{L} \right)^{N-1} \left( 1 - \frac{N+1}{L} \right) > 0.

Thus when μ > 0, the expectation of the sample coverage C(p) has exactly one extreme point. Now consider the case μ ≤ 0, in which equation (D.2) could have two roots. Note that g_1(1/(N+1)) = 0, so g_1(x) > 0 when x is between 0 and 1/(N+1). If p_i is a solution of equation (D.2), it must therefore satisfy

p_i \ge \frac{1}{N+1}.

Suppose there exists an extreme point (p_1, p_2, ..., p_L) when μ ≤ 0. One should then have

\sum_{i=1}^L p_i \ge \frac{L}{N+1}.

By the condition N + 1 < L, we obtain \sum_{i=1}^L p_i > 1, which contradicts the fact that the sum must be 1. Therefore no extreme points exist when μ ≤ 0.

Combining the two cases μ > 0 and μ ≤ 0, we conclude that (D.3) is the only extreme point of E[C(p)]. At this point the value of E[C(p)] is

1 - \left( 1 - \frac{1}{L} \right)^N.   (D.5)

Next we use mathematical induction to show that the minimum value of E[C(p)] cannot be attained on the boundary for L ≥ 2. As a result, the extreme value (D.5) is indeed the minimum value of E[C(p)]. The statement is trivial when L = 2. Assume that when L = k, the minimum value of E[C(p)] does not lie on the boundary, so E[C(p)] must attain its minimum at an extreme point. Since there exists only one extreme point, when L = k the expectation of the sample coverage attains the minimum value

1 - \left( 1 - \frac{1}{k} \right)^N,

provided p_i = 1/k for i = 1, 2, ..., k. Now consider the case L = k + 1. Suppose E[C(p)] attains its minimum value on the boundary. Without loss of generality, let p_{k+1} = 0. Then the expectation of the sample coverage can be expressed as

E[C(p)] = 1 - \sum_{i=1}^k p_i (1 - p_i)^N,

which reduces to the case L = k. As we have discussed, when L = k the minimum value is 1 - (1 - 1/k)^N. However, if p_i = 1/(k+1) for i = 1, 2, ..., k+1, the expectation of the sample coverage E[C(p)] becomes

1 - \left( 1 - \frac{1}{k+1} \right)^N,

which is smaller than 1 - (1 - 1/k)^N. Therefore the minimum value of the expectation of the sample coverage cannot lie on the boundary for L = k + 1. By mathematical induction, the statement holds for all L ≥ 2. We conclude that the extreme value (D.5) must be the minimum value of E[C(p)].

Appendix E

Supplementary materials

Table E.1: Relative errors for estimating genome coverage in WES. The relative errors for estimating the number of nucleotides in the exome covered by at least r reads. The relative error was defined as the difference between the estimate and the true number of nucleotides covered by at least r reads, divided by the true value. The true value was obtained by subsampling the full dataset. Each entry contains the relative error along with its 95% confidence interval.
r = 4
accession id | 2 | 5 | 10 | 20
SRR1301329 | -0.004 (-0.009, 0.001) | 0.004 (-0.012, 0.021) | -0.035 (-0.056, -0.013) | -0.037 (-0.066, -0.007)
SRR1301343 | 0.065 (0.005, 0.129) | -0.062 (-0.106, -0.015) | -0.068 (-0.118, -0.016) | -0.018 (-0.031, -0.004)
SRR1301345 | -0.009 (-0.016, -0.002) | 0.010 (-0.009, 0.029) | -0.023 (-0.056, 0.011) | -0.039 (-0.097, 0.023)
SRR1301347 | 0.012 (-0.005, 0.029) | -0.068 (-0.100, -0.035) | -0.052 (-0.073, -0.031) | -0.016 (-0.022, -0.010)
SRR1301348 | 0.062 (0.002, 0.126) | -0.059 (-0.103, -0.014) | -0.063 (-0.115, -0.008) | -0.017 (-0.032, -0.001)
ERR315871 | -0.002 (-0.030, 0.026) | -0.014 (-0.057, 0.031) | -0.019 (-0.052, 0.015) | -0.012 (-0.024, 0.000)
ERR315877 | -0.001 (-0.005, 0.003) | -0.017 (-0.036, 0.002) | -0.033 (-0.068, 0.004) | -0.023 (-0.114, 0.077)
ERR315885 | 0.029 (0.018, 0.040) | -0.060 (-0.077, -0.042) | -0.054 (-0.069, -0.039) | -0.006 (-0.009, -0.004)
ERR315904 | -0.003 (-0.006, 0.000) | -0.010 (-0.025, 0.005) | -0.028 (-0.047, -0.009) | -0.024 (-0.088, 0.046)
ERR315908 | 0.021 (0.004, 0.039) | -0.039 (-0.056, -0.021) | -0.055 (-0.077, -0.032) | -0.030 (-0.036, -0.023)
N701-S3 | 0.019 (-0.009, 0.048) | -0.003 (-0.020, 0.014) | -0.038 (-0.069, -0.005) | -0.049 (-0.076, -0.021)
N701-S4 | 0.013 (0.006, 0.020) | 0.006 (-0.007, 0.018) | -0.029 (-0.044, -0.014) | -0.048 (-0.078, -0.017)
N707-S2 | 0.038 (0.036, 0.039) | -0.054 (-0.057, -0.051) | -0.081 (-0.089, -0.074) | -0.067 (-0.084, -0.051)
N707-S3 | 0.040 (0.010, 0.071) | -0.055 (-0.082, -0.027) | -0.085 (-0.123, -0.045) | -0.074 (-0.097, -0.050)
N707-S4 | 0.006 (0.003, 0.009) | -0.020 (-0.028, -0.011) | -0.047 (-0.062, -0.032) | -0.057 (-0.085, -0.029)
NA12878-5 | -0.009 (-0.011, -0.007) | -0.029 (-0.031, -0.026) | -0.045 (-0.050, -0.040) | -0.035 (-0.041, -0.029)
NA12878-12 | -0.006 (-0.008, -0.003) | -0.028 (-0.034, -0.022) | -0.047 (-0.064, -0.030) | -0.038 (-0.064, -0.010)
NA12878-16 | 0.018 (0.006, 0.030) | -0.071 (-0.086, -0.055) | -0.080 (-0.096, -0.063) | -0.040 (-0.046, -0.034)
NA12878-29 | 0.023 (0.010, 0.036) | -0.068 (-0.086, -0.049) | -0.078 (-0.093, -0.061) | -0.038 (-0.044, -0.032)
NA12878-41 | -0.006 (-0.014, 0.003) | -0.027 (-0.039, -0.015) | -0.045 (-0.057, -0.033) | -0.036 (-0.043, -0.028)

r = 8
accession id | 2 | 5 | 10 | 20
SRR1301329 | 0.123 (0.071, 0.179) | -0.006 (-0.063, 0.054) | 0.017 (-0.029, 0.066) | -0.05 (-0.073, -0.026)
SRR1301343 | 0.245 (0.026, 0.511) | 0.142 (0.033, 0.263) | -0.066 (-0.112, -0.019) | -0.099 (-0.165, -0.027)
SRR1301345 | 0.045 (-0.008, 0.101) | -0.037 (-0.1, 0.031) | 0.026 (-0.051, 0.11) | -0.033 (-0.119, 0.061)
SRR1301347 | 0.37 (0.11, 0.691) | -0.026 (-0.038, -0.013) | -0.108 (-0.151, -0.063) | -0.096 (-0.131, -0.059)
SRR1301348 | 0.229 (0.014, 0.489) | 0.123 (0.02, 0.236) | -0.066 (-0.107, -0.023) | -0.097 (-0.165, -0.025)
ERR315871 | 0.005 (-0.153, 0.192) | -0.013 (-0.044, 0.019) | -0.034 (-0.085, 0.02) | -0.041 (-0.089, 0.01)
ERR315877 | 0.012 (0.001, 0.023) | -0.011 (-0.032, 0.011) | -0.029 (-0.06, 0.002) | -0.051 (-0.11, 0.012)
ERR315885 | 0.18 (0.104, 0.261) | 0.019 (0.004, 0.034) | -0.084 (-0.108, -0.06) | -0.088 (-0.11, -0.065)
ERR315904 | 0.002 (-0.013, 0.017) | -0.013 (-0.024, -0.002) | -0.027 (-0.046, -0.009) | -0.04 (-0.076, -0.003)
ERR315908 | -0.087 (-0.166, -0.002) | 0.024 (-0.012, 0.06) | -0.057 (-0.076, -0.039) | -0.089 (-0.122, -0.056)
N701-S3 | 0.126 (0.101, 0.151) | 0.111 (0.056, 0.169) | 0.031 (0.014, 0.049) | -0.064 (-0.112, -0.015)
N701-S4 | 0.14 (0.126, 0.155) | 0.09 (0.066, 0.115) | 0.045 (0.015, 0.077) | -0.035 (-0.068, -0.001)
N707-S2 | 0.261 (0.256, 0.265) | 0.099 (0.092, 0.107) | -0.053 (-0.07, -0.036) | -0.131 (-0.162, -0.098)
N707-S3 | 0.262 (0.115, 0.429) | 0.097 (0.045, 0.151) | -0.058 (-0.082, -0.035) | -0.139 (-0.185, -0.09)
N707-S4 | 0.09 (0.074, 0.106) | 0.039 (0.02, 0.057) | -0.022 (-0.046, 0.002) | -0.085 (-0.116, -0.052)
NA12878-5 | -0.017 (-0.023, -0.012) | -0.035 (-0.04, -0.029) | -0.059 (-0.066, -0.052) | -0.089 (-0.101, -0.076)
NA12878-12 | -0.022 (-0.031, -0.013) | -0.032 (-0.042, -0.022) | -0.057 (-0.078, -0.035) | -0.091 (-0.106, -0.076)
NA12878-16 | 0.242 (0.146, 0.345) | -0.006 (-0.026, 0.014) | -0.105 (-0.12, -0.089) | -0.137 (-0.164, -0.11)
NA12878-29 | 0.269 (0.192, 0.351) | 0.001 (-0.009, 0.011) | -0.1 (-0.113, -0.088) | -0.134 (-0.15, -0.117)
NA12878-41 | -0.013 (-0.131, 0.12) | -0.031 (-0.046, -0.015) | -0.056 (-0.076, -0.036) | -0.089 (-0.111, -0.065)

r = 10
accession id | 2 | 5 | 10 | 20
SRR1301329 | 0.149 (0.014, 0.302) | -0.069 (-0.139, 0.007) | 0.047 (0.007, 0.087) | -0.035 (-0.073, 0.003)
SRR1301343 | 0.000 (-0.039, 0.040) | 0.303 (0.080, 0.573) | -0.010 (-0.020, 0.001) | -0.111 (-0.181, -0.036)
SRR1301345 | 0.012 (-0.098, 0.136) | -0.052 (-0.128, 0.031) | 0.026 (-0.038, 0.095) | -0.012 (-0.099, 0.084)
SRR1301347 | 0.556 (0.199, 1.018) | 0.049 (-0.015, 0.118) | -0.109 (-0.141, -0.076) | -0.123 (-0.163, -0.080)
SRR1301348 | -0.012 (-0.032, 0.009) | 0.164 (-0.045, 0.417) | -0.012 (-0.023, -0.001) | -0.078 (-0.148, -0.002)
ERR315871 | 0.153 (0.020, 0.304) | 0.080 (-0.009, 0.178) | -0.067 (-0.101, -0.032) | -0.106 (-0.163, -0.045)
ERR315877 | 0.026 (0.002, 0.050) | -0.020 (-0.051, 0.012) | -0.026 (-0.064, 0.014) | -0.054 (-0.120, 0.017)
ERR315885 | 0.162 (0.105, 0.221) | 0.083 (0.043, 0.124) | -0.069 (-0.082, -0.055) | -0.108 (-0.137, -0.079)
ERR315904 | 0.017 (-0.018, 0.053) | -0.019 (-0.039, 0.002) | -0.026 (-0.048, -0.005) | -0.049 (-0.067, -0.030)
ERR315908 | -0.250 (-0.457, 0.035) | 0.044 (-0.008, 0.098) | -0.039 (-0.043, -0.034) | -0.100 (-0.130, -0.068)
N701-S3 | 0.186 (-0.051, 0.483) | 0.193 (0.104, 0.290) | 0.110 (0.078, 0.143) | -0.035 (-0.082, 0.014)
N701-S4 | 0.206 (0.146, 0.270) | 0.178 (0.145, 0.212) | 0.096 (0.059, 0.134) | 0.000 (-0.063, 0.067)
N707-S2 | 0.281 (0.276, 0.287) | 0.223 (0.213, 0.234) | 0.013 (-0.010, 0.037) | -0.126 (-0.166, -0.084)
N707-S3 | 0.290 (0.167, 0.427) | 0.219 (0.125, 0.321) | 0.002 (-0.020, 0.024) | -0.136 (-0.173, -0.098)
N707-S4 | 0.118 (0.071, 0.168) | 0.102 (0.080, 0.126) | 0.008 (-0.021, 0.037) | -0.082 (-0.133, -0.028)
NA12878-5 | -0.003 (-0.013, 0.007) | -0.049 (-0.055, -0.044) | -0.060 (-0.068, -0.051) | -0.104 (-0.117, -0.091)
NA12878-12 | 0.005 (-0.005, 0.015) | -0.052 (-0.071, -0.033) | -0.055 (-0.076, -0.034) | -0.104 (-0.125, -0.082)
NA12878-16 | 0.332 (0.241, 0.429) | 0.043 (0.024, 0.063) | -0.089 (-0.100, -0.077) | -0.159 (-0.180, -0.138)
NA12878-29 | 0.386 (0.205, 0.596) | 0.054 (0.005, 0.106) | -0.082 (-0.096, -0.067) | -0.152 (-0.183, -0.120)
NA12878-41 | 0.015 (-0.107, 0.153) | -0.046 (-0.080, -0.010) | -0.055 (-0.067, -0.043) | -0.100 (-0.121, -0.078)

Table E.2: Relative errors for estimating genome coverage in scWGS. The relative errors for estimating the number of nucleotides in the genome covered by at least r reads using the initial sample of 5M reads. The relative error was defined as the difference between the estimate and the true number of nucleotides covered by at least r reads, divided by the true value. The true value was obtained by subsampling the full dataset. Each entry contains the relative error along with its 95% confidence interval.
r = 4
accession id | 2 | 5 | 10 | 20
SRX2660674 | 0.000 (-0.002, 0.002) | 0.000 (-0.008, 0.008) | 0.002 (-0.009, 0.013) | -0.003 (-0.047, 0.043)
SRX2660675 | -0.001 (-0.004, 0.001) | -0.006 (-0.012, 0.001) | -0.003 (-0.007, 0.001) | 0.015 (-0.014, 0.045)
SRX2660676 | 0.000 (-0.004, 0.003) | -0.010 (-0.015, -0.006) | 0.014 (0.007, 0.022) | 0.052 (0.031, 0.072)
SRX2660677 | -0.003 (-0.005, -0.002) | -0.016 (-0.017, -0.014) | 0.000 (-0.005, 0.006) | 0.037 (0.030, 0.045)
SRX2660678 | -0.003 (-0.006, 0.001) | -0.009 (-0.012, -0.006) | -0.004 (-0.007, -0.001) | 0.027 (0.019, 0.035)
SRX2660679 | -0.002 (-0.006, 0.001) | -0.003 (-0.016, 0.009) | -0.005 (-0.016, 0.006) | 0.007 (-0.034, 0.050)
SRX2660680 | 0.001 (-0.001, 0.002) | -0.008 (-0.011, -0.006) | -0.001 (-0.004, 0.002) | 0.015 (0.010, 0.020)
SRX2660681 | -0.001 (-0.004, 0.002) | 0.006 (-0.008, 0.019) | -0.002 (-0.016, 0.013) | -0.018 (-0.050, 0.016)
SRX2660682 | -0.006 (-0.009, -0.003) | -0.013 (-0.023, -0.002) | 0.004 (-0.004, 0.012) | 0.028 (0.001, 0.055)
SRX2660683 | -0.043 (-0.086, 0.001) | 0.023 (0.011, 0.035) | 0.030 (0.021, 0.039) | -0.019 (-0.027, -0.011)
SRX2660684 | -0.027 (-0.053, -0.002) | 0.012 (-0.004, 0.029) | 0.008 (-0.008, 0.025) | -0.024 (-0.045, -0.003)
SRX2660685 | -0.008 (-0.037, 0.021) | 0.016 (-0.011, 0.044) | 0.022 (0.003, 0.041) | -0.038 (-0.058, -0.018)

r = 8
accession id | 2 | 5 | 10 | 20
SRX2660674 | -0.004 (-0.019, 0.010) | 0.002 (-0.001, 0.004) | 0.000 (-0.016, 0.016) | 0.002 (-0.005, 0.009)
SRX2660675 | -0.005 (-0.009, -0.001) | 0.002 (-0.005, 0.008) | -0.006 (-0.014, 0.002) | -0.008 (-0.020, 0.003)
SRX2660676 | -0.008 (-0.011, -0.005) | 0.006 (0.004, 0.009) | -0.022 (-0.033, -0.011) | 0.003 (-0.005, 0.012)
SRX2660677 | -0.024 (-0.029, -0.019) | 0.006 (0.004, 0.008) | -0.021 (-0.028, -0.014) | -0.016 (-0.022, -0.010)
SRX2660678 | -0.025 (-0.039, -0.011) | 0.003 (-0.001, 0.007) | -0.009 (-0.015, -0.003) | -0.018 (-0.025, -0.010)
SRX2660679 | -0.019 (-0.034, -0.004) | -0.005 (-0.020, 0.009) | -0.002 (-0.023, 0.020) | 0.001 (-0.037, 0.040)
SRX2660680 | -0.022 (-0.031, -0.014) | 0.011 (0.006, 0.016) | -0.011 (-0.015, -0.007) | -0.010 (-0.018, -0.002)
SRX2660681 | 0.003 (-0.018, 0.024) | -0.004 (-0.018, 0.010) | 0.006 (-0.008, 0.021) | 0.004 (-0.026, 0.035)
SRX2660682 | -0.043 (-0.075, -0.009) | 0.004 (-0.005, 0.013) | -0.022 (-0.039, -0.005) | -0.008 (-0.039, 0.024)
SRX2660683 | -0.186 (-0.429, 0.160) | -0.122 (-0.215, -0.017) | -0.069 (-0.106, -0.032) | 0.091 (0.086, 0.096)
SRX2660684 | -0.150 (-0.275, -0.004) | -0.119 (-0.295, 0.101) | 0.000 (-0.074, 0.080) | 0.053 (0.003, 0.105)
SRX2660685 | -0.024 (-0.191, 0.178) | 0.019 (-0.182, 0.270) | -0.108 (-0.247, 0.055) | 0.110 (0.046, 0.178)

r = 10
accession id | 2 | 5 | 10 | 20
SRX2660674 | -0.002 (-0.010, 0.006) | 0.001 (-0.029, 0.031) | -0.001 (-0.032, 0.032) | 0.000 (-0.013, 0.013)
SRX2660675 | -0.004 (-0.008, 0.001) | 0.001 (-0.004, 0.007) | -0.004 (-0.006, -0.001) | -0.008 (-0.022, 0.006)
SRX2660676 | -0.014 (-0.023, -0.006) | 0.014 (0.008, 0.020) | -0.015 (-0.025, -0.006) | -0.015 (-0.018, -0.011)
SRX2660677 | -0.046 (-0.065, -0.028) | 0.007 (0.002, 0.013) | -0.009 (-0.020, 0.002) | -0.029 (-0.033, -0.026)
SRX2660678 | -0.031 (-0.063, 0.002) | -0.001 (-0.007, 0.005) | 0.002 (-0.004, 0.007) | -0.023 (-0.032, -0.015)
SRX2660679 | -0.035 (-0.061, -0.008) | -0.003 (-0.011, 0.005) | 0.002 (-0.015, 0.020) | 0.003 (-0.033, 0.040)
SRX2660680 | -0.036 (-0.050, -0.022) | 0.009 (0.001, 0.017) | 0.001 (-0.004, 0.007) | -0.018 (-0.022, -0.014)
SRX2660681 | -0.004 (-0.023, 0.015) | -0.002 (-0.010, 0.006) | -0.005 (-0.020, 0.010) | 0.016 (-0.001, 0.034)
SRX2660682 | -0.079 (-0.140, -0.014) | 0.008 (-0.010, 0.025) | -0.012 (-0.022, -0.002) | -0.022 (-0.025, -0.018)
SRX2660683 | -0.265 (-0.514, 0.111) | -0.026 (-0.187, 0.167) | -0.198 (-0.270, -0.118) | 0.085 (0.010, 0.164)
SRX2660684 | -0.163 (-0.372, 0.116) | -0.022 (-0.260, 0.293) | -0.126 (-0.303, 0.095) | 0.060 (-0.029, 0.157)
SRX2660685 | -0.016 (-0.240, 0.275) | 0.101 (-0.112, 0.364) | -0.189 (-0.494, 0.302) | 0.081 (-0.020, 0.193)
Abstract
The development of high-throughput DNA sequencing technology has fundamentally changed biological science and has started to impact clinical practice. Although costs have fallen quickly, sequencing experiments remain expensive to conduct, especially large sequencing projects that involve many individuals. This work develops a method to predict the optimal amount of sequencing for single-cell DNA sequencing experiments from shallow, low-cost pilot sequencing experiments. Beyond addressing cost and scale, our statistical method characterizes high-throughput DNA sequencing libraries in terms of coverage and dropout events using only a very small amount of sequencing from pilot studies. Sequencing centers and researchers can easily adopt this method to optimize their experimental designs, establishing a practical foundation for the development of efficient and cost-effective sequencing protocols.