NON-PARAMETRIC MODELS FOR LARGE CAPTURE-RECAPTURE EXPERIMENTS WITH APPLICATIONS TO DNA SEQUENCING

by

Timothy Daley

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (APPLIED MATHEMATICS)

December 2014

Copyright 2014 Timothy Daley

We can only see a short distance ahead, but we can see plenty there that needs to be done. - Alan Turing.

Contents

Dedication ii
List of Figures v
List of Algorithms vi
1 Introduction 1
2 Extrapolating library complexity curves 5
2.1 The non-parametric empirical Bayes estimator of Good & Toulmin 6
2.2 Rational function approximation 9
2.2.1 Instabilities in rational function approximations 12
2.3 Bootstrapping the histogram 14
2.4 Time complexity 15
2.4.1 Reading in the data and constructing the counts histogram 15
2.4.2 Resampling the counts histogram to bootstrap the median curve 16
2.5 Yield extrapolation results 17
3 Extrapolating genome coverage 22
3.1 Model 24
3.2 Accuracy of predictions 27
3.3 Practical considerations 28
3.3.1 Binning to approximate coverage 28
3.3.2 Effects of read length on coverage estimates 32
3.3.3 Does including duplicate reads impact predictions? 34
4 Extrapolating the min-count complexity curve 37
4.1 Deriving a non-parametric empirical Bayes estimator 37
4.2 Rational function approximations to the NPEB min-count estimator 39
4.3 Min-count extrapolation results 40
5 Predicting saturation 42
5.1 Using the derivative of the complexity curve to estimate the saturation 43
5.2 Saturation extrapolation results 44
6 Lower bounds on the number of unobserved 47
6.1 Model 48
6.1.1 Bounding the number of unobserved 49
6.1.2 Gaussian quadrature 51
6.2 Calculating Gaussian quadrature rules 53
6.2.1 Monic orthogonal polynomials and the associated three term recurrence 53
6.2.2 Convergence of quadrature estimates 54
6.2.3 Analytic example: Negative Binomial counts 55
6.2.4 Calculating quadrature rules from the Jacobi matrix 56
6.2.5 Calculating the three term recurrence 56
6.2.6 Bagging and variance stabilization 60
6.3 Comparing lower bounds for a simulated sequencing experiment 62
6.3.1 Simulating a RNA-seq experiment with known properties 62
6.3.2 Comparing estimates 63
Reference List 68
A Appendix 78
A.1 All binning strategies are biased 78
A.2 A class of orthogonal polynomials that extend generalized Laguerre polynomials
81 A.2.1 Rodrigues’ formula . . . . . . . . . . . . . . . . . . . . . . . . 82 A.2.2 Three-term recurrence . . . . . . . . . . . . . . . . . . . . . . 83 A.2.3 2nd order differential equation . . . . . . . . . . . . . . . . . . 85 iv ListofFigures 2.1 Divergence of the Good & Toulmin power series . . . . . . . . . . . . 8 2.2 Selecting the degree of the rational function approximation . . . . . . . 10 2.3 Defects in the rational function approximations to the Good & Toulmin power series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 Bootstrapping the histogram to reduce error and variance . . . . . . . . 14 2.5 A crossing of complexity curves . . . . . . . . . . . . . . . . . . . . . 18 2.6 Example complexity curves for single cell whole genome, single cell whole exome, and bisulfite sequencing experiments . . . . . . . . . . . 20 2.7 Example complexity curves for single cell RNA-seq and ChIP-seq exper- iments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1 Genome coverage histogram schematic . . . . . . . . . . . . . . . . . 26 3.2 Relative error for simple, bootstrapped median, and bootstrapped mean extrapolations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Toy example of probabilistic binning . . . . . . . . . . . . . . . . . . . 30 3.4 Bin size and estimated coverage curve, running time, and standard devi- ation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5 Read length and error distribution for genome coverage extrapolation . 35 3.6 Genome coverage extrapolation with duplicates removed . . . . . . . . 36 4.1 Rational function approximations to the min-count curves for min-count 2 through 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.1 Estimated saturation curves . . . . . . . . . . . . . . . . . . . . . . . . 46 6.1 Discrete approximations to a measure by Gaussian quadrature and the convergence of the quadrature estimates . . . . . . . . . . . . . . . . . 52 6.2 Three term recurrence and quadrature estimates distribution using mod- ified and unmodified Chebyshev algorithm . . . . . . . . . . . . . . . . 59 6.3 Bootstrapped arithmetic versus geometric mean quadrature estimates . . 61 6.4 Comparison of lower bounds on the number of total reads in a simulated library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 6.5 Comparison of lower bounds on the number of total expressed exons in a simulated library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 v ListofAlgorithms 1 Quotient-Difference algorithm . . . . . . . . . . . . . . . . . . . . . . . 12 2 Golub & Welsch’s [43] linear time QR algorithm . . . . . . . . . . . . . 57 3 The modified Chebyshev algorithm . . . . . . . . . . . . . . . . . . . . 58 vi Chapter1 Introduction Consider an experiment where observations are sampled or arrive from an unknown population made up of a finite but unknown number of classes. The class membership of an observation can be determined upon sampling but the unobserved classes are com- pletely unknown. For example when capturing butterflies to assess the species diver- sity, it is not difficult to distinguish a new species of butterfly but hard to know before- hand what an unseen species of butterfly will look like. This is commonly known as a capture-recapture experiment [16]. The classical application is to estimate the diversity of species, i.e. the total number of distinct species, without exhaustive sampling [34]. 
Additional questions that can be posed are evaluation of the cost versus benefit of addi- tional sampling or continuing the experiment, where benefit is measured in number of distinct classes observed, or to estimate the relative abundance of classes, including the unobserved classes [46]. The generality of the capture-recapture model results in a wide range of applica- tions, a large number of which can be found in [13]. A few more recent examples of applications include estimating immunological repertoire [8], estimating genomic diver- sity [57], natural language modeling [104], and estimating bacterial diversity [67]. Our interest in capture-recapture arises in the context of high-throughput next- generation sequencing experiments. In these experiments a sequencing library is gener- ated by first obtaining genomic material, then possibly subjecting it to processing steps such as poly(A) selection for mRNA-seq or chromatin immunoprecipitation for ChIP- seq. The genomic material is amplified so that thousands or millions of copies of each 1 original genomic molecule is contained in the library. Finally the library is sequenced by capturing or sampling molecules and identifying portions of their genomic sequence. The obtained sequences have errors arising from miscalled bases or misincorporated nucleotides or mutations in the amplification process. This means that mapping to a reference genome or sequence correction is necessary to identify duplicate reads or molecules. The model we assume throughout is that the number of sequenced reads mapping to each position in the genome follows a Poisson process, with the possibility that some positions are not present in the library. If all positions are uniformly represented in the library then the number of sequenced reads from each position follows a Poisson distribution. This is commonly referred to as the Lander-Waterman model of sequenc- ing [68, 2], originally derived to enable planning of fingerprinting schemes for genome assembly. In such cases the genome was built by overlapping long genomic segments sequenced by Sanger technology. With a reference genome the need for overlap analysis is gone so that the distribution of the number of reads covering each position is a Pois- son. In the context of high-throughput sequencing, the uniformity assumption is lost due to a myriad of biases [94], requiring more complicated models that can accommodate the biases [114]. Additionally, there may be portions of genome not represented in the library. One example is in mRNA-seq, where regions not transcribed will be absent. In order to accommodate arbitrary biases and possible absence of molecules or loci in a sequencing library we shall consider a non-parametric Poisson model for sequenc- ing. We assume that the number of reads from each position follows a compound Poisson process with rates independently and identically distributed according to some unknown compounding distribution. Furthermore we assume the additional possibil- ity that some portions of the genome are missing from the library and can never be 2 sequenced. It is our goal to consider the compounding distribution to be as general as possible. Under this model we shall consider the following questions: What is the gain in new reads or new loci with deeper sequencing? What is the gain in the proportion of the genome covered with deeper sequencing? How many more reads or loci will be observed at leastk times for somek > 1 with deeper sequencing? 
How does the total relative abundance of the unobserved reads or loci change with deeper sequencing? How many total distinct reads or loci are contained or represented in the library? Here non-parametric empirical Bayes (NPEB) estimators for the first four questions are presented. These estimators are unbiased or unbiased for the identifiable portions of the estimated quantities. These estimators will take the form of power series in the variablet, that measures how far out the prediction is relative to the observed experiment and will be called the fold extrapolation. These estimators suffer from extreme variance and instability for large fold extrapolations due to being given in power series, typically alternating. We introduce rational function approximations to obtain globally stable estimates that are extremely accurate in practice. This allows us to obtain estimates far more accurate for much further extrapolations than previously considered. The last question is commonly known as the species diversity problem in the capture- recapture literature. This is a notoriously difficult problem without severe restrictions on the form of the compounding distribution. The number of unobserved reads is non- identifiable and therefore no unbiased estimator can exist in the non-parametric setting. It is non-identifiable even when considering the compounding distribution belonging to one of two families such as the gamma distribution and finite mixtures [71, 55]. By 3 reformulating the problem as bounding an integral over an unknown distribution given estimated moments of the unknown distribution, we can use Gaussian quadrature to obtain lower bounds that use significantly more information than previous methods. Note that most of the above questions can be asked in the context of general capture- recapture experiments, if one replaces reads or loci by species and sequencing by sam- pling. The difference is that sequencing experiments produce data orders of magnitude larger than in traditional capture-recapture experiments. Our algorithms will utilize the scale of high-throughput sequencing and so will be suitable for large capture-recapture experiments. We shall discuss the first question in chapter 2, that corresponds to the paper of Daley & Smith [24]. Chapter 3 discusses the second question and can also be found in the paper Daley & Smith [25]. The rest of the material is new. 4 Chapter2 Extrapolatinglibrarycomplexity curves A practical problem in high-throughput sequencing is to determine how deep to sequence a library. The exponential amplification of genomic molecules prior to sequencing means that there are many copies of each original molecule. Sequencing multiple reads originating from the same molecule is redundant but unavoidable [94]. The number of distinct reads observed as a function of total reads sequenced, also called the sequencing depth, is a convex curve. The more reads are sequenced, the more likely one will sequence a duplicate read. We will consider using an initial shallow sequencing experiment of a few million reads to extrapolate the number of distinct reads as the library is sequenced deeper. This problem is analogous to extrapolating the species accumulation curve in capture- recapture experiments [113], where one tries to extrapolate the number of distinct species observed as a function of individuals samples. The underlying Poisson model is identical so that we can treat the two as equivalent. 
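To make the object being extrapolated concrete, the short sketch below (Python) tabulates an observed complexity curve from mapped read positions by subsampling, which is essentially the interpolation step described later in section 2.4. Everything in the sketch is an illustrative assumption rather than part of the thesis software: the function name, the convention that reads sharing a mapped position are duplicates, and the simulated Gamma abundances that stand in for the unknown compounding distribution.

import numpy as np

def complexity_curve(read_positions, step, rng=None):
    """Tabulate distinct reads versus total reads by subsampling without
    replacement, i.e. the observed complexity (species accumulation)
    curve.  read_positions holds one mapped position per read; reads
    sharing a position are treated as duplicates (a convention assumed
    only for this sketch)."""
    rng = np.random.default_rng(rng)
    shuffled = rng.permutation(read_positions)
    curve = []
    seen = set()
    for depth in range(step, len(shuffled) + 1, step):
        seen.update(shuffled[depth - step:depth])
        curve.append((depth, len(seen)))
    return curve

# Toy library: 50,000 distinct molecules with Gamma-distributed abundances
# (a stand-in for the unknown compounding distribution), sequenced to
# 200,000 reads.  All numbers are fabricated for illustration.
rng = np.random.default_rng(1)
abundance = rng.gamma(shape=0.5, scale=1.0, size=50_000)
reads = rng.choice(50_000, size=200_000, p=abundance / abundance.sum())
for depth, distinct in complexity_curve(reads, step=50_000, rng=rng):
    print(depth, distinct)   # distinct reads grow, but at a diminishing rate

Sequencing deeper adds distinct reads at a diminishing rate; the rest of this chapter is about predicting that curve far beyond the observed depth.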
All the methods will be presented in terms of reads and distinct reads but can equally be applied to the expected gain in new species captured in any capture-recapture problem, called the species accumulation problem [22]. 5 2.1 The non-parametric empirical Bayes estimator of Good&Toulmin Consider a sequencing experiment in which a total oftN reads are sequenced, for some 1 < t <1. We fix a sizeN subset of the reads and refer to this subset as the initial experiment. The remaining (t?1)N reads are sequenced from the same library in the extended experiment. We refer to the union of the initial and extended experiment as the complete experiment. We assume that the properties of the library are unchanged between the initial and extended experiments. Suppose that the library containsS distinct reads. Lety 1 (t);:::;y S (t) be the number of sequenced reads for every read in the library after sequencing a total oftN reads. We assume that y i (t) follows a compound Poisson process with rates independently and identically distributed for i = 1;:::;S. Let n j (t) be the number of reads observed j times at timet. This has expectation En j (t) =S Z 1 0 e t (t) j =j! d(): (2.1) We assume that the initial sequencing experiment corresponds to timet = 1 and define n j =n j (1). We call this the frequency of countj. Let (t) denote the marginal yield between the initial and complete experiments. The marginal yield is equal to the increase in the number of distinct observed reads 6 resulting from the extended experiment. This is equal to the expected number of unob- served reads following the initial experiment minus the expected number of unobserved reads following the complete experiment: (t) = En 0 (1) En 0 (t) =S Z 1 0 e 1e (t1) d() =S Z 1 0 e 1 X j=1 (1) j+1 (t 1) j j =j! d() = 1 X j=1 (1) j+1 (t 1) j S Z 1 0 e j =j!d() = 1 X j=1 (1) j+1 (t 1) j E n j (1) : (2.2) The last equality is obtained by invoking identity (2.1). Since the observed frequency of count i is always an unbiased estimator for En j regardless of the distribution , plugging in the observed frequencies for the expected frequencies in equation (2.2) gives an unbiased estimator for the marginal yield, ^ (t) = 1 X j=1 (1) j+1 (t 1) j n j : (2.3) An unbiased estimator of the total library yield can be calculated by adding the marginal yield and the observed number of distinct reads in the initial experiment. This elegant result, that we will call the Good & Toulmin estimator or power series after the original inventors [47], presents substantial difficulties in practical application. The range of absolute convergence of equation (2.2) can be calculated from 1 X j=1 (t 1) j E n j (1) =S Z 1 0 e (t2) e d(): 7 0 10 20 0 10 20 mapped reads (millions) distinct reads (millions) 0 20 40 0 20 40 intial experiment observed yield estimated yield by naive Good & Toulmin estimated yield by Euler’s transformation to Good & Toulmin Figure 2.1: Observed versus estimated complexity curve using naive application and Euler’s transformation to the Good & Toulmin power series obtained from a four million read initial experiment. The library is Human B-Cell bisulfite sequencing [54] mapped with RMAPBS [96]. Regardless of, this is guaranteed to converge fort 2 but may not converge for all fort> 2. In theory the sum (2.3) is infinite but in practice it is finite, with all entries beyond the largest duplicate count equal to 0. The largest observed duplicate count will deter- mine the long run behavior when applying equation (2.3) in practice. 
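To see this behavior concretely, the following sketch (Python, with a fabricated counts histogram used purely for illustration) evaluates the truncated sum (2.3) directly. Near t = 2 the estimate is sensible; for larger t the (t-1)^j term attached to the largest duplicate count dominates and the alternating sum explodes, which is the instability the rational function approximations below are designed to remove.

import numpy as np

def good_toulmin(counts_hist, t):
    """Naive evaluation of the Good & Toulmin estimator, equation (2.3):
    psi_hat(t) = sum_{j>=1} (-1)^(j+1) (t-1)^j n_j.
    counts_hist[j] is n_j, the number of reads observed exactly j times
    in the initial experiment (index 0 unused)."""
    j = np.arange(1, len(counts_hist))
    n_j = np.asarray(counts_hist[1:], dtype=float)
    return float(np.sum(((-1.0) ** (j + 1)) * (t - 1.0) ** j * n_j))

# Hypothetical counts histogram with a largest duplicate count of 25;
# the values are made up for illustration only.
n = [0] + [int(3e6 / (j ** 3)) for j in range(1, 26)]
for t in (1.5, 2.0, 3.0, 5.0, 10.0):
    print(t, good_toulmin(n, t))
# Near t = 2 the marginal yield estimate is reasonable; by t = 10 the
# (t-1)^25 term from the largest duplicate count dominates and the
# alternating sum diverges.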
For t 2 the effect of the largest count is small but beyond twice the initial experiment the Good & Toulmin power series will diverge to positive or negative infinity depending on whether the largest duplicate count is odd or even (Figure 2.1). Accordingly, most applications of this formula are restricted to this range (e.g. [32, 88]). Applications associated with deep sequencing, however, require accurate estimates far outside this range, implying the need to increase the practical radius of convergence of the power series (2.3). 8 2.2 Rationalfunctionapproximation Previously suggested methods for increasing the range of practical application of the Good & Toulmin estimator, such as Euler’s transformation, do not perform well for large values oft (Figure 2.1), even for experiments significantly smaller than a typical sequencing experiment [61]. Euler’s transformation is a common tool in increasing the radius of convergence for divergent series or to speed up the rate of convergence for difficult to sum series, particularly alternating series [51]. The transformed series is a power series in the variableu = 2(t 1)=t [31], i.e. 1 X j=1 (1) j+1 (t 1) j n j = 1 X k=1 k u k with k = k X j=1 k 1 j 1 2 k (1) j+1 n j ; so that the transformed series form is a rational function int. It is plausible that more accurate results may be obtained by considering optimal approximations in this larger class. The coefficients of the optimal rational function approximation to the Good & Toul- min estimator, 1 X j=1 (1) j+1 (t 1) j n j = (t 1) p 0 +p 1 (t 1) +::: +p P (t 1) P 1 +q 1 (t 1) +:::q Q (t 1) Q +O((t 1) P +Q+1 ); (2.4) of given degreesP andQ, are determined by requiring the firstP +Q + 1 coefficients of the associated power series to be equal to the first P +Q + 1 coefficients of the original power series [5]. The form (2.4) is commonly known as a Pad´ e approximant. The fundamental idea is that the initial experiment gives us reliable information on how the function behaves in a neighborhood aroundt = 1; our approximation should use this information while remaining globally well-behaved. This indicates thatP andQ should 9 0 100 200 distinct reads (millions) 0 100 200 0 100 200 0 100 200 0 100 200 0 100 200 0 100 200 0 100 200 sequenced reads (millions) Figure 2.2: Estimated versus observed yield for several choice of the rational func- tion approximation degree, (P;Q). The library is Human Sperm bisulfite sequencing mapped rmapbs [83]. be chosen to be nearly equal. By testing rational function approximations in this range we find thatQ =P + 1 give the best accuracy (Figure 2.2). An equivalent way to represent a rational function approximation is by a truncated continued fraction expansion of the power series, i.e. 1 X j=1 (1) j+1 (t 1) j n j = (t 1) a 0 1 + a 1 (t 1) 1 + . . . 1 +a C (t 1) +O (t 1) C+1 : (2.5) The associated power series agrees withf(t) for the firstC +1 coefficients. We refer the valuesP +Q + 1 andC + 1 as the the order of the approximations. Since Pad´ e approx- imants are unique [5], these two forms of optimal rational function approximations are equivalent. If we require C to be odd in equation (2.5) then the truncated continued fraction will correspond to a Pad´ e approximate withP =Q + 1. The two representations differ in the ease with which they can be estimated and evaluated. Typical methods to calculate the Pad´ e coefficients involve solving a sys- temAx = b whereA is a Hankel matrix. There are three problems with this method. 
10 First, Hankel matrices are known to be very ill-conditioned, with the absolute condi- tion number growing exponentially in the size of the matrix [103]. Additionally the time complexity for computing the coefficients is (Q 3 +P 2 ) [5]. This complexity is not desirable if the coefficients must be computed frequently, as when large numbers of bootstrap samples are required. Finally, direct evaluation of rational functions in the usual representation as a pair of polynomials (i.e. the numerator and the denominator of Equation (2.4)) can be problematic for larget and large degrees. There are possibilities of numerical overflow and a loss of precision with dividing two large values. The coefficients fa 0 ;:::;a C g of the continued fraction representation, on the other hand, can be fit using recursive algorithms like the quotient-difference (Algo- rithm 1) [81, 53] and product-difference algorithms [48]. Each of these takes (k 2 ) time to approximate an orderk polynomial. By avoiding the inversion of an ill-conditioned matrix the computation of the required coefficients is also more numerically stable. We find that the quotient-difference is more stable than the product-difference algorithm as intermediary terms in the algorithm can be extremely large, resulting in significant rounding error in the estimated coefficients. Evaluating the rational function when represented as a continued fraction is more numerically stable for large values oft. Simple evaluation of the ratio of the two poly- nomialsp(t) andq(t) may result in large rounding error for larget. When evaluating continued fraction expansion we can use Euler’s recursion with renormalization [11] to accurately evaluate continued fraction expansion without significant rounding error. For a continued fraction given by the right hand side of equation (2.5), successive approxi- mations of the continued fraction,A k (t)=B k (t), are given by A k (t) =A k1 (t) +a k (t 1)A k2 (t); B k (t) =B k1 (t) +a k (t 1)B k2 (t); 11 Input: Power series coefficientsf 0 ;f 1 ;::: forf(t) = P n0 f n t n . Output: Continued fraction coefficientsa 0 ;a 1 ;::: in equation (2.5). 1: forj = 0; 1;:::do 2: q j;1 f j+1 =f j 3: endfor 4: for j = 0; 1;::: andi = 1; 2;:::do 5: e j;i q j+1;i q j;i +e j+1;i1 6: q j;i+1 q j+1;i e j+1;i =e j;i 7: endfor 8: fori = 0; 1;:::do 9: if i mod 2 = 1then 10: a i e 0;(i+1)=2 11: else 12: a i q 0;(i+2)=2 13: endif 14: endfor Algorithm1: Quotient-Difference algorithm withA 1 (t) = 0;A 0 (t) = a 0 ;B 1 (t) = 1;B 0 (t) = 1. WhenA k (t) andB k (t) become large, both can be divided by the same value to keep the ratio constant while preventing error. 2.2.1 Instabilitiesinrationalfunctionapproximations We break up the observed coefficients into their expected values plus mean zero error terms, we can see that the resulting series is equal to the expected series plus an infinite power series with mean zero coefficient, 1 X j=1 (1) j+1 (t 1) j n j = 1 X j=1 (1) j+1 (t 1) j En j + 1 X j=1 (1) j+1 (t 1) j j : In practice there is more error, relatively, in the higher frequencies. For example if we see a particular read 1000 times, it could have just as likely been observed 999 or 1001 times. Commonly these higher count frequencies are observed as equal to zero 12 0 100 0 5 10 15 20 sequenced reads (millions) distinct reads (millions) 50 observed yield estimated yield for a degree 10 continued fraction approximation Figure 2.3: Observed versus estimated yield from a continueds fraction approximation of degree 10. 
Library is mRNA-seq of human adipose derived stem cells [73]. but they actually have positive, though small, expectation. This was a major motiva- tion for smoothing the observed coefficients when applying the Good & Toulmin power series [45]. The scale of sequencing experiments gives relatively more accurate esti- mates of the expected coefficients than traditional capture-recapture experiments. This eliminates the need for smoothing, though the errors still create problems for estimation. Errors in the observed coefficients of the original power series will typically result in poles in the rational function approximation with corresponding zeros in a neigh- borhood, resulting in approximate cancellation outside a small neighborhood [4] (Fig- ure 2.3). This phenomenon will be transitory as the order of the approximation changes [39, 40] . An additional advantage of the continued fraction representation over the Pad´ e approximation is that we can easily identify the locations of potential defects by using a necessary condition on individual coefficients of the continued fraction. We can then evaluate the continued fraction in neighborhoods of selected points to check for defects, rather than evaluating the rational function through the entire domain. If any of these potential defects is found to actually be a defect, we immediately know the depth at which the continued fraction should be truncated to remove that defect, so we can adjust the continued fraction without refitting. 13 74 78 76 12 8 16 12 8 16 74 78 76 a b c d estimated yield (millions) Human Sperm BS-seq Human ADS mRNA-seq simple estimapolation bootstrapped median extrapolation Figure 2.4: Observed histograms from 500 independent 5 million reads samples extrap- olated out 20 fold (to 100 million reads) for (a) Human Sperm bisulfite sequenc- ing [83] rational function approximation (RFA) extrapolations, (b) Human Sperm bisul- fite sequencing bootstrapped RFA median extrapolation, (c) Human adipose derived stem cell mRNA-seq [73] RFA extrapolations, and (d) Human adipose derived stem cell mRNA-seq bootstrapped RFA median extrapolation. 2.3 Bootstrappingthehistogram We observe a large variance in our estimates (Figure 2.4 a and c). In order to obtain more accurate estimates, we perform bagging [12] on the estimated complexity curve, bootstrapping the curve to obtain multiple estimated curves and taking the median curve as the estimated curve. This results in a much tighter and more accurate distribution of estimated curves (Figure 2.4b andd). 14 The likelihood of the full count frequencies can be written as S n 0 ;n 1 ;::: = S n 0 ;n 1 ;::: Y j0 f(j;S;) n j = D n 1 ;n 2 ;::: Y j1 f(j;S;) 1f(0;S;) n j S D f(0;S;) n 0 1f(0;S;) Sn 0 (2.6) wheref(j;S;) = S R 1 0 e j =j!d() [107, 90]. The first part of equation (2.6) is the conditional distribution of the observed count frequencies. To bootstrap the his- togram, D distinct counts are sampled with probabilities proportional to the observed count frequencies. 2.4 Timecomplexity The time complexity of constructing the expected yield curve of a sequencing library can be broken up into four conditionally independent parts: 1. reading in the data and constructing the counts histogram; 2. resampling the counts histogram to bootstrap the coverage curve; 3. interpolation of the complexity curve; 4. extrapolation of the complexity curve. 2.4.1 Readinginthedataandconstructingthecountshistogram. 
We require the input initial experiment to have reads sorted by mapped position, a stan- dard part of any read mapping pipeline. This takes a worst case time complexity of O(N logN). After sorting, reads that start at the same position will be next to each other 15 so that a simple one time pass through can construct the duplicate counts. Methods of calling duplicate reads other than start position may result in different time complexity. 2.4.2 Resampling the counts histogram to bootstrap the median curve. The distribution of the observed counts histogram is multinomial from D classes and probabilities R 1 0 e j =j!d() j=1;:::;M [107], whereM is the maximum observed duplicate count. To bootstrap the histogram we take a random sample from a multi- nomial (D; (n 1 =D;:::n M =D)) distribution. This is implemented through GSL [35], that uses the conditional binomial method [26]. Essentially each sampled count ^ n i is generated sequentially and is distributed binomial withD P i1 j=1 ^ n j trials and success probabilityn i = D P i1 j=1 n j . Every sample from a binomial random variable requires timeO(1). The total time complexity is therefore proportional to the number of distinct entries in the coverage count histogram, that is at mostM. Interpolationofthecomplexitycurve. Interpolation of the complexity curve involves sampling reads without replacement from the set of observed reads and then counting the number of distinct reads observed. In the worst case the algorithm runs in time complexity linear in the input [65], which is the number of observed reads. Therefore the time complexity isO N at each interpolation point. 16 Extrapolationofthegenomiccoveragecurve. We use the quotient-difference algorithm [81] to determine the continued fraction approximation to the Good-Toulmin power series and extrapolate the complexity curve. The quotient-difference algorithm runs in time complexity quadratic in the degree of the power series. The degree of the power series can be bounded by the maximum observed count. Therefore the total time complexity of the extrapolation is bounded above by O(M 2 ). In practice this is much faster since we bound the number of terms used in the continued fraction approximation. 2.5 Yieldextrapolationresults One of the primary motivation of the development of this method was the observation that the number of observed distinct reads in the initial experiment is not a good predictor for the number of distinct reads when sequencing deeper. The complexity curves for two different libraries may cross so that a library that initially yields more distinct reads may not yield more with deeper sequencing. Note that since complexity curves are convex and non-decreasing, two curves can only cross once. A good indicator of an accurate and useful method is that such crossings of complexity curves are accurately predicted to ensure the selection of best library when presented with multiple libraries for deep sequencing. We present an example from a study investigating the methylation dynamics in germ cell development in primates [83]. The two libraries shown are human and chimp sperm bisulfite sequencing libraries. The initial experiment of five million reads shows that the human library yields more distinct reads, and accordingly this library was sequenced deeper. 
Yet after sequencing approximately 25 million reads for each library the chimp 17 0 2 4 0 2 4 0 40 80 120 0 40 80 120 0 40 80 120 0 40 80 120 1 3 5 1 3 5 sequenced reads (millions) distinct reads (millions) Human Sperm estimated yield: Chimp Sperm estimated yield: ZTNB , ET , RF ZTNB , ET , RF a b c observed: observed: Figure 2.5: Observed versus estimated yield curves for Human Sperm BS-seq and Chimp Sperm bisulfite sequencing libraries [83]. (a) Complexity curves for initial exper- iments of 5M reads. (b) Complete experiment complexity curves with estimated yields by zero-truncated Negative Binomial(ZTNB), Euler’s transformation (ET), and rational function approximation (RF) applied to Equation (2.3). (c) Rational function approxi- mation predicted complexity curves. library yielded more distinct reads. We use the initial experiment to extrapolate the com- plexity curve using a heterogeneous parametric model (zero-truncated Negative Bino- mial), Euler’s transformation applied to the Good & Toulmin power series, and rational function approximations to the Good & Toulmin power series (Figure 2.5). The Nega- tive Binomial model does not predict the crossing, while both Euler’s transformation and rational function approximations predict the crossing. Euler’s transformation diverges quickly after the crossing while the rational function approximations stay stable and accurately predict the curves for both libraries. In our previous paper on this problem [24], we made comparisons of the observed yield of distinct reads to the estimated yield of distinct reads using five million mapped read initial experiments for multiple types of experiments including bisulfite, whole genome, exome, RNA, and chromatin immunoprecipitation (ChIP) sequencing. In the 18 meantime we have made heuristic improvements to the software to increase the accu- racy of the predictions. Here we compare the observed complexity to the complex- ity estimated from initial samples of five million reads. All DNA, exome, and ChIP sequencing libraries were mapped with Bowtie2 [69], the bisulfite sequencing libraries were mapped with RMAPBS [96], and the RNA sequencing libraries were mapped with STAR [30]. Paired end libraries were run in paired end mode, where both ends of a read must match to call a duplicate. The estimated yield in distinct reads is highly accurate in all cases, even for extrap- olations significantly larger than the initial experiment. Extrapolating the yield beyond 80 fold of the initial experiment may not be accurate. For example in the two single cell DNA sequencing experiments of a tumor cell, one remains highly accurate beyond 100 fold of the initial experiment (Figure 2.6 (b)) while for the other the estimate diverges after approximately 80 fold (Figure 2.6 (a)). We previously noted that RNA-seq experiments were typically difficult to predict beyond 30 fold from an initial experiment of at least 4 M reads due to higher variance in the duplicate counts that results from larger departures from uniform abundance [24]. Here we observe the estimates are highly accurate, though we only extrapolate out to approximately 15 fold due to the depth of sequencing (Figure 2.7). Caution should still be used when extrapolating beyond 30 fold of the initial experiment or when using small initial experiments (< 4 M reads) for highly heterogeneous libraries. 
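The procedure used throughout this chapter, namely bootstrap the counts histogram, fit a rational function of degree Q = P + 1 to the Good & Toulmin series, and aggregate by taking the median, can be sketched compactly as below. The sketch is structural only: it uses SciPy's off-the-shelf Pade routine, which solves the ill-conditioned Hankel-type system, in place of the quotient-difference algorithm and continued fraction evaluation recommended above, and the histogram is fabricated. Spurious pole-zero defects of the kind discussed in section 2.2.1 can still occur and are not detected here.

import numpy as np
from scipy.interpolate import pade

def gt_pade_extrapolate(counts_hist, t, n_coeffs=9):
    """Rational-function extrapolation of the Good & Toulmin series in the
    variable x = t - 1.  Illustrative only: SciPy's pade routine replaces
    the quotient-difference / continued-fraction machinery of the text.
    counts_hist[j] = n_j for j >= 1; counts_hist[0] is ignored."""
    n_j = np.asarray(counts_hist, dtype=float)
    k = min(n_coeffs, len(n_j) - 1)
    if k % 2 == 0:          # keep an odd number of series terms so that the
        k -= 1              # denominator degree is the numerator degree + 1
    j = np.arange(1, k + 1)
    coeffs = np.concatenate([[0.0], ((-1.0) ** (j + 1)) * n_j[1:k + 1]])
    p, q = pade(coeffs, (k + 1) // 2)      # degrees (P, Q) with Q = P + 1
    x = t - 1.0
    return p(x) / q(x)

def bootstrap_median_yield(counts_hist, t, n_boot=100, seed=0):
    """Bagging as in section 2.3: resample the counts histogram from the
    multinomial of distinct reads over count classes, extrapolate each
    resample, and report the median marginal yield."""
    rng = np.random.default_rng(seed)
    n_j = np.asarray(counts_hist, dtype=float)
    D = n_j[1:].sum()                      # distinct reads observed
    probs = n_j[1:] / D
    estimates = []
    for _ in range(n_boot):
        resampled = np.concatenate([[0], rng.multinomial(int(D), probs)])
        estimates.append(gt_pade_extrapolate(resampled, t))
    return float(np.median(estimates))

# Fabricated histogram, illustration only.
hist = [0] + [int(3e6 / (j ** 3)) for j in range(1, 26)]
print(bootstrap_median_yield(hist, t=10.0))   # additional distinct reads
                                              # expected at 10x the depth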
19 0 200 400 600 mapped reads (millions) 0 200 400 600 distinct reads (millions) 0 200 400 600 0 200 400 600 0 50 100 150 0 50 100 150 0 50 100 150 0 50 100 150 0 20 40 0 20 40 0 40 80 0 40 80 0 200 400 0 200 400 0 200 400 0 200 400 0 200 400 0 200 400 mapped reads (millions) distinct reads (millions) mapped reads (millions) distinct reads (millions) a b c d e f g h i observed estimated initial experiment Figure 2.6: Observed and estimated yield from an initial experiment of five million mapped reads for (a) paired end DNA sequencing of a single tumor cell from cell line SW480 using MALBAC whole genome amplification [116], (b) paired end DNA sequencing of a single tumor cell from cell line SW480 using MDA whole genome amplification [116], (c) paired end DNA sequencing of a single sperm cell using MAL- BAC whole genome amplification [76], (d) paired end DNA sequencing of a sin- gle sperm cell using MDA whole genome amplification [108], (e) paired end exome sequencing of a single clear cell renal cell carcinoma cell using MDA whole genome amplification [112], (f) paired end exome sequencing of a single circulating tumor cell of a lung cancer patient using MALBAC whole genome amplification [84], (g) single end bisulfite sequencing of the frontal cortex [72], (h) paired end bisulfite sequencing of a placenta [100], and (i) paired end bisulfite sequencing of a primary adenocarci- noma [115] 20 0 100 200 mapped reads (millions) 0 100 200 distinct reads (millions) 0 20 40 60 0 20 40 60 0 40 80 0 10 20 0 20 40 60 0 20 40 60 0 40 80 0 40 80 0 30 60 0 30 60 mapped reads (millions) distinct reads (millions) a b c d e f observed estimated initial experiment Figure 2.7: Observed and estimated yield from an initial experiment of five million mapped reads for (a) single end BRD4 ChIP-seq of acute myeloid leukemia cell line MOLM-1 [49], (b) single end H3K9me3 ChIP-seq of primary fetal adrenal gland (Uni- versity of Washington Human Reference Epigenome Mapping Project), (c) single end STAT4 ChIP-seq of stimulated peripheral blood mononuclear cells [105], (d) paired end RNA-seq of a single cell from the GM12878 cell line using SMART-seq protocol [80], (e) paired end RNA-seq of a single neuron in vitro using the TIV A protocol [75], and (f) paired end RNA-seq of a single embryonic stem cell using the Quartz-Seq protocol [91] 21 Chapter3 Extrapolatinggenomecoverage For standard bulk genome sequencing of a population of cells the Lander-Waterman model [68] is commonly used in sequencing projects to guide the depth of sequenc- ing [94, 2]. For single cell and low input sequencing, the uniformity assumption is commonly lost for a myriad of reasons [6] but can be attributed to one fundamental dif- ference: the amount of starting material. Standard bulk sequencing takes a population of thousands or more commonly millions of individual cells to construct the sequenc- ing library. In contrast, for single cell sequencing the relevant DNA only exists in a single copy. Special protocols are needed to prepare DNA sequencing libraries in sin- gle cell applications. Whole-genome amplification (WGA) is conducted prior to PCR, with the goal of producing more copies of the genome in the form of long amplicons that hopefully cover the original genome. WGA is necessary because most standard library preparations, such as Illumina’s Nextera kit [1], specify a minimum input in the nanogram range [10] while the nuclear DNA of a human cell weighs approximately 80 picograms. 
Biases in WGA can dramatically alter the representations of different parts of the genome in the sequencing library [56]. Methods have been developed to minimize WGA amplification bias by reducing the limiting volume for multiple displacement amplifi- cation (MDA) to avoid exponential preferential amplification [41, 108] or looping of the amplicons to induce quasi-linear amplification (MALBAC; [116]). Despite these advances, libraries produced by whole genome amplification remain far from uniform. 22 A major problem in single cell and low input sequencing is that portions of the genome are uncovered even with extremely deep sequencing, indicating that these regions are absent from the library. There are multiple opportunities for portions of the genome to disappear in the library preparation, making them unavailable for sequenc- ing and subsequent observation. This situation is known as locus dropout and creates significant problems for later analysis [92]. For diploid cells, locus dropout presents the additional difficulty that the dropout of one allele is easily mistaken for homozygosity and leads to mistakes in later analysis. To select a model we must keep two things in mind: some portions of the genome may be missing and bias in the sequencing process resulting in non-uniform representa- tion. There are many biases inherent to high-throughput sequencing [94]. These prob- lems exist for single cell sequencing experiments but are exacerbated by the low starting material, in addition to biases specific to WGA. The highly non-uniform molecular abundances and unknown locus dropout in the sequencing library are not the only problems in specifying a model for the genome cov- erage in single cell sequencing experiments. The coverage of nearby bases will be highly correlated. There is the natural correlation caused by nearby bases being covered by the same read in addition to correlations due to similar molecular abundances of nearby regions. One example is the observation that priming efficiency and extension rate of the DNA polymerase29 used in MDA is dependent on nucleotide content, leading to uneven amplification [86]. These all create problems in mathematically modeling the sequencing process, as misspecification can create significantly biased estimates. We model the number of reads covering each position as a compound Poisson ran- dom variable. Instead of treating a sequenced read as an observations, we treat indi- vidual sequenced nucleotides as independent observations despite the fact that the true unit of sampling is the sequenced read. The loss of information from this assumption is 23 minimal since the length of a read is much smaller than the length of the genome. We can then use the techniques derived in chapter 2 to obtain a method for estimating the number of bases covered in a reference genome, i.e. the genome coverage. In particular we can adapt the non-parametric empirical Bayes estimator of Good & Toulmin [47] developed in chapter 2 to obtain accurate and stable extrapolations of the genome cover- age. Furthermore there are practical considerations specific to the problem of estimating the genome coverage, including methods to reduce the running time by approximating the single-base resolution genome coverage by non-overlapping bins and considerations in reducing the cost of initial experiments. 
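Before setting out the model, the sketch below (Python) shows how the coverage counts histogram of Figure 3.1, with n_j the number of genomic bases covered by exactly j reads, can be tabulated from mapped read positions; this histogram is the input to the coverage extrapolation. The function name, the single-reference and ungapped-read simplifications, and the simulated amplification bias are assumptions of this sketch, not the actual implementation, which works from sorted mapped reads.

import numpy as np
from collections import Counter

def coverage_counts_histogram(read_starts, read_length, genome_length):
    """Tabulate the coverage counts histogram of Figure 3.1: n_j is the
    number of genomic bases covered by exactly j reads.  Single-end,
    ungapped reads on a single reference sequence are assumed."""
    diff = np.zeros(genome_length + 1, dtype=np.int64)
    for s in read_starts:                   # +1 at start, -1 one past end
        diff[s] += 1
        diff[min(s + read_length, genome_length)] -= 1
    depth = np.cumsum(diff[:genome_length])  # per-base coverage
    return dict(sorted(Counter(depth[depth > 0].tolist()).items()))

# Toy example: 1,000 reads of length 100 on a 100 kb reference, with start
# positions drawn non-uniformly to mimic amplification bias (all values
# fabricated for illustration).
rng = np.random.default_rng(2)
n_starts = 100_000 - 100 + 1
weights = rng.gamma(0.3, 1.0, size=n_starts)
starts = rng.choice(n_starts, size=1_000, p=weights / weights.sum())
hist = coverage_counts_histogram(starts, 100, 100_000)
print(hist)   # {1: n_1, 2: n_2, ...}; this histogram is what gets extrapolated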
3.1 Model Assume a sequencing experiment samples molecules from a large pool of DNA frag- ments Further, assume that the number of amplified copies of each DNA fragment is sufficiently large that sampling can be assumed to be with replacement. Each molecule sampled during sequencing corresponds to a sequenced read. Each sequenced read will cover multiple bases in the genome: every sequenced nucleotide covers exactly one base. Our goal is to use information from a shallow sequencing run to predict the num- ber of bases that would be covered after deep sequencing. We call the shallow sequenc- ing run the initial experiment, and we make the essential assumption that the properties of the library do not change between the initial experiment and deeper sequencing. We use slightly different notation here than chapter 2. Instead we follow the notation of Lander & Waterman [68], and define the following symbols: 24 G = haploid genome length in nucleotides (nt) represented in the library; L = read length in nt; N = number of reads sequenced in the initial experiment; tN = number of reads sequenced in the full experiment; t = fold extrapolation; i = probability a randomly sequence read covers basei; ~ = ( 1 ;:::; G ); ~ = N~ , the expected number of reads covering each position. Consider the random trial of sequencing an individual read. This can also be con- sidered asL trials, one per nucleotide in the read. The outcomes of theseL trials cor- responds to coveringL consecutive bases in the genome. A second read whose origin partially overlaps the first will provide an additionalL trials, some of which will cover new bases (Figure 3.1). The part of the genome where the two reads overlap, however, will correspond to outcomes observed twice. Although the outcome of each of these LN trials is dependent onL 1 others, this dependence is highly-localized. Similarly, for a given base in the genome, the number of trials (sequenced nucleotides) whose out- come corresponds to that base also has a strong local dependence: the number of reads covering any given base is dependent on the number covering 2L 2 others. In a similar manner, Efron & Thisted [31] applied the general compound Poisson model to estimate Shakespeare’s vocabulary based on his printed works. Word counts are dependent since the way words may be combined in English text follows both gram- matical and stylistic constraints. For example, in Julius Ceasar, the word “thou” pre- cedes “art” more than twice as often as it precedes “the, ” despite the word “the” appear- ing fifty times more often than “art” in the entire play [95]. Efron & Thisted argued that this amount of dependence can be ignored in a large enough body of work. Unfortu- nately, the dependence introduced by sampling nucleotides as contiguous reads is much 25 0 10 20 30 genome coverage histogram genomic bases times covered reference genome mapped reads 1 2 3 4 5 6 a b c Figure 3.1: (a) Schematic with sites colored to indicate how genome coverages are tabulated, (b) the corresponding colored histogram and (c) estimated coefficients for the Good & Toulmin power series. stronger: by covering theith base in the genome, one greatly increases the probability that basei + 1 will be covered. For our application, however, the relationsNL andGL both hold in practice, meaning that dependence between events (nucleotides in reads) and outcomes (covered bases in the reference genome) are both extremely limited. 
When considering the out- comes, since we assume that reads are multinomially sampled, as we previously con- sidered, the number of reads covering a position depend only on the L 1 positions downstream andL 1 positions upstream. Since the number of reads starting at differ- ent positions are assumed to be independent, the set of dependence is 2L 1 bases and any two bases separated by more than 2L 1 bases will be independent. The propor- tion of dependent observations will be much smaller than the proportion of independent observations. We therefore treat all bases as independent, with little loss in accuracy. Similar to the previous section, define n j (t) as the number of bases covered by exactlyj reads after sequencingtN reads and letn j denote the number of bases cov- ered by j reads in the initial experiment (i.e. n j = n j (1)). We call the full vector (n 1 ;n 2 ;:::) the coverage counts histogram. Assume that the number of reads that cover each position follows a Poisson process with rates independently and identically dis- tributed according to, an arbitrary distribution with meanNL=G. This is known as 26 the non-parametric compound Poisson model [107] and under this model the expected value ofn j (t) is equal to E n j (t) =G Z 1 0 e t (t) j =j! d(): (3.1) The number of positions in the genome that are not covered by reads but can be covered with deeper sequencing,n 0 , is unknown. This implies thatn 0 is non-identifiable in the non-parametric model [71], so analysis must be done with the identifiable portions of the model, the counts (n 1 ;n 2 ;:::). This allows us to use the Good & Toulmin estimator given in equation (2.3) and the rational function approximation detailed in section 2.2 3.2 Accuracyofpredictions We evaluated our method on 19 single-end 100 or 101 nucleotide (nt) single cell sequencing experiments [108, 116, 76] that were “deeply-sequenced,” which is arbi- trarily defined as at least 100M reads. Each library was mapped to the human genome build hg19 with bwa version 0.6.2 [70] using default parameters. We randomly took 5 million (M) reads from each library to simulate the initial sequencing experiment and compared the estimated genome coverage curve to the observed genome coverage curve, calculated by downsampling the library and using BEDTools [87]. The accuracy of our estimates is measured by the relative error, which is defined as the estimated genome coverage minus the observed divided by the observed. A negative value will indicate an underestimate of the coverage and positive value will indicate an overestimate. Even for long-range extrapolations, the estimates remain stable and accurate. When we use 5M sequenced 101 nucleotide reads (0.17x coverage) to predict the genome coverage for 100M sequenced 101 nucleotide reads (3.33x coverage), a 20-fold extrap- olation, we observed a mean absolute relative error of< 3%. 27 We noticed that our choice of order of the rational function approximation tends to give conservative estimates of the genome coverage. Theoretically the distribution of the curves should be symmetric about the mean but we observed a downward skew to curves (D’Agostino skewness test [23] on the relative error for the 20 fold extrapolations p< 3E-16). We investigated bootstrapping to reduce the skew and lower the variance of our estimates [12]. The bootstrapped median shows significant improvement over the simple extrapolations and even the bootstrapped mean (Figure. 
3.2), indicating that the use of bootstrapping and aggregating the median curve leads to more accurate estimates. 3.3 Practicalconsiderations 3.3.1 Binningtoapproximatecoverage The major computational demands in the extrapolation algorithm come from bootstrap- ping the coverage counts histogram to reduce variance of the estimator (see section 2.4 but with the total number of observations equal toNL). The time required for resam- pling is a function of the number of possible outcomes for each event, in this case individual genome positions. An approach to reduce the running time is to partition the genome into non-overlapping bins and use the bins as the outcomes of each ran- dom trial. Given an estimate for the number of bins that would be covered after deep sequencing from the library, the genome coverage can be estimated by multiplying that number of bins by their size (or their average size, if the bins had varying sizes). We call this approach as a binning based estimator. It is not difficult to see that counts for bins can be supplied to the Good & Toulmin estimator in the same way as counts for individual nucleotides as shown in Figure. 3.1. We assume the genome has been partitioned into equal sized bins. There are several possible schemes one may follow to construct a binning based estimator. If we say 28 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 −0.2 −0.1 0.0 0.1 0.2 Frequency 0 10 20 30 0 10 20 30 0 10 20 30 Frequency 0 10 20 30 0 10 20 30 0 10 20 30 Frequency 0 10 20 30 0 10 20 30 0 10 20 30 MALBAC diploid MALBAC haploid MDA haploid Simple extrapolation Mean extrapolation Median extrapolation Frequency Frequency Frequency Frequency Frequency Frequency Figure 3.2: Distribution of the relative error for the rational function approximation (RFA) 20 fold extrapolation, the bootstrapped mean RFA 20 fold extrapolation, and the bootstrapped median RFA 20 fold extrapolation for the MALBAC diploid, MALBAC haploid, and MDA haploid libraries. that a read covers a bin only if it completely covers the bin then we will consistently underestimate the coverage, and this excludes use of bins larger than the read length. Conversely, if we say a read covers a bin if there is any overlap then we will consistently overestimate genome coverage. A reasonable approach is to say that a read covers a bin with probability proportional to the number of nucleotides in the read within the 29 genome Prob =1/5 Prob = 1 5 bp bins Prob = 4/5 genome 5 bp bins counts 0 1 1 2 2 1 0 1 1 Example: 10 bp reads Figure 3.3: A toy example of the probabilistic binning strategy with 5 nucleotide (nt) bins and 10 nt reads. bin (Figure. 3.3). This ensures that, for a single read, the probability that any position is covered by that read is unbiased. Unfortunately one can demonstrate that even for binning based estimators that are unbiased for a single read, they will be biased for multiple reads (Appendix A.1). This also works in the other direction, if a strategy is biased for a fixed number of multiple reads then it will be biased for any other number of reads. The binning based estimator does indeed tend to slightly underestimate coverage. Despite this, we observe that the binned coverage counts histogram is approximately proportional to single-base resolution coverage counts histogram. 
This holds even when the bin size is larger than the read length, so that reads are randomly thrown out or extended to cover whole bin while on average covering the same number of bases (Fig- ure. 3.4a). The gain from binning is a reduction in running time proportional to the bin size (Figure. 3.4b). This comes at an acceptable cost in accuracy (Figure 3.4a). 30 0 2.9 5.8 8.7 11.6 14.5 total coverage 0 1 2 3 covered bases (billions) a b 0 50 100 150 200 250 running time (minutes) bin size (log scale) observed 1 bp bin initial experiment 50 bp bin 200 bp bin 10 bp bin ● ● ● ● ● ● ● ● 1 5 20 100 1 5 20 100 40 50 60 70 bin size (log scale) c σ (millions of base pairs) Figure 3.4: (a) Observed coverage curve versus estimated using bin sizes of 1, 10, 50, and 200 using the same 5M 100 nucleotide single end read initial experiment from MALBAC diploid library SRX202787. (b) Running times to extrapolate the curve for different bin sizes. (c) Observed standard deviation of the full extrapolation estimates (to 398M mapped reads) for different bin sizes from 250 independent 5M read down- sampled initial experiments The binning procedure naturally introduces variance in the estimated coverage count histogram. This variance increases with the bin size since portions of reads are more likely to be thrown out or to be extended with increasing bin size. On the other hand, the binning will group close neighboring bases together and reduce the variance introduced when treating close bases as independent. Accordingly the variance of the estimated coverage should achieve a minimum lying somewhere between single nucleotide and the read length (Figure. 3.4c). Timecomplexityofbinning We assume, as previously (Section 2.4), that the reads are sorted, which has a time complexity of O(N logN). For simplicity we will assume that reads are single end and include no gaps. Let M be the largest coverage count. M can be as large as the 31 number of reads sequenced if every read maps to the same location but is much smaller in practice. The largest number of reads that can cover any particular base isML. Each read is processed sequentially. For bin sizeb, we compute the consecutive non- overlapping bins of the genome that the read overlaps. The read partially covers at most d L b e + 1 bins. We perform our random binning on each bin, splitting up the read into at mostd L b e + 1 parts which we call the binned reads. These parts are put in a priority queue, sorted by start position, to save for comparison. When we reach a read that is far enough away from the head of the priority queue, determined by the smallest mapped position, we start outputting binned reads till the head of the priority queue is within the specified distance. When we process and split the reads we perform at most a total of O N(d L b e + 1) operations. For each binned read insertion and deletion in the priority queue takes O(logn) in the worst case, wheren is the size of the priority queue. We can bound the size of the queue by the maximum number of binned reads at each bin times the number of distinct binned reads in the queue at one time, ML(d L b e + 1). Therefore the worst case time complexity in constructing the coverage counts histogram is O N L b + 1 log ML L b + 1 : 3.3.2 Effectsofreadlengthoncoverageestimates At present the vast majority of whole-genome sequencing experiments, including sin- gle cell experiments, use Illumina sequencing technology. Reads from this technology correspond to the ends of DNA fragments. 
Suppose one sequences lengthL nucleotide reads that map unambiguously to a reference genome, and that the DNA fragments in the library all have length greater thanL. Using information about the genome coverage from that experiment, one could compute what the genome coverage would have been 32 had the reads been of lengthL + 1. This can be done by “extending” the reads in silico, i.e. pretending the reads were longer. The cost of a sequencing experiment is approximately a linear function of the num- ber of reads sequenced and the length of reads. For a variety of reasons, including optimization of total throughput at sequencing centers, investigators are often presented with a relatively limited set of options. Often there are options for a few standard read lengths that include both “short” and “long” options (at present, e.g., 36 nt and 100 nt). An interesting question is whether an approach of extending reads in silico could help minimize the cost of the initial experiment while still accurately predicting genome coverage in deeper sequencing with true longer reads. In line with the assumption above, we expect that in silico read extension will work as long as the reads are mapped to their correct positions at shorter read lengths. While this places a lower bound on the lengths of reads that can be used, there are only small differences in mappability for reads longer than 36 nt [29]. To evaluate this approach, we examined four libraries given by Sequencing Read Archive accession numbers SRX151616, SRX202787, SRX204160 and SRX205367, which correspond respectively to an MDA haploid, MALBAC diploid, MDA diploid, and MALBAC haploid libraries. We downsampled 5M reads originally of length 100 nt, and then truncated the original reads at 50 nt prior to mapping. We tested the results for extending the reads by 50 nt in silico versus extending the fold-extrapolation. We observed that the artificially extended reads consistently overestimate the genome cov- erage (mean relative error across all libraries = 0:018). On the other hands, extending the fold-extrapolation on the mapped truncated reads shows an increase in relative error compared to the full length reads (mean relative error of0:028 across all libraries). 33 This is not unexpected since we are extrapolating further out and should expect our esti- mates to be less accurate. This indicates that artificially extending the reads in silico is the incorrect strategy and we should simply extend the fold extrapolation. We next examined the effect of read length on the estimates while holding constant the total number of nucleotides sequenced. For each of the four libraries, we randomly sampled 5M reads at 100 nt, and 10M reads at 50 nt. The 50 nt read samples con- sistently yield more accurate predictions of genome coverage and have lower variance (Figure. 3.5). This result suggests that any bias associated with mappability and frag- ment length variance, at least for 50 nt reads and the human reference genome, may be acceptable to reduce costs for library test runs. 3.3.3 Doesincludingduplicatereadsimpactpredictions? It is standard in many sequencing applications to remove all but one read among a set believed to be amplified copies of the same original DNA fragment [94]. Multi- ple options are available for identifying duplicate reads [63]. For most current methods of identifying amplification duplicates, all duplicate reads for a given molecule will map to the same position in the genome. 
As a consequence, the genome coverage for a given sequencing experiment will be identical whether or not duplicates are removed. Moreover, in theory the number of distinct reads that cover a particular position should also approximately follow a Poisson process. This observation suggests two possible avenues for using the Good & Toulmin genome coverage estimator to predict coverage from deeper sequencing. The direct approach is to ignore duplicate reads when making predictions based on the test run. A different approach would be to make estimates using only unique reads, and then to make predictions in two steps: (i) predict the number of unique reads in a deeper experiment, and (ii) predict genome coverage conditional on the number of unique reads sequenced in the full experiment (Figure 3.6). This second approach includes two predictive steps, which will increase the variance. On the other hand, neither of these steps would individually extrapolate as far as the direct approach, and each would have lower variance, since we know that the non-parametric empirical Bayes framework is more accurate when extrapolations are less extreme.

Figure 3.5: A comparison of error when using 5 million 100 nt reads versus 10 million 50 nt reads for four libraries representative of the four experiment types: SRX151616, MDA haploid; SRX202787, MALBAC polyploid; SRX204160, MDA polyploid; SRX205367, MALBAC haploid. 100 independent initial experiments were taken and used to calculate the mean estimated coverage curve and 95% confidence intervals for each of the four libraries.

Although genome coverage estimates with duplicates and without duplicates should be equal, the variances may differ. Therefore we compare the results from the same library under both cases, extrapolated to the same genome coverage achieved at 100M reads with duplicates included, so that the fold-extrapolation is less for duplicates removed. For some cases (MALBAC libraries), the performance with duplicates removed is comparable to the performance when including duplicates. But for the MDA haploid libraries, the performance is worse (paired t-test, p < 3E-16). We therefore recommend that the genome coverage is predicted with duplicates included, but note that it can also be done with duplicates removed.

Figure 3.6: Initial experiment of 5M reads with duplicates (as determined by mapping position) removed, resulting in 4,986,080 distinct mapped reads. Extrapolated coverage curves against the coverage curve for distinct mapped reads using 1 nt (nucleotide), 10 nt, 50 nt, and 200 nt bins.

Chapter 4
Extrapolating the min-count complexity curve

The problem of predicting the number of distinct reads or events observed at least once was discussed previously in chapter 2.
We now consider the problem of predicting the number of events that are observed at least a fixed number of times. This can be considered a generalization of the problem discussed in chapter 2, obtained by setting the minimum to one. In traditional capture-recapture problems an event would be the capture of a single species or sampling of a class. In sequencing, events can be sequencing a read as discussed in chapter 2, a base being covered by a distinct read as discussed in chapter 3, a junction in RNA-seq, a read overlapping a reference exon or gene, and so on. This problem is of interest in situations where the event is required to be observed a minimum number of times to prevent erroneous identification, such as in exome sequencing, where a position should be covered multiple times to prevent erroneous variant calls due to low sequencing coverage [66].

4.1 Deriving a non-parametric empirical Bayes estimator

We assume the model of chapter 2. For a fixed $k > 1$, define $\Delta_k(t)$ to be the number of distinct events observed at least $k$ times in the full experiment. An event is counted in $\Delta_k(t)$ if it was observed less than $k$ times in the initial experiment but is observed at least $k$ times in the full experiment. Following the methods of section 2.1, we can derive an unbiased power series estimator for $\Delta_k(t)$ by first noting that $\Delta_k(t)$ is equal to the number of events observed at least $k$ times in the full experiment but not in the initial experiment. This is also equal to the difference between the expected number of events observed less than $k$ times in the initial experiment and in the full experiment. Using equation (2.1) we can derive a power series expansion for $\Delta_k(t)$ as

\Delta_k(t) = S \sum_{r=0}^{k-1} \int_0^\infty e^{-\lambda} \lambda^r / r! \, d\mu(\lambda) - S \sum_{r=0}^{k-1} \int_0^\infty e^{-t\lambda} (t\lambda)^r / r! \, d\mu(\lambda)
 = S \sum_{r=0}^{k-1} \frac{1}{r!} \int_0^\infty e^{-\lambda} \Big( 1 - t^r \sum_{j=0}^{\infty} \frac{(-1)^j (t-1)^j \lambda^j}{j!} \Big) \lambda^r \, d\mu(\lambda)
 = S \sum_{r=0}^{k-1} \Big( \int_0^\infty \frac{e^{-\lambda} \lambda^r}{r!} \, d\mu(\lambda) + t^r \sum_{j=0}^{\infty} \frac{(-1)^{j+1} (t-1)^j}{j!\, r!} \int_0^\infty e^{-\lambda} \lambda^{j+r} \, d\mu(\lambda) \Big)
 = \sum_{r=0}^{k-1} \Big( E(n_r) + t^r \sum_{j=0}^{\infty} (-1)^{j+1} (t-1)^j \binom{r+j}{j} E(n_{r+j}) \Big).    (4.1)

Note that when $k = 1$, equation (4.1) is exactly equal to the Good & Toulmin estimator of equation (2.2). Furthermore, Good & Toulmin [47] derived an unbiased estimator for the number of events observed exactly $r$ times in the full experiment as

E\,n_r(t) = t^r \sum_{j=0}^{\infty} (-1)^j \binom{r+j}{r} (t-1)^j E(n_{r+j}).    (4.2)

We can see that equation (4.1) is equal to the sum of the expected differences in the numbers of events observed less than $k$ times. To apply a rational function approximation to the power series (4.1), we will need to expand it as a single power series instead of a sum of $k$ different power series. By the binomial theorem, we can expand $t^r = (1 + (t-1))^r = \sum_{l=0}^{r} \binom{r}{l} (t-1)^l$. Substituting this into equation (4.1) and rearranging terms allows us to write it as a single power series:

\sum_{r=0}^{k-1} \Big( E(n_r) + t^r \sum_{j=0}^{\infty} (-1)^{j+1} (t-1)^j \binom{r+j}{j} E(n_{r+j}) \Big)
 = \sum_{r=0}^{k-1} \Big( E(n_r) + \sum_{l=0}^{r} \binom{r}{l} (t-1)^l \sum_{j=0}^{\infty} (-1)^{j+1} (t-1)^j \binom{r+j}{j} E(n_{r+j}) \Big)
 = \sum_{r=0}^{k-1} \Big( E(n_r) - \sum_{j=0}^{\infty} (t-1)^j \sum_{l=0}^{j} (-1)^l \binom{r+l}{r} \binom{r}{j-l} E(n_{r+l}) \Big)
 = \sum_{r=0}^{k-1} \sum_{j=1}^{\infty} (t-1)^j \sum_{l=0}^{j} (-1)^{l+1} \binom{r+l}{l,\; j-l,\; r+l-j} E(n_{r+l})
 = \sum_{j=1}^{\infty} (t-1)^j \sum_{r=0}^{k-1} \sum_{l=0}^{j} (-1)^{l+1} \binom{r+l}{l,\; j-l,\; r+l-j} E(n_{r+l}).    (4.3)

Substituting the unbiased observed count frequencies, $n_j$, for the expected count frequencies, $E(n_j)$, gives an unbiased estimator. For brevity we will refer to this as the NPEB (non-parametric empirical Bayes) min-count estimator.
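The coefficients of the series (4.3) can be computed directly from a counts histogram. The sketch below is a direct transcription of the formula (the function name and toy histogram are illustrative, not part of the thesis software); for $k = 1$ it reduces to the Good & Toulmin coefficients $(-1)^{j+1} n_j$.

```python
from math import factorial

def mincount_series_coeffs(counts_hist, k, max_power):
    """Coefficients c_1, ..., c_max_power of the power series in (t - 1)
    of equation (4.3), with the observed n_j plugged in for E(n_j).

    counts_hist[j] = n_j, the number of events observed exactly j times.
    """
    def n(j):
        return counts_hist.get(j, 0)

    coeffs = []
    for j in range(1, max_power + 1):
        c = 0.0
        for r in range(k):
            for l in range(0, j + 1):
                if j - l > r:
                    continue  # multinomial coefficient is zero in this case
                multinom = (factorial(r + l)
                            // (factorial(l) * factorial(j - l) * factorial(r + l - j)))
                c += (-1) ** (l + 1) * multinom * n(r + l)
        coeffs.append(c)
    return coeffs

# toy histogram: n_1 = 1000, n_2 = 400, n_3 = 150, n_4 = 60
hist = {1: 1000, 2: 400, 3: 150, 4: 60}
print(mincount_series_coeffs(hist, k=2, max_power=4))
```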
4.2 Rational function approximations to the NPEB min-count estimator

The Good & Toulmin power series (equation (2.3)) is in the form of an alternating power series. This helps in making the rational function approximations relatively well behaved. The NPEB min-count power series is not necessarily alternating, and this may complicate the application of rational function approximations. Additionally, the Good & Toulmin power series (2.3) is guaranteed to have a radius of convergence of 1. The NPEB min-count power series does not have this guarantee, and we should expect the radius of convergence to decrease with increasing $k$, as indicated by the first term, $t^r$, in equation (4.2).

Figure 4.1: Observed curves for the number of positions covered by at least a minimum of 2, 3, 4, 5, 6, and 7 reads in an exome sequencing experiment of a single circulating lung tumor cell [84], with the predicted curves estimated by rational function approximation (RFA) from an initial experiment of 5 million reads.

4.3 Min-count extrapolation results

We test the minimum count estimator by looking at the number of positions covered at least $k$ times for $k = 2, 3, 4, 5, 6, 7$, as a function of bases sequenced, in a single cell exome sequencing experiment of a circulating tumor cell from a lung cancer patient [84]. The experiment yielded approximately 70 million paired-end 101 nucleotide reads, which we mapped to human genome assembly hg19 with bowtie2 [69]. To sample the initial experiment, approximately five million paired-end reads were downsampled from the full experiment using Picard DownsampleSam (http://picard.sourceforge.net/), and the coverage counts histogram was constructed using BEDTools genomeCoverageBed [87].

As expected, the accuracy of the rational function approximations to equation (4.3) tends to decrease with increasing $k$ (Figure 4.1). The estimates for $k = 2$ are highly accurate, and the estimates for $k = 6$ are initially highly accurate but separate from the true curve quickly. On the other hand, the estimates for $k = 7$ appear to be accurate for moderate extrapolations.

It is clear that our understanding of the NPEB min-count estimator is weak. Significantly more work is needed to understand its properties and how to construct accurate and stable rational function approximations.
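The thesis pipeline builds continued fraction approximations as in chapter 2. As a quick, generic stand-in for experimenting with rational function approximations to a truncated series in $(t-1)$, one can also form a Padé approximant with SciPy; the coefficients below are made up for illustration, and in practice they would come from equation (4.3).

```python
import numpy as np
from scipy.interpolate import pade

# Illustrative coefficients c_1, ..., c_6 of a truncated series
# sum_j c_j (t - 1)^j; these are not real data.
c = np.array([1000.0, -400.0, 150.0, -60.0, 25.0, -10.0])

# Factor out (t - 1), build a Pade approximant p(x)/q(x) in x = t - 1
# for the remaining series, and evaluate the estimate as x * p(x) / q(x).
p, q = pade(c, 2)

for t in (2.0, 4.0, 8.0):
    x = t - 1.0
    print(t, x * p(x) / q(x))
```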
Chapter 5
Predicting saturation

The total relative abundance of unobserved reads is equal to the sum of the relative abundances (or probabilities) of the unobserved reads, i.e. for $p_i = \lambda_i / \sum_j \lambda_j$,

\pi(t) = \sum_{i=1}^{S} p_i \, 1\{ y_i(t) = 0 \}.    (5.1)

This has been referred to as the discovery probability [74, 33], while the quantity $1 - \pi(t)$ is commonly known as the sample coverage [44, 78]. In the DNA sequencing literature, coverage usually refers to the depth of sequencing, i.e. 10x coverage refers to sequencing ten times as many bases as the reference genome [94]. To avoid confusion we will refer to $1 - \pi(t)$ as the saturation, following previous work [99].

In a manner similar to equation (2.2), Good & Toulmin [47] derived a non-parametric empirical Bayes estimator for the discovery probability in the form of the alternating power series

\hat{\pi}(t) = \frac{1}{N} \sum_{j \ge 0} (-1)^j (t-1)^j (j+1) n_{j+1}.    (5.2)

The expectation of $\hat{\pi}(t)$ is equal to $\sum_{i=1}^{S} p_i e^{-t\lambda_i}$, so that $\hat{\pi}(t)$ is not unbiased for $\pi(t)$. The discovery probability $\pi(t)$ is non-identifiable for $t > 1$ [52, 64]. Yet it can be shown that $\hat{\pi}(t)$ is unbiased for the identifiable part of $\pi(t)$, and the difference goes to zero as the number of distinct classes goes to infinity [78].

We make an observation that allows us to derive $\hat{\pi}(t)$ in a simple and direct manner using the previous results of chapter 2. Namely, we note that the discovery probability is proportional to the derivative of the complexity curve. This allows us to derive a simple method to calculate $\hat{\pi}(t)$ by using the method of chapter 2 to extrapolate the complexity curve and then take the derivative. We will show that the estimated saturation is highly accurate and avoids the instability problem of the Good & Toulmin power series.

5.1 Using the derivative of the complexity curve to estimate the saturation

From chapter 2, recall that $\Delta(t)$ is the number of previously unobserved distinct reads expected from additional sequencing. The derivative of $\Delta(t)$ is defined as

\frac{\partial}{\partial t} \Delta(t) = \lim_{\epsilon \to 0} \frac{\Delta(t + \epsilon) - \Delta(t)}{\epsilon} \approx \frac{\Delta(t + \epsilon) - \Delta(t)}{\epsilon},

with the approximate equality holding for small $\epsilon$.

Consider sequencing a single read, corresponding to $\epsilon = 1/N$. The quantity $\Delta(t + 1/N) - \Delta(t)$ is equal to the expected change in the yield curve if a single read is sequenced. This is also equal to the probability of observing a new read with the next sequenced read. Therefore

\Delta(t + 1/N) - \Delta(t) = \sum_{i=1}^{S} p_i \, 1\{ y_i(t) = 0 \} = \pi(t).

We can obtain an approximate formula for $\pi(t)$,

\pi(t) \approx \frac{1}{N} \frac{\partial}{\partial t} \Delta(t).    (5.3)

We can verify equation (5.3) by looking at the derivative of the Good & Toulmin power series for $\Delta(t)$,

\frac{\partial}{\partial t} \sum_{j \ge 1} (-1)^{j+1} (t-1)^j E(n_j) = \sum_{j \ge 1} (-1)^{j+1} j (t-1)^{j-1} E(n_j) = N \pi(t).

Recall that the error of a continued fraction approximation to $\hat{\Delta}(t)$ of depth $C$ is $O\big((t-1)^{C+1}\big)$, i.e. equation (2.4). Taking the derivative implies that the error is $O\big((t-1)^{C}\big)$.

To estimate the derivative we use the complex derivative [97]. For small $\epsilon > 0$,

\frac{\partial}{\partial t} \Delta(t) \approx \mathrm{Im}\big( \Delta(t + i\epsilon) \big) / \epsilon.

To construct the full curve we need to interpolate the saturation. To do this we can subsample from the observed data and use the Good & Turing estimate for the saturation, equal to one minus the number of singleton reads divided by the number of subsampled reads [44]. This is known to have an error proportional to one over the number of subsampled reads [9].
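A minimal sketch of equation (5.3) combined with the complex derivative is given below, assuming one has an extrapolated yield curve that can be evaluated at complex arguments. The closed-form curve, constants, and function names here are made-up stand-ins; in practice an evaluated continued fraction approximation would play the role of Delta.

```python
import cmath

def discovery_probability(Delta, t, N, eps=1e-20):
    """Estimate the discovery probability at relative depth t via the
    complex-step derivative of the extrapolated yield curve Delta:
    pi(t) ~ Im(Delta(t + i*eps)) / (eps * N), as in equation (5.3)."""
    return (Delta(t + 1j * eps)).imag / (eps * N)

# Stand-in yield curve with a closed form; any callable accepting
# complex t could be used instead.
asymptote, rate, N = 1.0e6, 0.35, 5.0e6
Delta = lambda t: asymptote * (1.0 - cmath.exp(-rate * t))

for t in (1.0, 2.0, 5.0):
    pi_t = discovery_probability(Delta, t, N)
    print(t, pi_t, 1.0 - pi_t)   # discovery probability and saturation
```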
5.2 Saturation extrapolation results

We will show that this method accurately reconstructs the full saturation curve. To do this we need a case where we know the true properties of the library, which is impossible if the library has not been sequenced to saturation. Therefore we take a library that has been sequenced deeply and call the observed experiment the population. To simulate sequencing from the library we sample reads with replacement. This gives us a known case against which we can compare our estimates. We use a deep RNA-seq library from a mouse tumor expressing transgene MLL-AF9, consisting of 1,196,520,129 single-end 101 nucleotide reads sequenced on an Illumina HiSeq and mapped with Tophat2 version 2.0.6 [62] to mouse genome assembly mm9. This resulted in 1,021,240,281 uniquely mapped reads. Further details on the simulation can be found in section 6.3.1.

We look at the saturation for three classes: distinct reads as determined by the mapped start and end position, distinct observed exons, and distinct observed junctions (which may include junctions that are not exon to exon). For the latter two classes, some reads may not correspond to a class or may count toward two classes, e.g. a read covering an exon-exon junction would count towards each exon. If a read does not correspond to a class, it will not be included in the count histogram, but it will be included in $N$, so that the estimated discovery probability is relative to the entire experiment rather than relative to the other classes.

The initial experiment is simulated by sampling sequenced reads with replacement from the full library. We simulate initial experiments of two million and five million reads and compare the estimated saturation to the true saturation for the three above categories (Figure 5.1). We see that using the derivative of the complexity curve accurately extrapolates the saturation in most cases. We note that RNA-seq libraries typically have a few transcripts composing the vast majority of the abundance [99], so that libraries tend to be highly saturated after the initial experiment.

Greater understanding of this method is still needed. A study of the effect of sample size and possible optimizations of the procedure is necessary, as well as a thorough study of the accuracy of the method in the face of varying compounding distributions.

Figure 5.1: Observed versus estimated saturation from an initial sample of 2 million reads for (a) distinct reads, (b) exons, and (c) junctions, and from an initial sample of 5 million reads for (d) distinct reads, (e) exons, and (f) junctions.
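The interpolation step described in section 5.1 can be sketched as follows (the toy population, abundances, and function name are illustrative only): subsample reads with replacement and apply the Good & Turing estimate, one minus the fraction of subsampled reads observed exactly once.

```python
import random

def good_turing_saturation(read_ids, n_subsample, rng=random):
    """Good & Turing estimate of the saturation after n_subsample reads:
    one minus the number of singletons divided by the subsample size."""
    sample = rng.choices(read_ids, k=n_subsample)
    counts = {}
    for r in sample:
        counts[r] = counts.get(r, 0) + 1
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n_subsample

# toy population: 2000 distinct reads with unequal abundances
population = [i for i in range(2000) for _ in range(1 + i % 10)]
for n in (500, 2000, 8000):
    print(n, good_turing_saturation(population, n))
```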
Chapter 6
Lower bounds on the number of unobserved

The problem of estimating the total number of classes in an unknown population based on a random sample from the population is commonly known as the species diversity problem. This is a notoriously difficult problem in all but the most simple cases [13]. The number of unobserved classes is non-identifiable without specific assumptions on the relative abundances of classes. No unbiased estimator can exist in the non-parametric setting [71], or even when considering just the possibility that the compounding distribution lies within one of two families [55].

The non-identifiability issue means that most research is in obtaining lower bounds on the total number of classes, e.g. [16, 20, 31, 18, 79] among others. Harris [52] presented a general framework to obtain bounds on the number of unobserved, but only provided a way to use a small subset of the information from the experiment. It is reasonable to believe that using more information should improve the bounds. Yet there currently exists no framework in which to do this.

To obtain a framework to generalize the bounds of Harris, we reformulate the problem as bounding an integral of a function over an unknown distribution given moments of the distribution. This problem has a solution in the numerical integration literature, namely Gaussian quadrature [28]. We shall show how Gaussian quadrature gives a practical method to apply the theory of Harris. We use it to obtain lower bounds using significantly more information than previously considered. This allows us to make tighter bounds.

Consider the class of distributions with the first few moments known. By extending previous work [59, 19], Harris showed that the distributions attaining extreme values of the considered integral are discrete. This implies that the support points and weights of the extremal discrete distribution will satisfy a system of non-linear equations so that the moment constraints are met. In other words, the moments of the discrete distribution match the estimated moments. Analytic solutions for 2, 3, and 4 moments were derived, but higher order analytic solutions are not possible, since the solutions involve finding the roots of fifth or higher order polynomials. Yet using more moments means using more information and should give better bounds.

First we shall show how Gaussian quadrature calculates the points and weights, called the quadrature rules, of the extremal discrete distributions. Furthermore, the solution is unique, so that Gaussian quadrature rules correspond exactly to the bounds of Harris. We then discuss methods for calculating the quadrature rules. In particular we shall discuss algorithmic considerations brought about by extreme numerical ill-conditioning in the map from the observed moments to the quadrature rules. Finally we evaluate the method by simulating a sequencing experiment and comparing our new method to previous methods.

6.1 Model

Recall the model introduced in section 2.1. The number of observations from class $i = 1, \ldots, S$ at time $t$ is denoted $y_i(t)$, and $\lambda_i$ is the corresponding Poisson rate, which is independently and identically distributed according to some distribution $\mu$. We assume the observed sample corresponds to time $t = 1$. The count frequencies $n_j = \sum_{i=1}^{S} 1(y_i(1) = j)$ are the numbers of classes observed $j$ times in the sample. By the compound Poisson assumption, the expected value of $n_j$ is

E(n_j) = S \int_0^\infty e^{-\lambda} \lambda^j / j! \, d\mu(\lambda).

In particular, the expected number of unobserved classes is $E(n_0) = S \int_0^\infty e^{-\lambda} \, d\mu(\lambda)$. Since the number of observed classes plus the number of unobserved classes equals the total number of classes, any estimate for $n_0$ immediately provides an estimate for $S$. An expression for $E(n_0)$ that does not involve $S$ automatically provides an estimate of the total number of classes.

6.1.1 Bounding the number of unobserved

Following ideas introduced by Harris [52], Chao [16] defined the measure $\nu$ such that $d\nu(x) \propto x e^{-x} d\mu(x)$. The moments of $\nu$ can be expressed as

\eta_m = \int_0^\infty x^m \, d\nu(x) = \frac{S \int_0^\infty x^{m+1} e^{-x} \, d\mu(x)}{S \int_0^\infty x e^{-x} \, d\mu(x)} = (m+1)!\, E(n_{m+1}) / E(n_1).    (6.1)

If we express $E(n_0)$ in terms of $d\nu$, then

E(n_0) = S \int_0^\infty x e^{-x} \, d\mu(x) \cdot \frac{\int_0^\infty (1/x)\, x e^{-x} \, d\mu(x)}{\int_0^\infty x e^{-x} \, d\mu(x)} = E(n_1) \int_0^\infty \frac{1}{x} \, d\nu(x).    (6.2)

Estimating $n_0$ is now transformed into estimating the integral on the right hand side of equation (6.2) given the moments of $\nu$. We can now define the fundamental moment based problem:

\text{Estimate } \int_0^\infty \frac{1}{x} \, d\nu(x) \text{ such that } \eta_m = \hat{\eta}_m = (m+1)!\, n_{m+1} / n_1 \text{ for } m = 0, 1, \ldots,    (6.3)

where the moment estimates $\hat{\eta}_m$ come via equation (6.1), with the observed $n_j$ as unbiased estimates of $E(n_j)$. Extreme values for the above integral, subject to the given moment constraints, can be substituted into equation (6.2) to give bounds on $E(n_0)$.

Extreme values satisfying (6.3) correspond to discrete distributions satisfying the moment constraints [60, 52]. Therefore the integral can be computed by identifying the set of positive weights $\{w_1, \ldots, w_K\}$ and points $0 \le x_1 < \cdots < x_K$ satisfying

w_1 + \cdots + w_K = 1
w_1 x_1 + \cdots + w_K x_K = \eta_1
 \vdots
w_1 x_1^m + \cdots + w_K x_K^m = \eta_m.    (6.4)

Lower bounds are obtained by taking $m$ to be odd and $K = (m+1)/2$ [28].
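The moment estimates in (6.3) come directly from the counts histogram, as in the small sketch below (the function name and toy histogram are illustrative, not part of the thesis software).

```python
from math import factorial

def nu_moment_estimates(counts_hist, n_moments):
    """Estimates of the first n_moments moments of the measure nu,
    as in equation (6.3): eta_hat_m = (m + 1)! * n_{m+1} / n_1."""
    n1 = counts_hist.get(1, 0)
    return [factorial(m + 1) * counts_hist.get(m + 1, 0) / n1
            for m in range(n_moments)]

# toy counts histogram
hist = {1: 5000, 2: 1800, 3: 700, 4: 300, 5: 120}
print(nu_moment_estimates(hist, 5))   # eta_hat_0 is always 1
```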
6.1.2 Gaussian quadrature

A $K$-point discrete approximation to the integral

\int_0^\infty f(x) \, d\nu(x) \approx \sum_{k=1}^{K} f(x_k)\, w_k

is called Gaussian if the approximation is exact when $f$ is a polynomial of order $2K - 1$ or less. It is easy to see that this occurs if and only if the set of points, $\tilde{x}$, and weights, $\tilde{w}$, solve the system of equations (6.4) with $w_k \ge 0$ for all $k$.

To show that Gaussian quadrature and the bounds of Harris are identical, we will show that the quadrature rules are unique. For all $m > 1$ and $0 < x_1 < x_2 < \cdots < x_m < \infty$, the matrix

\begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_m \\ \vdots & & & \vdots \\ x_1^{m-1} & x_2^{m-1} & \cdots & x_m^{m-1} \end{pmatrix}

is nonsingular. Therefore the system $\{x^k\}_{k \ge 0}$ constitutes a Chebyshev system over the positive reals, and the Gaussian quadrature rule for any $K > 0$ exists and is unique [60, 77].

The error of the $K$-point Gaussian quadrature estimate with weights $w_1, \ldots, w_K$ and points $x_1, \ldots, x_K$ applied to the function $f(x)$ is given by

\int_0^\infty f(x) \, d\nu(x) - \sum_{k=1}^{K} w_k f(x_k) = \frac{f^{(2K)}(y)}{(2K)!} \int_0^\infty \prod_{k=1}^{K} (x - x_k)^2 \, d\nu(x)

for some $0 < y < \infty$ [42]. Since the function $1/x$ has strictly positive derivatives of all even orders, the error is positive. Therefore the $K$-point Gaussian quadrature estimate is a lower bound. Simple calculation shows that the single point bound ($n_1^2 / 2n_2$) and the two point bound are identical to those calculated by Harris, the former later rediscovered by Chao [15].

Figure 6.1: Class counts follow a Negative Binomial distribution with mean 1 and shape $r = 1$, and $d\mu(x)$ is a Gamma distribution with shape parameter 1 and scale parameter 1. (a) Discrete approximations determined by the Gaussian quadrature rules and (b) quadrature estimates when using the expected counts for 10 million distinct classes and 5 million unobserved classes.

In the interpretation of Harris, we are approximating $\nu$ with a discrete distribution based on the observed moments. Since $d\nu(x) \propto x e^{-x} d\mu(x)$, the supports of $\nu$ and $\mu$ are identical. We can therefore interpret the quadrature rules as a discrete approximation to $\nu$ (Figure 6.1a).
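As a quick numerical illustration of the lower-bound property (not from the thesis), take the measure with density $x e^{-x}$ on $(0, \infty)$ as a toy stand-in for $\nu$; its total mass is $\Gamma(2) = 1$ and $\int_0^\infty (1/x)\, x e^{-x}\, dx = 1$. The $K$-point Gaussian rules for this weight, available in SciPy, approach the true value from below.

```python
import numpy as np
from scipy.special import roots_genlaguerre

# Gauss rules for the weight x * exp(-x); the weights already sum to 1.
for K in range(1, 6):
    x, w = roots_genlaguerre(K, 1)
    estimate = np.sum(w / x)     # K-point estimate of integral of 1/x
    print(K, estimate)           # increases toward the true value 1
```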
6.2 Calculating Gaussian quadrature rules

6.2.1 Monic orthogonal polynomials and the associated three term recurrence

Given the measure $\nu$, define the inner product for integrable functions $p(x)$ and $q(x)$,

(p, q) = \int_0^\infty p(x)\, q(x) \, d\nu(x).

This also naturally defines the norm $\|p\|^2 = \int_0^\infty p^2(x)\, d\nu(x)$. Associated with the measure $\nu$ is a sequence of monic polynomials $\{\pi_l(x),\ l = 0, 1, \ldots\}$ such that $\pi_l(x) = x^l + \text{lower order terms}$. These polynomials satisfy a three term recurrence

\pi_{l+1}(x) = (x - \alpha_l)\, \pi_l(x) - \beta_l\, \pi_{l-1}(x)    (6.5)

with initial conditions $\pi_0(x) = 1$ and $\pi_{-1}(x) = 0$ [38]. Additionally $\alpha_l = (x\,\pi_l(x), \pi_l(x)) / \|\pi_l\|^2$ and $\beta_l = \|\pi_l\|^2 / \|\pi_{l-1}\|^2$, with the convention that $\beta_0 \equiv 1$. These sequences allow us to construct the positive-definite symmetric tridiagonal Jacobi matrix associated with $\nu$,

J = \begin{pmatrix} \alpha_0 & \sqrt{\beta_1} & 0 & 0 & \cdots \\ \sqrt{\beta_1} & \alpha_1 & \sqrt{\beta_2} & 0 & \cdots \\ 0 & \sqrt{\beta_2} & \alpha_2 & \sqrt{\beta_3} & \cdots \\ & & \ddots & \ddots & \ddots \end{pmatrix}.

We shall denote the square Jacobi matrix truncated to an $(m+1) \times (m+1)$ matrix by $J_m$.

6.2.2 Convergence of quadrature estimates

The sequence $\eta_0, \eta_1, \ldots$ is a non-negative definite sequence. That is, for any $x_0, x_1, \ldots$ the sum $\sum_{l,k=0}^{m} \eta_{l+k}\, x_l x_k \ge 0$ for all integers $m$. This is equivalent to the non-negative definiteness of the associated Hankel matrices

\begin{pmatrix} \eta_0 & \eta_1 & \cdots & \eta_m \\ \eta_1 & \eta_2 & \cdots & \eta_{m+1} \\ \vdots & & & \vdots \\ \eta_m & \eta_{m+1} & \cdots & \eta_{2m} \end{pmatrix}.    (6.6)

If $\nu$, and equivalently $\mu$, is discrete with $M$ support points then the Hankel matrix of dimension $2M$ is singular and the orthogonal polynomials exist only up to degree $M - 1$ [98]. Additionally, if $\nu$ is assumed to have no weight on zero, i.e. support strictly in $(0, \infty)$, then the quadrature rules with $M$ points will be exact.

If $\nu$ is assumed to have infinite support then the Hankel matrices, equation (6.6), are positive definite for any $m > 0$. Furthermore, the orthogonal polynomials will exist for all orders. In this case we may be concerned with convergence of the quadrature estimates to the true integral (6.2) when the expected moments of $\nu$ are observed. We assume that $L$ is finite so that the integral (6.2) is finite.

To ensure convergence we need the moment problem associated with $\nu$ to be determinate, i.e. $\nu$ is uniquely determined by its moments $\{\eta_m\}_{m \ge 0}$. It is clear that $E(n_m) \le S$, and we can bound $\eta_m$ by $C(m+1)!$ for some constant $C$ that depends only on $S$ and $\mu$. This implies that the growth of the moments is not too severe, and since $\nu$ has infinite support, inherited from $\mu$, the moments do not vanish and are not infinite. Therefore the moment problem associated with $\nu$ is determinate [93]. This implies the convergence of integrals of bounded functions [3]. If we take the convention that $f(x) = 1/x$ for $x \in (0, \infty)$ and $f(x) = 0$ otherwise, then the quadrature estimates converge to the true $n_0$ if the expected moments are observed [27] (Figure 6.1b).

6.2.3 Analytic example: Negative Binomial counts

We want a case where the three term recurrences are known and do not need to be constructed numerically, so that we can evaluate our algorithmic approaches. Suppose that $\mu$ is a Gamma distribution with shape parameter $r$ and scale parameter $p/(1-p)$, i.e.

d\mu(x) = \frac{1}{\big(\tfrac{p}{1-p}\big)^{r}\, \Gamma(r)} \, x^{r-1} e^{-x \frac{1-p}{p}} \, dx.

This implies that the number of observations from a random class follows a Negative Binomial distribution with success probability $p$ and size $r$. In this case $d\nu(x)$ will be proportional to $x^r e^{-\theta x}$ with $\theta = 1/p$.

We note the similarity of the above measure to the measure $x^r e^{-x}$ that generates the monic associated Laguerre polynomials [98]. This allows us to make an intelligent guess as to the form of the monic polynomials orthogonal to $x^r e^{-\theta x}$ and the associated three term recurrence, given in lemmas 2 and 3 in appendix A.2. The monic polynomials are given by

L^{(r)}_n(x) = \sum_{l=0}^{n} \binom{n+r}{n-l} \frac{n!}{l!}\, \theta^{l-n} (-1)^{n+l} x^l,

satisfying the three term recurrence (6.5) with

\alpha_n = \frac{2n + 1 + r}{\theta} \quad \text{and} \quad \beta_n = \frac{(n+r)\,n}{\theta^2}.

6.2.4 Calculating quadrature rules from the Jacobi matrix

Given the truncated Jacobi matrix $J_K$, the points $x_1, \ldots, x_K$ are the eigenvalues of $J_K$ and the weights $w_1, \ldots, w_K$ are the squares of the first entries of the corresponding eigenvectors [43]. The QR algorithm provides a way to estimate both the eigenvalues and eigenvectors simultaneously [109]. Since $J_K$ is already tridiagonal, the usual Householder transformations are not required. Furthermore, the full eigenvector is not needed, so we can simply update $J_K$ and the first entry of each eigenvector at each iteration, without storing the full vectors, giving a linear time algorithm (algorithm 2) [43].

6.2.5 Calculating the three term recurrence

The process of mapping the moments to the three term recurrence coefficients is known to be generally ill-conditioned [37]. Small variations in the moments can lead to large variations in the estimated three term recurrence.
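Before turning to the algorithms, a quick numerical illustration of this ill-conditioning (using the moments $\eta_l = l!$ of the unit exponential as a stand-in, not data from the thesis) shows how quickly the condition number of the Hankel matrix (6.6) grows with the number of moments used.

```python
import numpy as np
from math import factorial
from scipy.linalg import hankel

# Moments of the unit-exponential measure, eta_l = l!, as a stand-in.
for K in (3, 5, 7, 9):
    eta = [float(factorial(l)) for l in range(2 * K - 1)]
    H = hankel(eta[:K], eta[K - 1:])     # K x K Hankel matrix of moments
    print(K, np.linalg.cond(H))          # condition number grows rapidly
```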
Golub & Welsch [43] give an algo- rithm with time complexity cubic in the number of moments based on a Cholesky decomposition of the Hankel matrix given in equation (6.6). Hankel matrices are known to have spectral condition number bounded below by 2 K6 [103]. This method is accu- rate for smallK but is unstable for larger values ofK due to the random nature of the estimated moments. More general methods have been suggested that rely upon a transformation of the known moments to so called modified moments [110, 89, 36]. The modified moments are calculated by evaluating orthogonal polynomials to a known measure, hopefully related to the unknown distribution, on the unknown measure. The simplest of these algorithms is the modified Chebyshev algorithm (algorithm 3) [42], that has a time complexity that is quadratic in the number of moments. 56 Input: Tridiagonal matrix with positive~ a = (a 0 ;:::;a m ) on diagonal and positive ~ b = (b 1 ;:::;b m ) on the off-diagonals and numerical tolerance"> 0 Output: Eigenvalues, 0 ;:::; m and first entries of the eigenvectors,z 0 ;:::;z m . 1: ~ z 1 0 (1; 0;:::; 0) 2: ~ 1 ~ a 3: while P m k=1 b l k >"do 4: d l 0 b l 1 5: b l 0 l 0 6: l 0 l 0 7: ~ b l 1 b l 1 8: z l 0 z l 0 9: fork = 0; 1;:::;m 1do 10: sin l k d l k1 = q d l k1 2 + b l k1 2 11: cos l k b l k1 = q d l k1 2 + b l k1 2 12: l+1 k l k cos l k 2 + 2 ~ b l k cos l k sin l k + l k+1 sin l k 2 13: l k+1 l k sin l k 2 2 ~ b l k cos l k sin l k + l k+1 cos l k 2 14: ifk> 0then 15: b k q d l k 2 + b l k 2 16: endif 17: b l k+1 a l k a l k+1 cos l k sin l k + ~ b l k sin l k 2 cos l k 2 18: ~ b l k+1 b l k+1 cos l k 19: d l k+1 b l k+1 sin l k 20: z l k z l k cos l k +z l1 k sin l k 21: z l k+1 z l k sin l k z l1 k cos l k 22: endfor 23: l m l m 24: b l m b l m 25: z l m z l m 26: l l + 1 27: endwhile Algorithm2: Golub & Welsch’s [43] linear time QR algorithm The modified Chebyshev algorithm takes as input a known measure and the asso- ciated monic orthogonal polynomials,fp k (x)g k0 with known three term recurrence fa k g k0 andfb k g k0 . Define the modified momentsm k = R 1 0 p k (x)d(x) fork 0. 57 Input: Modified moments,m 0 ;m 1 ;:::;m K1 of a known measure against the unknown measure. Output: Estimated three term recurrence 0 ;:::; K1 and 1 ;:::; K1 . 1: 0 a 0 +m 1 =m 0 , 2: 0 m 0 3: forl = 1; 2;:::; 2K 2do 4: 1;l 0 5: forl = 0; 1;:::; 2K 1do 6: 0;l m l 7: endfor 8: endfor 9: fork = 1;:::;K 1do 10: forl =k;k + 1;:::; 2Kk 1do 11: k;l k1;l+1 ( k1 a l ) k1;l k1 k2;l +b l k1;l1 12: endfor 13: k a k + k;k+1 k;k k1;k k1;l1 14: k k;k k1;k1 15: endfor Algorithm3: The modified Chebyshev algorithm These can be calculated as simply the sum of the coefficients ofp k (x) times the corre- sponding moment of. If the modifying monic orthogonal polynomials are chosen to be the trivial powers of x, i.e.p k (x) =x k , then the associated three term recurrence is given by~ a = (1; 0; 0;:::) and ~ b = (0; 0;:::). The modified moments are simply the moments of. By applying algorithm 3 above one will obtain the original Chebyshev algorithm [101], which we will refer to as the unmodified Chebyshev algorithm. Using the unmodified Chebyshev algorithm we can calculate 0 = 2n 2 =n 1 . Since J 1 is a 1-dimensional matrix with entry and eigenvalue equal to 0 , the quadrature rule for a single point will put all weight on the point 0 . This gives the one point quadrature rule asn 1 = 0 =n 2 1 =2n 2 , that is exactly the lower bound given by Chao [15]. 
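Putting the pieces of this section together, the sketch below (illustrative only, not the thesis software; it performs no bootstrapping or variance stabilization and assumes the Hankel matrix of the estimated moments is positive definite) maps a counts histogram to a $K$-point quadrature lower bound on the expected number of unobserved classes: moment estimates, the unmodified Chebyshev recursion, the Jacobi matrix, and its eigendecomposition.

```python
import numpy as np
from math import factorial

def quadrature_lower_bound(counts_hist, K):
    """K-point quadrature lower bound on the number of unobserved classes.

    Steps: (i) moment estimates eta_m = (m+1)! n_{m+1} / n_1; (ii) the
    unmodified Chebyshev recursion mapping moments to the three term
    recurrence (alpha, beta); (iii) the Jacobi matrix and its eigen-
    decomposition, giving points x_k and weights w_k; (iv) the estimate
    n_1 * sum_k w_k / x_k, as in equation (6.2)."""
    n1 = counts_hist.get(1, 0)
    eta = [factorial(m + 1) * counts_hist.get(m + 1, 0) / n1 for m in range(2 * K)]

    # Chebyshev recursion: sigma arrays hold mixed moments of the monic polynomials.
    alpha = [eta[1] / eta[0]]
    beta = [eta[0]]
    sigma_prev = [0.0] * (2 * K)       # sigma_{-1, l}
    sigma_curr = list(eta)             # sigma_{0, l}
    for k in range(1, K):
        sigma_next = [0.0] * (2 * K)
        for l in range(k, 2 * K - k):
            sigma_next[l] = (sigma_curr[l + 1]
                             - alpha[k - 1] * sigma_curr[l]
                             - beta[k - 1] * sigma_prev[l])
        alpha.append(sigma_next[k + 1] / sigma_next[k]
                     - sigma_curr[k] / sigma_curr[k - 1])
        beta.append(sigma_next[k] / sigma_curr[k - 1])
        sigma_prev, sigma_curr = sigma_curr, sigma_next

    # Jacobi matrix: alpha on the diagonal, sqrt(beta_1..) on the off-diagonals.
    J = (np.diag(alpha)
         + np.diag(np.sqrt(beta[1:]), 1)
         + np.diag(np.sqrt(beta[1:]), -1))
    x, V = np.linalg.eigh(J)           # points are the eigenvalues
    w = V[0, :] ** 2                   # weights are squared first entries (eta_0 = 1)
    return n1 * np.sum(w / x)

# toy histogram; K = 1 reproduces the Chao lower bound n_1^2 / (2 n_2)
hist = {1: 5000, 2: 1800, 3: 700, 4: 300, 5: 120, 6: 50, 7: 20}
for K in (1, 2, 3):
    print(K, quadrature_lower_bound(hist, K))
```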
58 0.99 1 1.01 1.9 2 2.1 1 3 5 −20 0 20 0.99 1 1.01 1.9 2 2.1 1 3 5 −20 0 20 2 3 4 2 3 4 2 4 6 8 2 4 6 8 number of points number of points a c b d 1 0.987 0.195 1 1 0.732 Figure 6.2: For one thousand samples of ten million classes with counts Negative Bino- mially distributed with mean 1 and size 1, first four estimated values of using modified Chebyshev algorithm with the true measure as the modifying measure (a) and unmodi- fied Chebyshev (b) along with the corresponding quadrature estimates and their success rate (c andd). Dashed lines indicate the true value. . Choiceofmodifyingmeasure The choice of the modifying measure is extremely important in the modified Chebyshev algorithm to reduce the ill-conditioning in estimating the three term recurrence [38]. Theoretically, if the modifying measure is exactly the unknown measure then the modified Chebyshev algorithm recovers the unknown measure exactly. Therefore, it is suggested that the modifying measure be chosen such that it is close to the unknown measure with the same support [7]. The effect of the choice of the modifying measure with stochastic observed moments in estimating the three term recurrence is unknown. To study the effectiveness of the modifying measure we take 1000 independent samples of one million Negative Bino- mially distributed counts with mean = 1 and size = 1. The three term recurrence is computed using the unmodified Chebyshev algorithm and the modified Chebyshev algo- rithm, with the modifying measure being the true measure (d/ xe 2x ) and the true 59 three term recurrence given in lemma 3. The first twos computed are identical, but for larger terms the modified Chebyshev algorithm exhibits more instability and variance (Figure 6.2a andb). This instability can only be exacerbated when the true distribution is significantly different from the modifying measure. The instability of calculating the three term recurrence leads to instability in calculating the quadrature rules. Since every term in the three term recurrence must be positive, we will truncate the three term recurrence to right before a negative term in~ or ~ . Additionally, small values of~ and ~ lead to smaller eigenvalues of the Jacobi matrix, that will be inverted in the approximated inte- gral. Therefore extremely small values will translate into extremely large values in the estimated integral. We see that the performance of the unmodified Chebyshev algorithm is superior to the modified Chebyshev, even when the true measure is known, leading to more stable estimates (Figure 6.2c andd). 6.2.6 Baggingandvariancestabilization We would like to bootstrap the observed counts histogram to minimize the effect of the ill-conditioning problem in estimating the three term recurrence. We first note that since the quadrature estimate is a linear combination of one over the estimated quadrature points, it is mostly determined by the smallest point, x 1 , in the quadrature rules. The points are the eigenvalues of the Jacobi matrix and the sum of the eigenvalues is equal to the trace of the Jacobi matrix, that is equal to the sum of the estimateds. Obviously the exact relationship is highly non-linear, but this indicates an approximately inverse relationship between the estimated ~ and the quadrature estimate for the number of unobserved. This leads to the extreme variance we observe in the quadrature estimates (Figure 6.2 c and d). 
To reduce the effect of the ill-conditioning of the estimated three 60 3 4 5 3 4 5 0 2 4 6 8 10 0 2 4 6 8 10 number of points number of unobserved (millions) Figure 6.3: Boxplot of bootstrapped arithmetic mean (a) and geometric mean (b) from 100 independent samples of 10 million Negative Binomial(1; 1) distributed counts term recurrence, we propose bootstrapping the counts histogram and taking the geomet- ric mean of the bootstrapped quadrature estimates to improve the final estimate for the number of unobserved classes. To test bootstrapping taking the geometric mean to reduce variance we take 100 inde- pendent samples of 10 million Negative Binomial counts with mean equal to 1 and size equal to 1. The histogram was bootstrapped 100 times and we compare the mean and logarithmic mean of the bootstrapped quadrature estimates (Figure 6.3). We find that taking the geometric mean significantly improves the quadrature estimates. It dimin- ishes the effect of the ill-conditioning in estimating the three term recurrence, reducing the standard deviation by up to 88% (for 4 points). Furthermore we see the distribution of the quadrature estimates is significantly closer to the theoretical quadrature estimate with the geometric mean than the arithmetic mean. 61 6.3 Comparing lower bounds for a simulated sequenc- ingexperiment Our interest in this problem originated in estimating transcriptomic diversity, the num- ber of distinct transcripts that are expressed in a cell or cell type. Recent deep and targeted RNA sequencing experiments [82, 50] have revealed the presence of rare func- tional transcripts that has brought into question our understanding of transcriptomic diversity [21, 58]. Extremely deep sequencing may be required to identify rare transcripts, that will require large resources. Instead we can investigate transcriptional diversity quantita- tively and estimate the total number of transcripts. This immediately brings the parallel to the species diversity problem and we shall apply it to a simulated RNA sequencing experiment to evaluate the performance of many different estimators, including the one presented here. A major difference is that most of these estimators were derived in rela- tively small experiments, with hundreds or maybe thousands of observations while RNA sequencing is in the millions. 6.3.1 SimulatingaRNA-seqexperimentwithknownproperties To simulate the RNA sequencing experiment we use the same method described in sec- tion 5.2. We take an ultra deep RNA sequencing experiment, 1,196,520,129 single end 101 nucleotide mRNA-seq reads from a mouse tumor mapped with Tophat2 version 2.0.6 [62] to the mouse genome assembly mm9. This gives 1,021,240,281 uniquely mapped reads. We investigate two quantities: the number of distinct reads as deter- mined by start position in the genome and the number of observed exons with at least one read overlapping them. Exon annotation was from the UCSC genome browser (https://genome.ucsc.edu/). Note that one read can overlap more than one 62 exon due to exon-exon junctions [102]. For each simulation we sample reads with replacement from the unmapped library, map them, and determine the exon overlaps with BEDTools [87]. The library contains 52,787,667 distinct reads with 20,812,470 singletons. The sin- gletons will be the lowest abundance reads in our simulations, with each having relative abundance of approximately one in a billion. 
The most represented read is observed 2,100,157 times in the original experiment, so that its relative abundance is approxi- mately 0:2% of the simulated library. We see this simulated library is extremely non- uniform with a coefficient of variation of 30:74. Many methods may have problems with such a large deviation from uniformity as they were derived and tested on far more homogeneous simulated experiments, i.e. the largest coefficient of variation tested in a recent paper [20] is 4:2. The total number of exon overlaps by the reads is 1,626,666,187 with 6,464 exons having a single read overlap. The highest frequency exon has 56,447,296 reads over- lapping it, giving it a relative abundance of approximately 3:5%. Despite the larger difference between the highest and least represented class, the coefficient of variation (16:87) is smaller than when considering distinct reads. 6.3.2 Comparingestimates We compare estimates and lower bounds for the total number of classes with 11 methods given in table 6.1. Recall that Chao1984 corresponds exactly to the one point quadrature estimate. All of the non-quadrature based estimators were implemented with default options using the R package SPECIES [106], excluding iChao2014, that we imple- mented in R, and BungeWillis, implemented in the R package breakaway. We simulated 100 independent experiments for each of 1, 2, and 5 million (M) sequenced reads. The exact number of observations in each experiment is random due to 63 Name Description Quad3 3 points quadrature Quad4 4 point quadrature Quad5 5 points quadrature Chao1984 lower bound of Chao [15] iChao2014 recent improved version of Chao’s lower bound [15] ChaoBunge Gamma-Poisson based coverage estimator of Chao & Bunge [17] ChaoLee coverage based estimator of Chao & Lee [18] Jackknife jackknife estimator of Burnham & Overton [14] pNPMLE penalized non-parametric MLE of Wang & Lindsay [107] NPMLE non-parametric MLE of Norris & Pollock [85] BungeWillis recent frequency ratio regression estimator of Bunge & Willis [111] Table 6.1: Methods for estimating the number of unobserved classes unmappable reads in the original library and the random nature of exon overlap. Along with the average estimate of the total number of distinct reads and exons contained in the library, the standard error (se) and the root mean square error (RMSE) are used to com- pare methods. We shall refer to the problem of estimating the total number of distinct reads as the reads problem and the problem of estimating the total number of expressed exons as the exon problem. First we observe that ChaoBunge sometimes gives negative estimates. This was previously observed as possibly occurring in highly heterogeneous populations [111]. This occurs in every single case for the more heterogeneous read counts (Figure 6.4), but in none for the exon counts (Figure 6.5). The reason for this behavior is not clear, but this makes application of the ChaoBunge estimator problematic. In both the covered exons and the distinct reads cases the quadrature esti- mates improve significantly upon the other moment based estimators, Chao1984 and iChao2014. We expect to improve significantly on the Chao1984 estimates since they correspond to the 1 point quadrature estimates. What is surprising is the dramatic improvement of even Quad3 over iChao2014, with an increase of over 50% for all initial 64 sample sizes in the reads problem. 
The cost of this improvement is significantly larger standard error, but the improvement is large enough to improve the RMSE. In the reads problem the most accurate in all initial sample sizes is Quad5, but at a cost of being one of the most variable estimates. This still results in it having the lowest RMSE of all estimators. In the exons problem we see that the best estimator, both in terms of the closest average estimate and RMSE, in the 1M initial sample case is pNPMLE. The pNPMLE also has good performance for the 2M initial sample, losing only to Quad4 in terms of average estimate and RMSE. In the 5M initial sample, though, the pNPMLE is outper- formed by all the quadrature estimators. We also see in this case the instability problems of the NPMLE that prompted the development of the penalized NPMLE [107]. Overall the quadrature estimators are among the best, if not the best, performing estimators in terms of accuracy and RMSE. Moving from 3 to 5 points results in signif- icantly more variation in the estimates, sometimes an order of magnitude large, but this typically results in more accurate estimates. It seems that the higher point quadrature estimators perform better with more heterogeneity, as there is more information in the larger moments and frequencies. 65 Method 1M mean 1M sd 1M RMSE 2M mean 2M sd 2M RMSE 5M mean 5M sd 5M RMSE Chao1984 3.004E+06 1.179E+04 4.978E+07 4.406E+06 1.195E+04 4.838E+07 7.161E+06 1.332E+04 4.563E+07 iChao2014 3.252E+06 1.210E+04 4.954E+07 4.811E+06 1.218E+04 4.798E+07 7.873E+06 1.432E+04 4.491E+07 ChaoBunge -3.016E+05 2.720E+03 5.309E+07 -6.358E+05 3.718E+03 5.342E+07 -1.755E+06 6.867E+03 5.454E+07 ChaoLee 6.791E+06 4.624E+04 4.600E+07 9.402E+06 3.935E+04 4.339E+07 1.386E+07 3.536E+04 3.893E+07 Jackknife 2.675E+06 3.944E+03 5.011E+07 4.513E+06 5.617E+03 4.828E+07 8.466E+06 9.894E+03 4.432E+07 pNPMLE 5.596E+06 3.585E+05 4.719E+07 8.641E+06 4.034E+05 4.415E+07 1.475E+07 5.960E+05 3.804E+07 NPMLE 6.601E+06 1.967E+06 4.623E+07 9.553E+06 1.091E+06 4.325E+07 1.572E+07 1.050E+06 3.708E+07 WillisBunge 2.798E+06 2.559E+06 5.005E+07 6.108E+06 1.969E+07 5.062E+07 6.484E+06 1.285E+07 4.804E+07 Quad3 5.743E+06 9.081E+05 4.705E+07 7.966E+06 6.249E+05 4.483E+07 1.235E+07 4.401E+05 4.044E+07 Quad4 6.904E+06 1.314E+06 4.590E+07 9.975E+06 1.868E+06 4.285E+07 1.611E+07 3.198E+06 3.682E+07 Quad5 8.993E+06 2.620E+06 4.387E+07 1.358E+07 3.650E+06 3.938E+07 2.000E+07 4.559E+06 3.310E+07 Figure 6.4: Comparison of estimators of the total number of reads using average estimates, standard error, and root mean square error for initial sample sizes of 1M, 2M, and 5M reads. 
66 Method 1M mean 1M sd 1M RMSE 2M mean 2M sd 2M RMSE 5M mean 5M sd 5M RMSE Chao1984 1.291E+05 6.861E+02 8.878E+04 1.451E+05 5.742E+02 7.276E+04 1.628E+05 1.861E+03 5.51E+04 iChao2014 1.345E+05 7.763E+02 8.339E+04 1.495E+05 6.544E+02 6.836E+04 1.661E+05 1.769E+03 5.18E+04 ChaoBunge 1.372E+05 8.620E+02 8.061E+04 1.476E+05 5.974E+02 7.028E+04 1.625E+05 1.570E+03 5.54E+04 ChaoLee 1.327E+05 6.547E+02 8.513E+04 1.456E+05 5.107E+02 7.224E+04 1.616E+05 1.671E+03 5.62E+04 Jackknife 1.561E+05 2.920E+03 6.180E+04 1.677E+05 2.986E+03 5.022E+04 1.806E+05 2.906E+03 3.74E+04 pNPMLE 1.614E+05 9.886E+03 5.726E+04 1.673E+05 1.042E+04 5.158E+04 1.755E+05 7.533E+03 4.31E+04 NPMLE 2.099E+07 1.186E+08 1.198E+08 4.952E+06 4.747E+07 4.747E+07 1.872E+05 8.005E+04 8.54E+04 WillisBunge 1.457E+05 1.062E+04 7.291E+04 1.634E+05 2.539E+04 6.001E+04 1.724E+05 1.322E+04 4.73E+04 Quad3 1.533E+05 2.071E+04 6.776E+04 1.660E+05 1.855E+04 5.506E+04 1.774E+05 1.077E+04 4.19E+04 Quad4 1.597E+05 1.880E+04 6.105E+04 1.719E+05 2.041E+04 5.021E+04 1.833E+05 1.285E+04 3.68E+04 Quad5 1.563E+05 4.729E+04 7.751E+04 1.664E+05 2.944E+04 5.917E+04 1.845E+05 4.280E+04 5.41E+04 Figure 6.5: Comparison of estimators of the total number of expressed exons using average estimates, standard error, and root mean square error for initial sample sizes of 1M, 2M, and 5M reads. 67 ReferenceList [1] An Iintroduction to Next-Generation Sequencing Technology. http: //res.illumina.com/documents/products/illumina_ sequencing_introduction.pdf. Accessed: August 31, 2014. [2] Estimating Sequencing Coverage. http://res.illumina.com/ documents/products/technotes/technote_coverage_ calculation.pdf. Accessed: August 5, 2014. [3] Naum Akhiezer. The classical moment problem: and some related questions in analysis, volume 5. Oliver & Boyd London, 1965. [4] George A Baker Jr. Defects and the convergence of Pad´ e approximants. Acta Applicandae Mathematica, 61(1-3):37–52, 2000. [5] George A Baker Jr and Peter Graves-Morris. Pad´ e approximants, volume 59. Cambridge University Press, 1996. [6] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. Spades: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, 2012. [7] Bernhard Beckermann and Emmanuel Bourreau. How to choose modified moments? Journal of computational and applied mathematics, 98(1):81–98, 1998. [8] Jennifer Benichou, Rotem Ben-Hamo, Yoram Louzoun, and Sol Efroni. Rep- seq: uncovering the immunological repertoire through next-generation sequenc- ing. Immunology, 135(3):183–191, 2012. [9] Peter J Bickel and Joseph A Yahav. On estimating the total probability of the unobserved outcomes of an experiment. Lecture Notes-Monograph Series, pages 332–337, 1986. 68 [10] Paul C Blainey. The future is now: single-cell genomics of bacteria and archaea. FEMS microbiology reviews, 2013. [11] Gertrude Blanch. Numerical evaluation of continued fractions. Siam Review, 6(4):383–421, 1964. [12] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996. [13] John Bunge and M Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993. [14] Kenneth P Burnham and Walter Scott Overton. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika, 65(3):625–633, 1978. [15] Anne Chao. 
Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics, pages 265–270, 1984. [16] Anne Chao. Estimating the population size for capture-recapture data with unequal catchability. Biometrics, 43(4):783–791, 1987. [17] Anne Chao and John Bunge. Estimating the number of species in a stochastic abundance model. Biometrics, 58(3):531–539, 2002. [18] Anne Chao and Shen-Ming Lee. Estimating the number of classes via sample coverage. Journal of the American statistical Association, 87(417):210–217, 1992. [19] Herman Chernoff and Stanley Reiter. Selection of a distribution function to min- imize an expectation subject to side conditions, 1954. [20] Chun-Huo Chiu, Yi-Ting Wang, Bruno A Walther, and Anne Chao. An improved nonparametric lower bound of species richness via a modified Good–Turing fre- quency formula. Biometrics, 2014. [21] Michael B Clark, Paulo P Amaral, Felix J Schlesinger, Marcel E Dinger, Ryan J Taft, John L Rinn, Chris P Ponting, Peter F Stadler, Kevin V Morris, Antonin Morillon, et al. The reality of pervasive transcription. PLoS biology, 9(7):e1000625, 2011. [22] Robert K Colwell, Chang Xuan Mao, and Jing Chang. Interpolating, extrapo- lating, and comparing incidence-based species accumulation curves. Ecology, 85(10):2717–2727, 2004. [23] Ralph B D’Agostino. Transformation to normality of the null distribution of g1. Biometrika, 57(3):679–681, 1970. 69 [24] Timothy Daley and Andrew D Smith. Predicting the molecular complexity of sequencing libraries. Nature methods, 10(4):325–327, 2013. [25] Timothy Daley and Andrew D Smith. Modeling genome coverage in single cell sequencing. Bioinformatics, 2014. [26] Charles S Davis. The computer generation of multinomial random variates. Com- putational statistics & data analysis, 16(2):205–217, 1993. [27] Philip J Davis and Philip Rabinowitz. Ignoring the singularity in approximate integration. Journal of the Society for Industrial & Applied Mathematics, Series B: Numerical Analysis, 2(3):367–383, 1965. [28] Philip J Davis and Philip Rabinowitz. Methods of numerical integration. Courier Dover Publications, 2007. [29] Thomas Derrien, Jordi Estell´ e, Santiago Marco Sola, David G Knowles, Emanuele Raineri, Roderic Guig´ o, and Paolo Ribeca. Fast computation and appli- cations of genome mappability. PloS one, 7(1):e30377, 2012. [30] Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21, 2013. [31] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976. [32] Thore Egeland and Antonio Salas. Estimating haplotype frequency and coverage of databases. PloS one, 3(12):e3988, 2008. [33] Stefano Favaro, Antonio Lijoi, and Igor Pr¨ unster. A new estimator of the discov- ery probability. Biometrics, 68(4):1188–1196, 2012. [34] Ronald Aylmer Fisher, A Steven Corbet, and Carrington B Williams. The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, pages 42–58, 1943. [35] Mark Galassi and Brian Gough. Gnu scientific library: reference manual, 2006. [36] Walter Gautschi. On the construction of Gaussian quadrature rules from modified moments. Math. Comp, 24(110):245–260, 1970. [37] Walter Gautschi. On generating orthogonal polynomials. 
SIAM Journal on Sci- entific and Statistical Computing, 3(3):289–317, 1982. 70 [38] Walter Gautschi. Orthogonal Polynomials: Computation and Approximation, Numerical Mathematics and Scientific Computation Series. Oxford University Press, Oxford, 2004. [39] Jacek Gilewicz and Yuri Kryakin. Froissart doublets in Pad´ e approximation in the case of polynomial noise. Journal of computational and applied mathematics, 153(1):235–242, 2003. [40] Jacek Gilewicz and Maciej Pindor. Pad´ e approximants and noise: rational func- tions. Journal of computational and applied mathematics, 105(1):285–297, 1999. [41] Jeff Gole, Athurva Gore, Andrew Richards, Yu-Jui Chiu, Ho-Lim Fung, Diane Bushman, Hsin-I Chiang, Jerold Chun, Yu-Hwa Lo, and Kun Zhang. Massively parallel polymerase cloning and genome sequencing of single cells using nano- liter microwells. Nature biotechnology, 2013. [42] Gene H Golub and G´ erard Meurant. Matrices, moments and quadrature with applications. Princeton University Press, 2010. [43] Gene H Golub and John H Welsch. Calculation of Gauss quadrature rules. Math. Comp, 23(106):221–230, 1969. [44] IJ Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953. [45] IJ Good. The use of divergent series in statistical problems. Journal of statistical computation and simulation, 34(2-3):164–164, 1990. [46] IJ Good. Turing’s anticipation of empirical Bayes in connection with the crypt- analysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–111, 2000. [47] IJ Good and GH Toulmin. The number of new species, and the increase in popu- lation coverage, when a sample is increased. Biometrika, 43(1-2):45–63, 1956. [48] Roy G Gordon. Error bounds in equilibrium statistical mechanics. Journal of Mathematical Physics, 9(5):655–663, 1968. [49] Stefan Gr¨ oschel, Mathijs A Sanders, Remco Hoogenboezem, Elzo de Wit, Britta AM Bouwman, Claudia Erpelinck, Vincent HJ van der Velden, Mar- ije Havermans, Roberto Avellino, Kirsten van Lom, et al. A single onco- genic enhancer rearrangement causes concomitant evi1 and gata2 deregulation in leukemia. Cell, 157(2):369–381, 2014. 71 [50] Matthew J Hangauer, Ian W Vaughn, and Michael T McManus. Pervasive tran- scription of the human genome produces thousands of previously unidentified long intergenic noncoding rnas. PLoS genetics, 9(6):e1003569, 2013. [51] Godfrey Harold Hardy. Divergent series, volume 334. American Mathematical Soc., 1991. [52] Bernard Harris. Determining bounds on integrals with applications to cataloging problems. The Annals of Mathematical Statistics, 30(2):521–548, 1959. [53] Peter Henrici. The quotient-difference algorithm. NBS Appl. Math. Ser, 49:23– 46, 1958. [54] Emily Hodges, Antoine Molaro, Camila O Dos Santos, Pramod Thekkat, Qiang Song, Philip J Uren, Jin Park, Jason Butler, Shahin Rafii, W Richard McCom- bie, et al. Directional DNA methylation changes and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment. Molecu- lar cell, 44(1):17–28, 2011. [55] Hajo Holzmann, Axel Munk, and Walter Zucchini. On identifiability in capture– recapture models. Biometrics, 62(3):934–936, 2006. [56] Seiyu Hosono, A Fawad Faruqi, Frank B Dean, Yuefen Du, Zhenyu Sun, Xiao- hong Wu, Jing Du, Stephen F Kingsmore, Michael Egholm, and Roger S Lasken. Unbiased whole-genome amplification directly from clinical samples. Genome Research, 13(5):954–964, 2003. [57] Iuliana Ionita-Laza, Christoph Lange, and Nan M Laird. 
Estimating the number of unseen variants in the human genome. Proceedings of the National Academy of Sciences, 106(13):5008–5013, 2009. [58] Young Seok Ju, Jong-Il Kim, Sheehyun Kim, Dongwan Hong, Hansoo Park, Jong-Yeon Shin, Seungbok Lee, Won-Chul Lee, Sujung Kim, Saet-Byeol Yu, et al. Extensive genomic and transcriptional diversity identified through mas- sively parallel DNA and RNA sequencing of eighteen korean individuals. Nature genetics, 43(8):745–752, 2011. [59] Samuel Karlin and Lloyd S Shapley. Geometry and moment spaces. 1952. [60] Samuel Karlin and William J Studden. Tchebycheff systems: With applications in analysis and statistics, volume 376. Interscience Publishers New York, 1966. [61] Kim A Keating, James F Quinn, Michael A Ivie, and LaDonna L Ivie. Esti- mating the effectiveness of further sampling in species inventories. Ecological Applications, 8(4):1239–1249, 1998. 72 [62] Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg. TopHat2: accurate alignment of transcriptomes in the pres- ence of insertions, deletions and gene fusions. Genome Biol, 14(4):R36, 2013. [63] Teemu Kivioja, Anna V¨ ah¨ arautio, Kasper Karlsson, Martin Bonke, Martin Enge, Sten Linnarsson, and Jussi Taipale. Counting absolute numbers of molecules using unique molecular identifiers. Nature methods, 9(1):72–74, 2012. [64] Martin Knott. Models for cataloguing problems. The Annals of Mathematical Statistics, 38(4):1255–1260, 1967. [65] Donald E Knuth. The art of computer programming, volume 2: seminumerical algorithms. Reading, Ma, 1997. [66] Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong Shen, Michael D McLellan, Ling Lin, Christopher A Miller, Elaine R Mardis, Li Ding, and Richard K Wilson. Varscan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research, 22(3):568–576, 2012. [67] Ian Kroes, Paul W Lepp, and David A Relman. Bacterial diversity within the human subgingival crevice. Proceedings of the National Academy of Sciences, 96(25):14547–14552, 1999. [68] Eric S Lander and Michael S Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–239, 1988. [69] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357–359, 2012. [70] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, 2010. [71] William A Link. Nonidentifiability of population size from capture-recapture data with heterogeneous detection probabilities. Biometrics, 59(4):1123–1130, 2003. [72] Ryan Lister, Eran A Mukamel, Joseph R Nery, Mark Urich, Clare A Puddifoot, Nicholas D Johnson, Jacinta Lucero, Yun Huang, Andrew J Dwork, Matthew D Schultz, et al. Global epigenomic reconfiguration during mammalian brain devel- opment. Science, 341(6146), 2013. [73] Ryan Lister, Mattia Pelizzola, Yasuyuki S Kida, R David Hawkins, Joseph R Nery, Gary Hon, Jessica Antosiewicz-Bourget, Ronan OMalley, Rosa Castanon, Sarit Klugman, et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature, 471(7336):68–73, 2011. 73 [74] Manuel E Lladser, Ra´ ul Gouet, and Jens Reeder. Extrapolation of urn models via poissonization: Accurate measurements of the microbial unknown. PloS one, 6(6):e21105, 2011. [75] Ditte Lovatt, Brittani K Ruble, Jaehee Lee, Hannah Dueck, Tae Kyung Kim, Stephen Fisher, Chantal Francis, Jennifer M Spaethling, John A Wolf, M Sean Grady, et al. 
Transcriptome in vivo analysis (TIVA) of spatially defined single cells in live tissue. Nature Methods, 2014.

[76] Sijia Lu, Chenghang Zong, Wei Fan, Mingyu Yang, Jinsen Li, Alec R Chapman, Ping Zhu, Xuesong Hu, Liya Xu, Liying Yan, et al. Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing. Science, 338(6114):1627–1630, 2012.

[77] J Ma, V Rokhlin, and Stephen Wandzura. Generalized Gaussian quadrature rules for systems of arbitrary functions. SIAM Journal on Numerical Analysis, 33(3):971–996, 1996.

[78] Chang Xuan Mao. Predicting the conditional probability of discovering a new class. Journal of the American Statistical Association, 99(468), 2004.

[79] Chang Xuan Mao. Inference on the number of species through geometric lower bounds. Journal of the American Statistical Association, 101(476), 2006.

[80] Georgi K Marinov, Brian A Williams, Ken McCue, Gary P Schroth, Jason Gertz, Richard M Myers, and Barbara J Wold. From single-cell to cell-pool transcriptomes: Stochasticity in gene expression and RNA splicing. Genome Research, 24(3):496–510, 2014.

[81] JH McCabe. The quotient-difference algorithm and the Padé table: an alternative form and a general continued fraction. Mathematics of Computation, 41(163):183–197, 1983.

[82] Tim R Mercer, Daniel J Gerhardt, Marcel E Dinger, Joanna Crawford, Cole Trapnell, Jeffrey A Jeddeloh, John S Mattick, and John L Rinn. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nature Biotechnology, 30(1):99–104, 2012.

[83] Antoine Molaro, Emily Hodges, Fang Fang, Qiang Song, W Richard McCombie, Gregory J Hannon, and Andrew D Smith. Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates. Cell, 146(6):1029–1041, 2011.

[84] Xiaohui Ni, Minglei Zhuo, Zhe Su, Jianchun Duan, Yan Gao, Zhijie Wang, Chenghang Zong, Hua Bai, Alec R Chapman, Jun Zhao, et al. Reproducible copy number variation patterns among single circulating tumor cells of lung cancer patients. Proceedings of the National Academy of Sciences, 110(52):21083–21088, 2013.

[85] James L Norris and Kenneth H Pollock. Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environmental and Ecological Statistics, 5(4):391–402, 1998.

[86] Robert Pinard, Alex de Winter, Gary J Sarkis, Mark B Gerstein, Karrie R Tartaro, Ramona N Plant, Michael Egholm, Jonathan M Rothberg, and John H Leamon. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics, 7(1):216, 2006.

[87] Aaron R Quinlan and Ira M Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, 2010.

[88] Harlan S Robins, Paulo V Campregher, Santosh K Srivastava, Abigail Wacher, Cameron J Turtle, Orsalem Kahsai, Stanley R Riddell, Edus H Warren, and Christopher S Carlson. Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells. Blood, 114(19):4099–4107, 2009.

[89] RA Sack and AF Donovan. An algorithm for Gaussian quadrature given modified moments. Numerische Mathematik, 18(5):465–478, 1971.

[90] Lalitha Sanathanan. Estimating the size of a truncated sample. Journal of the American Statistical Association, 72(359):669–672, 1977.

[91] Yohei Sasagawa, Itoshi Nikaido, Tetsutaro Hayashi, Hiroki Danno, Kenichiro D Uno, Takeshi Imai, and Hiroki R Ueda.
Quartz-Seq: a highly reproducible and sensitive single-cell RNA sequencing method, reveals non-genetic gene-expression heterogeneity. Genome Biology, 14(4):R31, 2013.

[92] Ehud Shapiro, Tamir Biezuner, and Sten Linnarsson. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics, 2013.

[93] Barry Simon. The classical moment problem as a self-adjoint finite difference operator. Advances in Mathematics, 137(1):82–203, 1998.

[94] David Sims, Ian Sudbery, Nicholas E Ilott, Andreas Heger, and Chris P Ponting. Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics, 15(2):121–132, 2014.

[95] Andrew D Smith. Personal correspondence. May 3, 2014.

[96] Andrew D Smith, Wen-Yu Chung, Emily Hodges, Jude Kendall, Greg Hannon, James Hicks, Zhenyu Xuan, and Michael Q Zhang. Updates to the RMAP short-read mapping software. Bioinformatics, 25(21):2841–2842, 2009.

[97] William Squire and George Trapp. Using complex variables to estimate derivatives of real functions. SIAM Review, 40(1):110–112, 1998.

[98] Gábor Szegő. Orthogonal Polynomials, volume 23. American Mathematical Society, 1975.

[99] Sonia Tarazona, Fernando García-Alcalde, Joaquín Dopazo, Alberto Ferrer, and Ana Conesa. Differential expression in RNA-seq: a matter of depth. Genome Research, 21(12):2213–2223, 2011.

[100] Chiharu Tayama, Valeria Romanelli, Alex Martin-Trujillo, Isabel Iglesias-Platas, Kohji Okamura, Naoko Sugahara, Carlos Simón, Harry Moore, Julie V Harness, Hans Keirstead, et al. Genome-wide parent-of-origin DNA methylation analysis reveals the intricacies of human imprinting and suggests a germline methylation-independent mechanism of establishment. Genome Research, 24(4):554–569, 2014.

[101] Pafnuty Tchébychev. Sur l'interpolation par la méthode des moindres carrés. Eggers et Comp., 1859.

[102] Cole Trapnell, Lior Pachter, and Steven L Salzberg. TopHat: discovering splice junctions with RNA-seq. Bioinformatics, 25(9):1105–1111, 2009.

[103] Evgenij E Tyrtyshnikov. How bad are Hankel matrices? Numerische Mathematik, 67(2):261–269, 1994.

[104] Aaron B Wagner, Pramod Viswanath, and Sanjeev R Kulkarni. Probability estimation in the rare-events regime. IEEE Transactions on Information Theory, 57(6):3207–3229, 2011.

[105] Chuan Wang, Johanna K Sandling, Niklas Hagberg, Olof Berggren, Snaevar Sigurdsson, Olof Karlberg, Lars Rönnblom, Maija-Leena Eloranta, and Ann-Christine Syvänen. Genome-wide profiling of target genes for the systemic lupus erythematosus-associated transcription factors IRF5 and STAT4. Annals of the Rheumatic Diseases, 72(1):96–103, 2012.

[106] Ji-Ping Wang. SPECIES: an R package for species richness estimation. Journal of Statistical Software, 40(9):1–15, 2011.

[107] Ji-Ping Z Wang and Bruce G Lindsay. A penalized nonparametric maximum likelihood approach to species richness estimation. Journal of the American Statistical Association, 100(471):942–959, 2005.

[108] Jianbin Wang, H Christina Fan, Barry Behr, and Stephen R Quake. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell, 150(2):402–412, 2012.

[109] David S Watkins. Understanding the QR algorithm. SIAM Review, 24(4):427–440, 1982.

[110] John C Wheeler. Modified moments and Gaussian quadrature. Rocky Mountain Journal of Mathematics, 4(2):287–296, 1974.

[111] A Willis and John Bunge. A ratio-based method for estimating an unknown number of classes. arXiv preprint arXiv:1408.3333, 2014.
[112] Xun Xu, Yong Hou, Xuyang Yin, Li Bao, Aifa Tang, Luting Song, Fuqiang Li, Shirley Tsang, Kui Wu, Hanjie Wu, et al. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell, 148(5):886–895, 2012.

[113] Chang Xuan Mao, Robert K Colwell, and Jing Chang. Estimating the species accumulation curve using mixtures. Biometrics, 61(2):433–441, 2005.

[114] Zhiyuan Zhai, Gesine Reinert, Kai Song, Michael S Waterman, Yihui Luan, and Fengzhu Sun. Normal and compound Poisson approximations for pattern occurrences in NGS reads. Journal of Computational Biology, 19(6):839–854, 2012.

[115] Michael J Ziller, Hongcang Gu, Fabian Müller, Julie Donaghey, Linus T-Y Tsai, Oliver Kohlbacher, Philip L De Jager, Evan D Rosen, David A Bennett, Bradley E Bernstein, et al. Charting a dynamic DNA methylation landscape of the human genome. Nature, 500(7463):477–481, 2013.

[116] Chenghang Zong, Sijia Lu, Alec R Chapman, and X Sunney Xie. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338(6114):1622–1626, 2012.

Appendix A

A.1 All binning strategies are biased

We define a binning based estimator as an estimator of genome coverage that treats all positions in a non-overlapping bin as equal: if the bin is said to be covered by a read, then all positions in the bin are considered covered by that read. We consider a binning based estimator to be unbiased for a single read if the probability that a random read covers the bin is equal to the sum of the coverage probabilities of all positions in the bin. Our intuition is that no binning based estimator can be unbiased for a single read and for multiple reads simultaneously. The underlying reason is Jensen's inequality: when moving from a single read to multiple reads, the relevant transformation of the per-position coverage probability is convex, and this prevents any non-trivial binning based estimator from being unbiased in both cases. We prove this for all non-trivial binning based estimators in a simplified model of sequencing.

Suppose that every position is independently covered with probability $p_i$, $i = 1, \ldots, G$. Instead of considering $N$ reads of length $L$, we consider $M = NL$ reads of length 1. Let $\mathcal{B}$ denote the binning based estimator that partitions the genome into $B < G$ non-overlapping bins $\beta_1, \ldots, \beta_B$. $\mathcal{B}$ is unbiased for a single read if for every $j \in \{1, \ldots, B\}$ the probability that a random read covers bin $\beta_j$, denoted $\tilde{p}_j$, equals the sum of the probabilities that the read covers the positions in the bin, $\sum_{i \in \beta_j} p_i$. The expected number of positions not covered by sequencing a single read is therefore equal to
$$
\sum_{j=1}^{B} |\beta_j| \Big( 1 - \sum_{i \in \beta_j} p_i \big/ |\beta_j| \Big)
= \sum_{j=1}^{B} \sum_{i=1}^{G} \mathbf{1}(i \in \beta_j)\big( 1 - \tilde{p}_j / |\beta_j| \big).
$$
Effectively, the binning strategy averages the sequencing probabilities of all positions in each bin.

Proposition 1. Under a multinomial model of sequencing single base pair reads, any binning based estimator that is unbiased for a single read is unbiased for multiple reads if and only if, within every bin, all positions have equal sequencing probability.

Proof. Suppose that the binning based estimator $\mathcal{B}$ is unbiased for a single read, and consider $M > 1$ sequenced bases. The probability that bin $\beta_j$ is not covered by any of the $M$ reads is $\big( 1 - \sum_{i \in \beta_j} p_i \big)^M$. The binning based estimator averages the coverage probabilities within each bin, so its expected number of uncovered positions is
$$
\sum_{j=1}^{B} |\beta_j| \big( 1 - \tilde{p}_j / |\beta_j| \big)^M
= \sum_{j=1}^{B} \sum_{i=1}^{G} \mathbf{1}(i \in \beta_j)\big( 1 - \tilde{p}_j / |\beta_j| \big)^M.
$$
Note that $(1 - x)^M$ is a convex function of $x$ on $(0, 1)$. We apply Jensen's inequality within each bin, using $\tilde{p}_j = \sum_{i \in \beta_j} p_i$, to infer that for every $j \in \{1, \ldots, B\}$,
$$
\sum_{i=1}^{G} \mathbf{1}(i \in \beta_j)\big( 1 - \tilde{p}_j / |\beta_j| \big)^M
\le \sum_{i=1}^{G} \mathbf{1}(i \in \beta_j)\big( 1 - p_i \big)^M.
$$
The inequality is strict unless the coverage probability of every position in bin $\beta_j$ is equal. Summing the inequalities over all the bins gives
$$
\sum_{j=1}^{B} \sum_{i=1}^{G} \mathbf{1}(i \in \beta_j)\big( 1 - \tilde{p}_j / |\beta_j| \big)^M
\le \sum_{i=1}^{G} \big( 1 - p_i \big)^M,
$$
where the right hand side is the true expected number of uncovered positions. The inequality is strict unless the coverage probabilities within every bin are equal.

Trivially, if all sequencing probabilities are equal, then any binning based estimator has equal sequencing probabilities within every bin. In the case of heterogeneous sequencing probabilities, a binning based estimator can have equal within-bin probabilities only if the sequencing probabilities are known beforehand, which would mean that the bias introduced in the sequencing process is known. The properties of sequencing bias remain an area of intense study: accounting for bias can benefit analysis significantly, but bias is still not completely characterized [94].
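The bias in Proposition 1 is easy to see numerically. The following is a minimal sketch (not part of the original analysis), using arbitrary simulated per-position probabilities: it compares the true expected number of uncovered positions with the bin-averaged quantity used above, which agree for a single read but diverge once M > 1.

```python
import numpy as np

rng = np.random.default_rng(1)

G, bin_size, M = 1_000, 10, 50          # positions, bin width, number of sequenced bases
p = rng.dirichlet(np.ones(G)) * 0.9     # heterogeneous per-position coverage probabilities

p_bar = p.reshape(-1, bin_size).mean(axis=1)   # bin-averaged probability for each bin

# Single read: the bin-averaged and per-position expectations agree exactly.
print(np.sum(1.0 - p), bin_size * np.sum(1.0 - p_bar))

# M reads: by Jensen's inequality the bin-averaged expectation is smaller whenever
# probabilities vary within a bin, so binning underestimates the uncovered positions.
print(np.sum((1.0 - p) ** M), bin_size * np.sum((1.0 - p_bar) ** M))
```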
A.2 A class of orthogonal polynomials that extend generalized Laguerre polynomials

The standard monic generalized Laguerre polynomials [98], indexed by a parameter $r > -1$, are defined by orthogonality with respect to the measure $e^{-x} x^{r}\,dx$ on $(0, \infty)$. They are given by
$$
L_n^{(r)}(x) = \sum_{l=0}^{n} \binom{n+r}{n-l} \frac{n!}{l!} (-1)^{n+l} x^{l}
= (-1)^n e^{x} x^{-r} \left(\frac{d}{dx}\right)^{n} \left( e^{-x} x^{n+r} \right),
\tag{A.1}
$$
with $\binom{n+r}{n-l} = \frac{\Gamma(n+r+1)}{\Gamma(n-l+1)\,\Gamma(l+r+1)}$. Additionally, they are the only polynomial solutions of the differential equation
$$
x y'' + (r + 1 - x) y' + n y = 0, \qquad y = L_n^{(r)}(x).
\tag{A.2}
$$
We take particular interest in the three-term recurrence relation satisfied by these orthogonal polynomials,
$$
L_{n+1}^{(r)}(x) = (x - \alpha_n) L_n^{(r)}(x) - \beta_n L_{n-1}^{(r)}(x),
\qquad \alpha_n = 2n + r + 1, \quad \beta_n = (n+r)\,n.
\tag{A.3}
$$

Here we discuss a system of monic orthogonal polynomials that extends the generalized Laguerre polynomials, which we shall call the extended Laguerre polynomials and denote by $\{ L_n^{(r,\lambda)}(x) \}_{n \ge 0}$. For parameters $r$ and $\lambda$, the monic extended Laguerre polynomials are defined by orthogonality with respect to the measure $x^{r} e^{-\lambda x}\,dx$. We derive the corresponding expansion, Rodrigues' formula, and the differential equation that they solve.

A.2.1 Rodrigues' formula

The central object is an expansion of $L_n^{(r,\lambda)}(x)$ analogous to equation (A.1), given by the Rodrigues-type formula
$$
L_n^{(r,\lambda)}(x)
= \frac{(-1)^n}{\lambda^{n}}\, x^{-r} e^{\lambda x} \left(\frac{d}{dx}\right)^{n} \left( e^{-\lambda x} x^{n+r} \right)
= \sum_{l=0}^{n} \binom{n+r}{n-l} \frac{n!}{l!}\, \lambda^{l-n} (-1)^{n+l} x^{l}.
\tag{A.4}
$$
We prove the following lemma to establish the orthogonality of the polynomials given above.

Lemma 2. Let $L_n^{(r,\lambda)}(x)$ be given by equation (A.4) for $n = 0, 1, \ldots$. Then $\{ L_n^{(r,\lambda)}(x) \}_{n \ge 0}$ is the unique system of monic orthogonal polynomials with respect to $d\mu(x) = x^{r} e^{-\lambda x}\,dx$ for $r > -1$ and $\lambda > 0$.

Proof. We first show that for fixed $n$, $L_n^{(r,\lambda)}(x)$ is orthogonal to any polynomial of lower order. Let $q_m(x)$ be a polynomial of order $m < n$ and let $C$ denote an arbitrary constant that may change from use to use. Integration by parts yields
$$
\int_0^{\infty} L_n^{(r,\lambda)}(x)\, q_m(x)\, d\mu(x)
= C \int_0^{\infty} x^{-r} e^{\lambda x} \left[ \left(\frac{d}{dx}\right)^{n} \left( e^{-\lambda x} x^{n+r} \right) \right] q_m(x)\, x^{r} e^{-\lambda x}\,dx
$$
$$
= C \left[ \left(\frac{d}{dx}\right)^{n-1} \left( e^{-\lambda x} x^{n+r} \right) q_m(x) \right]_0^{\infty}
- C \int_0^{\infty} \left(\frac{d}{dx}\right)^{n-1} \left( e^{-\lambda x} x^{n+r} \right) \frac{d}{dx} q_m(x)\,dx.
$$
Since $q_m$ is a polynomial of order $m$, we can repeat this $m+1$ times, after which the derivative of $q_m$ vanishes and only boundary terms remain:
$$
\int_0^{\infty} L_n^{(r,\lambda)}(x)\, q_m(x)\, d\mu(x)
= \sum_{j=1}^{m+1} C (-1)^{j} \left[ \left(\frac{d}{dx}\right)^{n-j} \left( e^{-\lambda x} x^{n+r} \right) \left(\frac{d}{dx}\right)^{j-1} q_m(x) \right]_0^{\infty}.
$$
The integral is therefore zero provided $\left[ \left(\frac{d}{dx}\right)^{j} \left( e^{-\lambda x} x^{n+r} \right) \right]_0^{\infty} = 0$ for every $j < n$. By the general Leibniz rule,
$$
\left[ \left(\frac{d}{dx}\right)^{j} \left( e^{-\lambda x} x^{n+r} \right) \right]_0^{\infty}
= \sum_{l=0}^{j} \binom{j}{l} \left[ \left(\frac{d}{dx}\right)^{j-l} e^{-\lambda x} \cdot \left(\frac{d}{dx}\right)^{l} x^{n+r} \right]_0^{\infty}
= \sum_{l=0}^{j} \binom{j}{l} (-1)^{j-l} \lambda^{j-l} \left[ e^{-\lambda x} \left( \prod_{i=0}^{l-1} (n+r-i) \right) x^{n+r-l} \right]_0^{\infty}.
$$
Every term in the sum equals zero as long as $r > -1$ and $j < n$, implying that the entire sum equals zero, which proves the orthogonality of $L_n^{(r,\lambda)}(x)$ under $d\mu(x)$.

We can define an inner product associated with $d\mu(x)$ on the polynomials by $(p(x), q(x)) = \int_0^{\infty} p(x) q(x)\, d\mu(x)$. Since $d\mu(x)$ is a positive measure on the positive reals, this inner product is positive definite. We can generate orthogonal polynomials $\{\pi_n(x)\}_{n \ge 0}$ with respect to $d\mu(x)$ by Gram–Schmidt orthogonalization, with
$$
\pi_n(x) = x^{n} - \sum_{l=0}^{n-1} \pi_l(x)\, \frac{\big( x^{n}, \pi_l(x) \big)}{\big( \pi_l(x), \pi_l(x) \big)}.
$$
The positive definiteness of $d\mu(x)$ implies that each denominator above is positive, and therefore $\pi_n(x)$ is unique for each $n$ and equal to $L_n^{(r,\lambda)}(x)$.

A.2.2 Three-term recurrence

Lemma 3. Let $\alpha_n = \frac{2n + 1 + r}{\lambda}$ for $n \ge 0$ and $\beta_n = \frac{(n+r)\,n}{\lambda^{2}}$ for $n \ge 1$, with $\beta_0 = 1$. Then $L_n^{(r,\lambda)}(x)$ satisfies
$$
L_{n+1}^{(r,\lambda)}(x) = (x - \alpha_n)\, L_n^{(r,\lambda)}(x) - \beta_n\, L_{n-1}^{(r,\lambda)}(x)
$$
with initial conditions $L_{-1}^{(r,\lambda)}(x) = 0$ and $L_0^{(r,\lambda)}(x) = 1$.

Proof. We prove this by direct calculation using the expansion on the far right hand side of equation (A.4):
$$
\left( x - \frac{2n+1+r}{\lambda} \right) L_n^{(r,\lambda)}(x) - \frac{(n+r)\,n}{\lambda^{2}} L_{n-1}^{(r,\lambda)}(x)
= \left( x - \frac{2n+1+r}{\lambda} \right) \sum_{l=0}^{n} \binom{n+r}{n-l} \frac{n!}{l!} \lambda^{l-n} (-1)^{n+l} x^{l}
- \frac{(n+r)\,n}{\lambda^{2}} \sum_{l=0}^{n-1} \binom{n-1+r}{n-1-l} \frac{(n-1)!}{l!} \lambda^{l-n+1} (-1)^{n-1+l} x^{l}.
$$
Group the result by the degree of $x$. The highest order term is $x^{n+1}$, with coefficient 1. The coefficient of $x^{n}$ is
$$
-\binom{n+r}{1} \frac{n!}{(n-1)!} \lambda^{-1} - (2n+1+r) \binom{n+r}{0} \frac{n!}{n!} (-1)^{2n} \lambda^{-1}
= -\big( (n+r)\,n + 2n + 1 + r \big) \lambda^{-1}
= -(n+1+r)(n+1)\, \lambda^{-1}
= -\binom{n+1+r}{1} \frac{(n+1)!}{n!} \lambda^{-1},
$$
which is the coefficient of $x^{n}$ in $L_{n+1}^{(r,\lambda)}(x)$. The constant term is equal to
$$
(2n+1+r) \binom{n+r}{n} \frac{n!}{0!}\, \lambda^{-n-1} (-1)^{n+1}
+ (n+r)\,n \binom{n-1+r}{n-1} \frac{(n-1)!}{0!}\, \lambda^{-n-1} (-1)^{n}
= \frac{\Gamma(n+r+1)}{\Gamma(r+1)}\, \lambda^{-n-1} (-1)^{n+1} \big( 2n+1+r-n \big)
= \binom{n+1+r}{n+1} (n+1)!\, \lambda^{-n-1} (-1)^{n+1},
\tag{A.5}
$$
which is the constant term of $L_{n+1}^{(r,\lambda)}(x)$. Finally we compute the coefficient of $x^{l}$ for $l = 1, \ldots, n-1$:
$$
\binom{n+r}{n-(l-1)} \frac{n!}{(l-1)!}\, \lambda^{l-1-n} (-1)^{n+l-1}
+ (2n+1+r) \binom{n+r}{n-l} \frac{n!}{l!}\, \lambda^{l-n-1} (-1)^{n+l+1}
+ (n+r)\,n \binom{n-1+r}{n-1-l} \frac{(n-1)!}{l!}\, \lambda^{l-n-1} (-1)^{n+l}.
\tag{A.6}
$$
Factoring out the common terms, $\frac{\Gamma(n+r+1)\, n!}{(n-1-l)!\, \Gamma(l+r)\, (l-1)!}\, \lambda^{l-(n+1)} (-1)^{n+1+l}$, the remainder is
$$
\frac{1}{(n+1-l)(n-l)} + \frac{2n+1+r}{(n-l)(l+r)\,l} - \frac{1}{l\,(l+r)}
= \frac{(l+r)\,l + (2n+1+r)(n+1-l) - (n+1-l)(n-l)}{(n+1-l)(n-l)(l+r)\,l}
= \frac{(n+1+r)(n+1)}{(n+1-l)(n-l)(l+r)\,l}.
$$
Combining the reduced remainder with the common terms, we arrive at the correct coefficient of the $x^{l}$ term of $L_{n+1}^{(r,\lambda)}(x)$, namely $\binom{n+1+r}{n+1-l} \frac{(n+1)!}{l!}\, \lambda^{l-(n+1)} (-1)^{n+1+l}$, which completes the proof.
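As a numerical sanity check on Lemma 3 (not part of the thesis text), the sketch below builds $L_n^{(r,\lambda)}$ from the explicit expansion in equation (A.4) and verifies the three-term recurrence at a few points; the parameter values are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

def ext_laguerre_coeffs(n, r, lam):
    """Coefficients (degrees 0..n) of the monic extended Laguerre polynomial
    L_n^{(r,lam)}, taken from the explicit expansion in equation (A.4)."""
    l = np.arange(n + 1)
    log_mag = (gammaln(n + r + 1) - gammaln(n - l + 1) - gammaln(l + r + 1)   # binomial term
               + gammaln(n + 1) - gammaln(l + 1)                              # n!/l!
               + (l - n) * np.log(lam))                                       # lam^(l-n)
    return ((-1.0) ** (n + l)) * np.exp(log_mag)

def evaluate(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

n, r, lam = 5, 1.5, 2.0
x = np.linspace(0.1, 4.0, 7)

alpha_n = (2 * n + 1 + r) / lam
beta_n = (n + r) * n / lam ** 2

lhs = evaluate(ext_laguerre_coeffs(n + 1, r, lam), x)
rhs = ((x - alpha_n) * evaluate(ext_laguerre_coeffs(n, r, lam), x)
       - beta_n * evaluate(ext_laguerre_coeffs(n - 1, r, lam), x))

print(np.max(np.abs(lhs - rhs)))   # differences at the level of floating-point round-off
```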
A.2.3 Second order differential equation

Here we show that the system $\{ L_n^{(r,\lambda)} \}_{n \ge 0}$ solves a linear homogeneous differential equation analogous to equation (A.2), as summarized in the following lemma.

Lemma 4. $y = L_n^{(r,\lambda)}(x)$ solves the second order differential equation
$$
x y'' + (r + 1 - \lambda x) y' + n \lambda y = 0.
\tag{A.7}
$$

Proof. Differentiating the expansion in equation (A.4),
$$
\frac{d}{dx} L_n^{(r,\lambda)}(x) = \sum_{l=0}^{n-1} \binom{n+r}{n-(l+1)} \frac{n!}{l!} (-1)^{n+l+1} \lambda^{l+1-n} x^{l},
\qquad
\frac{d^2}{dx^2} L_n^{(r,\lambda)}(x) = \sum_{l=0}^{n-2} \binom{n+r}{n-(l+2)} \frac{n!}{l!} (-1)^{n+l} \lambda^{l+2-n} x^{l}.
$$
Substituting into equation (A.7) and grouping by the power of $x$ yields three cases for the coefficient: the highest order term ($l = n$), the constant term ($l = 0$), and the terms in between ($1 \le l \le n-1$). The collected coefficient of the highest order term is
$$
-\lambda \binom{n+r}{0} \frac{n!}{(n-1)!} (-1)^{2n} \lambda^{0}
+ n \lambda \binom{n+r}{0} \frac{n!}{n!} (-1)^{2n} \lambda^{0} = 0.
$$
The collected constant term is
$$
(r+1) \binom{n+r}{n-1} \frac{n!}{0!} (-1)^{n+1} \lambda^{1-n}
+ n \lambda \binom{n+r}{n} \frac{n!}{0!} (-1)^{n} \lambda^{-n} = 0.
$$
For fixed $l$ between 1 and $n-1$, the collected coefficient is
$$
\binom{n+r}{n-(l+1)} \frac{n!}{(l-1)!} (-1)^{n+l+1} \lambda^{l+1-n}
+ (r+1) \binom{n+r}{n-(l+1)} \frac{n!}{l!} (-1)^{n+l+1} \lambda^{l+1-n}
- \lambda \binom{n+r}{n-l} \frac{n!}{(l-1)!} (-1)^{n+l} \lambda^{l-n}
+ n \lambda \binom{n+r}{n-l} \frac{n!}{l!} (-1)^{n+l} \lambda^{l-n}
$$
$$
= \frac{\Gamma(n+r+1)\, n!\, (-1)^{n+l}\, \lambda^{l+1-n}}{(n-l-1)!\, \Gamma(l+r+1)\, (l-1)!}
\left( -\frac{1}{l+1+r} - \frac{r+1}{(l+1+r)\,l} - \frac{1}{n-l} + \frac{n}{(n-l)\,l} \right) = 0,
$$
since the first three terms in the parentheses sum to
$$
-\frac{l\,(n-l) + (r+1)(n-l) + (l+1+r)\,l}{(l+1+r)\,l\,(n-l)} = -\frac{n}{l\,(n-l)}.
$$
Therefore the term in the parentheses is zero and the lemma is proven.
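Lemma 4 can be checked the same way; the sketch below (again illustrative only, with arbitrary parameter values) constructs $L_n^{(r,\lambda)}$ as a numpy polynomial and evaluates the residual of equation (A.7).

```python
import numpy as np
from numpy.polynomial import Polynomial
from scipy.special import gammaln

def ext_laguerre(n, r, lam):
    """Monic extended Laguerre polynomial L_n^{(r,lam)} as a numpy Polynomial,
    built from the explicit expansion in equation (A.4)."""
    l = np.arange(n + 1)
    log_mag = (gammaln(n + r + 1) - gammaln(n - l + 1) - gammaln(l + r + 1)
               + gammaln(n + 1) - gammaln(l + 1) + (l - n) * np.log(lam))
    return Polynomial(((-1.0) ** (n + l)) * np.exp(log_mag))

n, r, lam = 6, 0.5, 3.0
y = ext_laguerre(n, r, lam)
x = np.linspace(0.05, 5.0, 9)

# Residual of x*y'' + (r + 1 - lam*x)*y' + n*lam*y, which Lemma 4 says is identically zero.
residual = x * y.deriv(2)(x) + (r + 1 - lam * x) * y.deriv(1)(x) + n * lam * y(x)
print(np.max(np.abs(residual)))   # zero up to floating-point round-off
```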
Abstract
Consider an experiment in which observations are sampled from an unknown population made up of a finite but unknown number of classes. The class membership of an observation can be determined upon sampling, but the unobserved classes are completely unknown. For example, when capturing butterflies to assess species diversity, it is not difficult to recognize that a captured butterfly belongs to a new species, but it is impossible to know beforehand what an unseen species will look like. This is commonly known as a capture-recapture experiment. The classical application is to estimate species diversity, i.e. the total number of distinct species, without exhaustive sampling. Additional questions include evaluating the cost versus benefit of additional sampling or of continuing the experiment, where benefit is measured in the number of distinct classes observed, and estimating the relative abundances of classes, including the unobserved ones.

The generality of the capture-recapture model results in a wide range of applications, a large number of which can be found in Bunge & Fitzpatrick (1993). More recent examples include estimating the immunological repertoire, estimating genomic diversity, natural language modeling, and estimating bacterial diversity.

Our interest in capture-recapture arises in the context of high-throughput next-generation sequencing experiments. In these experiments a sequencing library is generated by first obtaining genomic material and then possibly subjecting it to processing steps such as poly(A) selection for mRNA-seq or chromatin immunoprecipitation for ChIP-seq. The genomic material is amplified so that thousands or millions of copies of each original genomic molecule are contained in the library. Finally, the library is sequenced by capturing or sampling molecules and identifying portions of their genomic sequence. The obtained sequences contain errors arising from miscalled bases, misincorporated nucleotides, or mutations introduced during amplification, so mapping to a reference genome or sequence correction is necessary to identify duplicate reads or molecules.

The model we assume throughout is that the number of sequenced reads mapping to each position in the genome follows a Poisson process, with the possibility that some positions are not present in the library. If all positions are uniformly represented in the library, then the number of sequenced reads from each position follows a Poisson distribution. This is commonly referred to as the Lander-Waterman model of sequencing, originally derived to enable planning of fingerprinting schemes for genome assembly, where the genome was built by overlapping long genomic segments sequenced by Sanger technology. With a reference genome the need for overlap analysis is gone, so the distribution of the number of reads covering each position is Poisson. In the context of high-throughput sequencing, the uniformity assumption is lost due to a myriad of biases, requiring more complicated models that can accommodate them. Additionally, there may be portions of the genome not represented in the library; one example is mRNA-seq, where regions that are not transcribed will be absent.

In order to accommodate arbitrary biases and the possible absence of molecules or loci from a sequencing library, we consider a non-parametric Poisson model for sequencing.
We assume that the number of reads from each position follows a compound Poisson process, with rates independently and identically distributed according to some unknown compounding distribution. Furthermore, we allow the possibility that some portions of the genome are missing from the library and can never be sequenced. Our goal is to keep the compounding distribution as general as possible. Under this model we consider the following questions: What is the gain in new reads or new loci with deeper sequencing? What is the gain in the proportion of the genome covered with deeper sequencing? How many more reads or loci will be observed at least k times, for some k > 1, with deeper sequencing? How does the total relative abundance of the unobserved reads or loci change with deeper sequencing? How many total distinct reads or loci are contained or represented in the library?

Non-parametric empirical Bayes (NPEB) estimators for the first four questions are presented here. These estimators are unbiased, or are unbiased for the identifiable portions of the estimated quantities. They take the form of power series in a variable t, called the fold extrapolation, that measures how far the prediction extends relative to the observed experiment. Because these power series are typically alternating, the estimators suffer from extreme variance and instability at large fold extrapolations. We introduce rational function approximations to obtain globally stable estimates that are extremely accurate in practice, allowing accurate estimation at far larger extrapolations than previously considered.

The last question is commonly known as the species diversity problem in the capture-recapture literature. It is notoriously difficult without severe restrictions on the form of the compounding distribution. The number of unobserved reads is non-identifiable, and therefore no unbiased estimator can exist in the non-parametric setting; it remains non-identifiable even when the compounding distribution is restricted to one of two families, such as gamma distributions or finite mixtures. By reformulating the problem as bounding an integral over an unknown distribution given estimated moments of that distribution, we can use Gaussian quadrature to obtain lower bounds that use significantly more information than previous methods.

Note that most of the above questions can be asked in the context of general capture-recapture experiments if one replaces reads or loci by species and sequencing by sampling. The difference is that sequencing experiments produce data orders of magnitude larger than traditional capture-recapture experiments. Our algorithms exploit the scale of high-throughput sequencing and so are suitable for large capture-recapture experiments.

Chapter 2 discusses the first question and corresponds to the paper of Daley & Smith (2013). Chapter 3 discusses the second question and can also be found in Daley & Smith (2014). The rest of the material is new.
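To make the instability of the truncated power series concrete, the following is a minimal illustrative sketch (not the estimator implementation from this work): it simulates a hypothetical gamma-Poisson library, forms the counts histogram, and evaluates the classical Good-Toulmin alternating series [47] for the expected number of new distinct molecules at several fold extrapolations t. All population parameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical library: 50,000 distinct molecules with gamma-distributed Poisson rates.
lam = rng.gamma(shape=0.5, scale=2.0, size=50_000)
counts = rng.poisson(lam)                  # reads per molecule in the initial experiment
hist = np.bincount(counts)[1:]             # n_j = number of molecules observed exactly j times

def good_toulmin_new(hist, t):
    """Truncated Good-Toulmin alternating series for the expected number of previously
    unseen molecules when the experiment is scaled up by a factor of (1 + t)."""
    j = np.arange(1, hist.size + 1)
    return np.sum(((-1.0) ** (j + 1)) * (t ** j) * hist)

for t in (0.5, 1.0, 2.0, 5.0):
    print(t, good_toulmin_new(hist, t))
# For t <= 1 the series gives sensible estimates; for t > 1 it is dominated by its
# highest-order terms and swings wildly in sign and magnitude, which is exactly the
# instability that the rational function approximations are introduced to remove.
```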