EXCHANGEABLE PAIRS IN STEIN'S METHOD OF DISTRIBUTIONAL APPROXIMATION

by

Nathan Forrest Ross

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MATHEMATICS)

August 2009

Copyright 2009 Nathan Forrest Ross

Acknowledgments

First and foremost I thank my advisor, Jason Fulman, who has given me nearly all that I have mathematically. I consistently discover his influence on the types of problems I enjoy, the references I find useful, and the professional choices I make. I also thank Andrew Barbour, Louis Chen, and Larry Goldstein, all of whom played a part in allowing me to participate in the conference and workshop "Progress in Stein's Method" held at the National University of Singapore, an event that had great impact on me. A final mathematical acknowledgment goes to Adrian Röllin for giving me his take on Stein's method.

On a more personal note, I thank Lerna Pehlivan and Derek Butcher for being available to commiserate over the tribulations of graduate school. I should probably also mention, on the reader's behalf, my brother Nick and his friend Dick, who were a constant source of friendly distraction. Without them, this document would certainly be longer and more technical.

Finally, I thank my family. To list the ways in which they have influenced me would take up an inappropriate amount of space. However, I am especially indebted to the following: my father, whose influence on my personality surprises me to this day; my mother, who showed me the importance of choosing a purpose in life that is worthwhile; my late paternal grandfather, whose memory becomes more pleasurable as it ages; my maternal grandfather, whose support has allowed me to do what I enjoy; and my sisters, whose vigilance in withstanding a Vermont upbringing inspires me daily.

Table of Contents

Acknowledgments
List Of Tables
Abstract
Chapter 1: Introduction
  1.1 Exchangeable Pairs
  1.2 Summary
Chapter 2: Background
  2.1 Notation and Definitions
  2.2 The Fundamental Idea
  2.3 Normal Approximation
  2.4 Poisson Approximation
  2.5 Discussion
  2.6 Other Approaches
    2.6.1 Size Biasing
    2.6.2 Zero Biasing
    2.6.3 Dependency Graphs
    2.6.4 Ad Hoc Methods
Chapter 3: The Effect of the Exchangeable Pair
  3.1 Normal Approximation
  3.2 Binomial Distribution
  3.3 Plancherel Distribution of the Hamming Scheme
  3.4 Plancherel Distribution on a Group
  3.5 Eigenvalue Characterization
  3.6 Poisson Approximation
  3.7 Binomial Distribution
  3.8 Negative Binomial Distribution
Chapter 4: Theorems and Applications
  4.1 Poisson Approximation
    4.1.1 i-cycle Example
    4.1.2 Poisson Lemma
  4.2 Translated Poisson Distribution
  4.3 Smoothing Inequality
    4.3.1 The Sum of Bernoulli Trials
    4.3.2 Coupon Collector's Problem
    4.3.3 Two Runs in a Bernoulli Sequence
    4.3.4 Isolated Vertices of a Random Graph
Bibliography

List Of Tables

2.1 Representation of Common Probability Metrics
3.1 Labeled tableau for λ = (5, 3, 3, 1)
3.2 λ = (5, 3, 3, 1) with box i labeled c_λ(i)

Abstract

Stein's method is a powerful tool used to obtain error terms in distributional approximation and limit theorems. In one formulation, an auxiliary stochastic object termed an exchangeable pair must be created in order to apply the method. This dissertation has two main purposes, the first is to examine the effect that the choice of exchangeable pair has in the computation and quality of the error, and the second is to derive new tools using exchangeable pairs and the main principles of Stein's method. In examining the method, we will focus on examples of a combinatorial and algebraic nature in theorems pertaining to normal and Poisson approximation. For normal approximation, we analyze the role of the exchangeable pair for the binomial distribution and the Plancherel measure of a random walk on the symmetric group. Poisson approximation for exchangeable pairs is less developed than normal approximation, so that the examples presented are the binomial and negative binomial distributions, both of which have representations as sums of independent random variables. The new tools developed here include exchangeable pair formulations for Poisson and translated Poisson approximation which are more in the spirit of existing normal approximation theorems than those available. Also, there have been many recent developments adapting Stein's method to approximation by discrete analogs of the normal distribution which contain a smoothing term that is the main technical difficulty in applications. Exchangeable pairs can be used to derive a simple bound on this term that easily reproduces many results found in the literature.

Chapter 1
Introduction

One of the main purposes of probability theory is to quantitatively determine the distribution of a random object described qualitatively.
However, it may be the case that exactly determining the distribution in the situation under study is computationally or theoretically impossible. In this case, it can be appropriate to substitute a more easily handled distribution for that which is desired. As an example, consider two strands of DNA that are known to be evolutionarily related. What is the distribution of the number of evolutionary changes needed to produce one strand from the other? By choosing a model that re ects all of the underlying genetic processes, it is unlikely that this distribution could be computed precisely, but it may be possible to approximate it by some other simpler distribution. Of course one issue with such a program is the question of what information is lost by using an approximation; that is, how well do the numbers computed in the approximation match the true values sought? Stein's method, initially formulated by Charles Stein [77] and rened in later work [78], can answer this question in a variety of situations, most interestingly where dependence plays a role so that classical Fourier techniques are ill-suited. The main germ behind Stein's method is to replace the characteristic function of a distribution, which has traditionally been used to show distributional equivalence, with a functional operator that characterizes the distribution. This idea has been eusively exploited to yield an assortment of specic methods all of which apply under 1 some form of dependence; for example, size biasing [10, 52], zero biasing [50], and dependency structures [2, 3, 10]. We will discuss some of these techniques and their applications in Section 2.6, but this dissertation will focus on an early variant of the method, that of exchangeable pairs. Initially, the method of exchangeable pairs was used for normal approximation [68, 78], but has expanded to Poisson approximation [22] and translated Poisson approximation [70] (a discrete analog of normal approximation), and is currently being developed for exponential approximation [23], and multivariate normal approximation [24, 61, 67]. Additionally, exchangeable pairs have recently been used to obtain novel concentration inequalities for random variables outside of the typical examples of large deviation theory [18, 19]. In this document, we will discuss exchangeable pairs in normal approximation, Poisson and translated Poisson approximation, and develop some new tools that quantify the \smoothness" of an integer supported distribution. The next section collects some facts and properties of exchangeable pairs. 1.1 Exchangeable Pairs We say (W;W 0 ) is an exchangeable pair of random variables if the the distribution of (W;W 0 ) is equal to the distribution of (W 0 ;W ). In particular, this implies that W and W 0 are identically distributed random variables. The following proposition (which will be used frequently and im- plicitly in the sequel) states a useful property of exchangeable pairs that is at the center of their utility within Stein's method. Proposition 1.1. If the function F :R 2 !R is anti-symmetric (that is, F (x;y) =F (y;x)), then EF (W;W 0 ) = 0. Proof. Exchangeability implies EF (W;W 0 ) = EF (W 0 ;W ), and the anti-symmetry of F implies EF (W;W 0 ) =EF (W 0 ;W ). 2 To apply Stein's method of exchangeable pairs to approximate some random variable W by a chosen target distribution, an exchangeable pair (W;W 0 ) must be constructed that has marginal distributions equal to that of W . 
From this point, the exchangeable pair and an anti-symmetric function can be used to dene a pseudo characterizing operator for W that is then compared to the characterizing operator of the target distribution. Heuristically, if these two operators are close, then W will be approximately equal to the target distribution. We will see later that this heuristic can be made rigorous. As a simple example of what can be achieved using exchangeable pairs, we present here an easy theorem that is the main result of Section 4.3; a discussion of its utility can be found there. For the following, let I fBg be the indicator that the event B occurs, E W denote the conditional expectation with respect to W , andP W (B) =E W I fBg . Theorem 4.8 LetW a random variable with integer support and (W;W 0 ) an exchangeable pair. Then for any AZ, we have jP(W2A)P(W + 12A)j p Var[P W (W 0 =W + 1)] + p Var[P W (W 0 =W 1)] P(W 0 =W + 1) : Proof. By exchangeability and Proposition 1.1, we have for all bounded functions g and any constant c, 0 =E[cI fW 0 =W+1g g(W 0 )cI fW 0 =W1g g(W )] =E[cP W (W 0 =W + 1)g(W + 1)cP W (W 0 =W 1)g(W )]: (1.1) 3 This calculation implies the following. jE[g(W + 1)g(W )j =jE[(1cP W (W 0 =W + 1))g(W + 1) (1cP W (W 0 =W 1))g(W )]j (1.2) jjgjj 1 Ej1cP W (W 0 =W + 1)j +Ej1cP W (W 0 =W 1)j : Choosing c = [P(W 0 = W + 1)] 1 , g(j) = I fj2Ag , and applying the Cauchy-Schwarz inequality proves the lemma. We note here the relation of the proof to the discussion preceding the theorem, as it is a representative outline of what is to come. First observe that the term to be estimated is given by EAg(W ), where the functional operatorA is dened byAg(j) = g(j + 1)g(j) and g is as in the proof. Using the exchangeable pair (W 0 ;W ) and the anti-symmetric function F (W;W 0 ) =cI fW 0 =W+1g g(W 0 )cI fW 0 =W1g g(W ); we dene the \pseudo characterizing operator"A 0 applied to bounded functions g byA 0 g(W ) = E W F (W;W 0 ). 1 Finally, comparing the two operators as in equation (1.2) yields the theorem. With an additional main step, this is essentially the framework of Stein's method for distributional approximation. A nal unaddressed issue with this setup is the question of how to create exchangeable pairs with amenable properties. The typical method of constructing exchangeable pairs of random variables on a denumerable space is through reversible Markov chains. We say the Markov chainfX 0 ;X 1 ;:::g on with transition matrix P is reversible with respect to a distribution 1 The operatorA 0 is termed as such because equation (1.1) implies EA 0 g(W ) = 0. If, in addition, for any random variable X, EA 0 g(X) = 0 implied X d = W , thenA 0 would be called a characterizing operator. This will be discussed in more detail in Chapter 2. 4 on if (i)P ij =(j)P ji for all i;j2 . From this point we have the following proposition that can be used to derive exchangeable pairs in non-trivial settings. Proposition 1.2. Let all notation as in the previous paragraph and W be a random variable on (that is, a function from into R). If X 0 is distributed as , then (W (X 0 );W (X 1 )) is an exchangeable pair. Proof. We will show P(X 0 = i;X 1 = j) = P(X 1 = i;X 0 = j) for all i;j2 , an equation that implies the proposition. To this end, we have P(X 0 =i;X 1 =j) =P(X 0 =i)P ij (1.3) =(i)P ij =(j)P ji (1.4) =P(X 0 =j)P ji =P(X 1 =i;X 0 =j): (1.5) Here (1.3) and (1.5) follow from the fact thatX 0 is distributed as and (1.4) by reversibility. 
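To make Propositions 1.1 and 1.2 concrete, the following small numerical sketch (an aside of my own, not part of the dissertation; the chain, the function W, and all parameter choices are illustrative) builds a Markov chain on a four-point state space that is reversible with respect to a distribution pi, draws X_0 from pi, takes one step to obtain X_1, and checks that the expectation of an anti-symmetric function of the resulting pair is approximately zero:

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.1, 0.2, 0.3, 0.4])               # stationary distribution on {0, 1, 2, 3}

    def step(i):
        """One Metropolis step with a +/-1 proposal; reversible with respect to pi."""
        j = i + rng.choice([-1, 1])
        if j < 0 or j > 3:
            return i                                   # proposal leaves the state space: stay put
        return j if rng.random() < min(1.0, pi[j] / pi[i]) else i

    W = np.array([1.0, -2.0, 0.5, 3.0])                # an arbitrary function W on the state space

    def F(x, y):                                       # anti-symmetric: F(x, y) = -F(y, x)
        return x * y ** 2 - x ** 2 * y

    reps = 200_000
    x0 = rng.choice(4, size=reps, p=pi)                # X_0 drawn from pi
    x1 = np.array([step(i) for i in x0])               # X_1 is one step of the chain from X_0
    print(np.mean(F(W[x0], W[x1])))                    # approximately 0, as Propositions 1.1 and 1.2 predict

The point of the sketch is only that reversibility plus stationarity of X_0 is what makes the pair (W(X_0), W(X_1)) exchangeable; starting the chain from any other distribution would, in general, destroy this property.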
Most exchangeable pairs found in the Stein's method literature are constructed (sometimes implicitly) through Proposition 1.2. A notable example where this is not the case can be found in [38]. In [68], it is observed that ifW is distributed asW 0 andWW 0 2f1; 0; 1g, then (W;W 0 ) is exchangeable (this is equivalent to the fact that any discrete time birth death Markov chain is reversible with respect to its stationary distribution, subject to existence). 1.2 Summary In Chapter 2 we give necessary background and notation including the fundamental idea be- hind Stein's method of exchangeable pairs, neat derivations of some known theorems, a new exchangeable pairs Poisson approximation theorem (Theorem 2.15), and an overview of other (non exchangeable pair) methods. 5 In Chapter 3 we examine the role of the exchangeable pair in the main normal and Poisson approximation theorems of Stein's method (respectively, Theorems 2.5 and 2.13 below). More specically, we examine how modifying the step size of the underlying Markov chain (which induces the exchangeable pair) aects the error term in the approximation acquired through Stein's method. The main result is that Markov chains with smaller steps size (that is, more local chains) produce better error terms. A minor variation of this chapter has been written into a self contained paper [73]. In Chapter 4 we develop a new translated Poisson approximation theorem (Theorem 4.7) and discuss applications of Theorem 4.8 proved above. The translated Poisson distribution is a candidate for a discrete analog of the normal distribution, so that it can be used as the target distribution when approximating an integer valued random variable. This strategy is employed in order to obtain approximation theorems in metrics stronger than those typically used in normal approximation. The main translated Poisson approximation theorem of Chapter 4 has the poten- tial to obtain nearly out the door error terms in applications where Stein's method of exchangeable pairs for normal approximation has previously been developed. The applications of Theorem 4.8 predominantly stem from results existing in the literature. These results were previously obtained through much technical and conceptual work, and we easily reproduce them here. Because of this ease of use, we believe Theorem 4.8 has a much wider scope than that illustrated in these examples; we hope to address this potential in future research. 6 Chapter 2 Background In this chapter, we will rst clarify necessary notation and denitions in Section 2.1. Afterwards, we will abstractly review the basic notions behind Stein's method in Section 2.2, and then, us- ing exchangeable pairs, concretely apply the ideas developed there to normal approximation in Section 2.3, and Poisson approximation in Section 2.4. In Section 2.5, we will discuss the funda- mental dierence between the main normal approximation theorem of Section 2.3 and the Poisson approximation theorem of Section 2.4, culminating in Theorem 2.15, a novel exchangeable pairs Poisson approximation theorem. Finally, in Section 2.6, we will discuss other approaches to the fundamental idea of Section 2.2. With the exception of Sections 2.1 and 2.5, the ideas expressed in this chapter can be found in [10, 22, 28, 78], the predominant introductory works for Stein's method. 2.1 Notation and Denitions Stein's method is used to obtain an error in the approximation of a random variable W by a target random variable Z. 
Implicit in this statement is a metric in which the approximation is taking place, useful properties of the random variable Z, and the ability to rigorously dene the random variable W . In this section, we will dene much of the terminology and notation used in the following which is pertinent to these three items. 7 There are many metrics commonly used to quantify the dierence between two random vari- ables [44], but here we only focus on three: Kolmogorov, Wasserstein, and total variation. The Kolmogorov distance is the metric used in the classical central limit theorem; that is, it quanties weak convergence. If the real random variables X and Y have distribution func- tions F and G respectively, then dene their Kolmogorov distance, denoted d K (X;Y ), to be sup x2R jF (x)G(x)j. The Wasserstein distance is stronger than Kolmogorov distance in the sense that if a sequence of random variables converges to some random variable in Wasserstein distance, then the convergence also occurs in Kolmogorov distance. For real random variables X and Y , dene their Wasserstein distance, denoted d W (X;Y ), to be sup h2H jEh(X)Eh(Y )j, whereH =fh : R! R :jh(x)h(y)jjxyjg (the set of Lipschitz functions with Lipschitz constant equal to one). The nal metric we will dene is total variation distance, a relatively strong metric primarily used for discrete random variables. LetB(R) denote the Borel sets of R. Then for real random variablesX andY , dene their total variation distance, denotedd TV (X;Y ), to be sup AB(R) jP(X2A)P(Y 2A)j. The target random variables we will be discussing here have the normal, Poisson, and trans- lated Poisson distributions. The normal and Poisson distribution are probably the most well known distributions in probability, the former having two parameters, its mean and variance 2 which dene the density f(x) = 1 p 2 exp (x) 2 2 2 ; x2R; and the latter having one parameter, its mean which dene the point mass function p(k) =e k k! ; k2f0; 1;:::g: 8 Properties of these two distributions can be found in any good introductory text on probability. The translated Poisson distribution is not as well known, but as the name indicates, it is a Poisson distribution translated (by an integer). We will dene and discuss the translated Poisson distribution in more detail in Chapter 4. In general, we will discuss the random variables to be approximated when they arise. However, we will frequently use the sum of independent Bernoulli random variables as a toy example to illustrate the theory. A Bernoulli random variable with parameter p is equal to one with probability p and zero with probability 1p (thus 0 p 1). If X is a sum of n independent Bernoulli random variables, all having parameter p, then we say X has the binomial distribution with parameters n and p. Finally, we state some notation that frequently appears in the following. For an event B, dene I fBg to be one if B occurs, and zero otherwise. Denote the conditional expectation of Y with respect to a random variable W by either E W Y or E[YjW ] and for an event B, dene P(BjW ) =P W (B) =E W I fBg . 2.2 The Fundamental Idea As discussed in Chapter 1, the main purpose of Stein's method is to quantify the dierence between a random variable W and some target random variable Z about which much is known (such as a Poisson or normal random variable). 
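As a point of reference before continuing (an aside of my own, not part of the text), the metrics of Section 2.1 can be computed directly for the toy binomial example introduced there, which gives something concrete to compare the forthcoming Stein bounds against. The sketch below assumes numpy and scipy are available; the parameter choices are illustrative.

    import numpy as np
    from scipy import stats

    n, p = 50, 0.1
    lam = n * p
    k = np.arange(n + 1)

    # Total variation distance between Bin(n, p) and Poisson(np):
    # sup_A |P(X in A) - P(Y in A)| = (1/2) sum_k |P(X = k) - P(Y = k)|.
    diff = np.abs(stats.binom.pmf(k, n, p) - stats.poisson.pmf(k, lam)).sum()
    d_tv = 0.5 * (diff + stats.poisson.sf(n, lam))     # tail k > n, where the binomial puts no mass
    print("d_TV(Bin(n,p), Poi(np)) =", d_tv)

    # Kolmogorov distance between the standardized binomial W and a standard normal Z:
    # the supremum of |F - G| is attained at the atoms of W or at their left limits.
    z = (k - n * p) / np.sqrt(n * p * (1 - p))
    Fk = stats.binom.cdf(k, n, p)
    G = stats.norm.cdf(z)
    d_k = max(np.abs(Fk - G).max(), np.abs(Fk - stats.binom.pmf(k, n, p) - G).max())
    print("d_K(W, Z) =", d_k)
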
In order to achieve this end, we must rst dene a measure of distance between two random variables; we will use metrics that can be expressed as d H (W;Z) = sup h2H jEh(W )Eh(Z)j: (2.1) Table 2.1 indicates that this formulation is not too restrictive. 9 Metric Indexing Set Kolmogorov H :=fh(x) =I fx<x0g :x 0 2Rg Wasserstein H :=fh :jh(x)h(y)jjxyjg Total Variation H :=fh(x) =I fx2Bg :B2B(R)g Table 2.1: Representation of Common Probability Metrics The following denition is at the kernel of Stein's method, and will be used to recast the metrics expressible as (2.1) in a more useful representation. Denition. The characterizing operator of a random variable Z is an operatorA such that, for a specied class of functionsG,EAf(Y ) = 0 for all f inG if and only if Y d =Z. Now, notice that ifA is a characterizing operator of Z, thenAf(x) =h(x)Eh(Z) for some functionh (becauseAf2f1g ? in anL 2 sense). This point of view suggests that if the functional equation Af h (x) =h(x)Eh(Z) (2.2) can be solved in f h for all h2H (the familyH depending on the metric being used), then d H (W;Z) = sup h2H jEAf h (W )j: (2.3) It is worth emphasizing here that although the main problem has been rephrased in a somewhat convoluted manner, this extra structure will yield results which do not appear possible to obtain directly from the starting point. To summarize, in order to apply the machinery developed here, we must 1. Dene the characterizing operatorA for the target random variable Z, 2. Solve the functional equation (2.2) and determine properties of the solution f h , 3. Estimate the right hand side of (2.3). 10 Items 1 and 2 must be done only once for any given distribution, where as Item 3 may be approached through dierent ad hoc and standard methods. In the following two sections, we will develop items 1 and 2 for the normal and Poisson distribution and then handle Item 3 using exchangeable pairs. In the nal section of this chapter we will discuss other approaches to treating Item 3. 2.3 Normal Approximation In order to implement the program described in Section 2.2, we must rst nd a characterizing operator for the normal distribution. Dene the operatorA by Af(x) = 2 f 0 (x) (x)f(x): (2.4) Proposition 2.1. Z is a normal random variable with mean and variance 2 if and only if EAf(Z) = 0, for all dierentiable f such that Ejf 0 (Z)j exists. Proof. We rst prove that for Z a normal random variable, we have EAf(Z) = 0. Assume for simplicity that EZ = 0 and EZ 2 = 1, so thatAf(Z) = f 0 (Z)Zf(Z). Since Z has density (t) = (2) 1=2 e t 2 =2 we have Ef 0 (Z) = Z 1 1 f 0 (t)(t)dt = Z 1 0 f 0 (t) Z 1 t z(z)dz dt Z 0 1 f 0 (t) Z t 1 z(z)dz dt = Z 1 0 z(z) Z z 0 f 0 (t)dt dz Z 0 1 z(z) Z 0 z f 0 (t)dt dz = Z 1 1 [f(z)f(0)]z(z)dz =EZf(Z): 11 To nish the proof, notice that for a random variable Y such that Ef 0 (Y ) =EYf(Y ) for all f such that Ejf 0 (Y )j is nite, we have that EY = 0 (choose f(y) = 1). Taking f(y) = y in the equation impliesEY 2 = 1, and continuing in this manner shows that all moments of Y are those of a standard normal random variable, which proves the proposition. Remark. In general, a good guide for determining a characterizing operatorA p for a continuous distribution X with dierentiable density function p(x) is A p f(x) =f 0 (x) + p 0 (x) p(x) f(x); as this will usually implyEA p f(X) = 0. Continuing to Item 2 in the program described in Section 2.2, we have the following. Proposition 2.2. 
The solution ofAf h (x) =h(x)Eh(Z) is given by f h (x) = 2 exp[(x) 2 =(2 2 )] Z x 1 [h(t)Eh(Z)] exp[(t) 2 =(2 2 )]dt: (2.5) Proof. With the formula for f h (x) in hand, the proposition is proved by simple substitution and verication. The next lemma collects some properties of f h (x) given by (2.5) in the case where = 0 and 2 = 1 (a case to which we will eventually restrict ourselves). The proof is a detailed technical analysis found, for example, in the appendix of [28]. Lemma 2.3. [28] 1. For any absolutely continuous function h :R!R, the solutionf h given by (2.5) in the case where = 0 and 2 = 1 satises jjf h jj min n p =2jjhEh(Z)jj; 2jjh 0 jj o ; 12 jjf 0 h jj minf2jjhEh(Z)jj; 4jjh 0 jjg; jjf 00 h jj 2jjh 0 jj: 2. For h(x) = I fx<zg , the solution f h given by (2.5) in the case where = 0 and 2 = 1 satises 0<f h p 2=4; jjf 0 h jj 1: Now that we have handled Items 1 and 2 described in Section 2.2, we can attend to Item 3: using exchangeable pairs to estimatejEAf h (W )j for some random variable W . The main purpose of exchangeable pairs (and some of the other approaches to handling Item 3; see Section 2.6) is to simplifyEAf h (W ) by subtracting a small (or zero) quantityEA 0 f(W ) for some operatorA 0 closely resemblingA. The operatorA 0 is the \pseudo characterizing operator" mentioned in Section 1.1; for the sake of brevity we will refer to it as a null operator of W . The next proposition builds on these ideas. Lemma 2.4. Let (W;W 0 ) an exchangeable pair such that EW = , and E W [W 0 ] = (1 a)(W) for some 0<a 1. Then for all f such that the expectations exist, we have E(W)f(W ) = E[(W 0 W )(f(W 0 )f(W ))] 2a : (2.6) 13 Proof. By exchangeability, we have 0 =E [(W 0 W )(f(W 0 ) +f(W ))] (2.7) =E [(W 0 W )(f(W 0 )f(W ))] + 2E [(W 0 W )f(W )] =E [(W 0 W )(f(W 0 )f(W ))] 2aE [(W)f(W )]: The last equality follows by conditioning, and after rearranging, this is (2.6). Remark. The conditionE W [W 0 ] = (1a)(W) may seemed constraining, but is (naturally) satised in many examples (for more discussion on this relation see the remarks at the end of Section 3.1). Also, the condition can be relaxed to allow a remainder, but for the sake of simplicity we will not discuss this in detail here (see Theorem 4.1). Dening a null operator of W A 0 f(x) = (x)f(x) E W=x [(W 0 W )(f(W 0 )f(W ))] 2a ; we have from (2.3), (2.4, and Lemma 2.4, d H (W;Z) = sup h2H jEAf h (W )j = sup h2H jEAf h (W )EA 0 f h (W )j = sup h2H 2 E f 0 h (W ) E W [(W 0 W )(f h (W 0 )f h (W ))] 2a 2 : (2.8) 14 In order to simplify (2.8), a second order Taylor approximation could be applied to f h (W 0 ) f h (W ). However, in some cases of interest such as the Kolmogorov metric, f 00 h does not have decent properties, so that for the sake of generality, we write d H (W;Z) sup h2H E f 0 h (W ) 2 E W [(W 0 W ) 2 ] 2a (2.9) + sup h2H E (W 0 W )(f h (W 0 )f h (W )f 0 h (W )(W 0 W )) 2a : = sup h2H E f 0 h (W ) 2 E W [(W 0 W ) 2 ] 2a (2.10) + sup h2H E 2 4 (W 0 W ) R W 0 W 0 f 0 h (W +t)f 0 h (W )dt 2a 3 5 These calculations lead to the following theorem. Theorem 2.5. [78] Let (W;W 0 ) an exchangeable pair of real random variables with the property thatE W [W 0 ] = (1a)W with 0<a 1. Also, letEW = 0,EW 2 = 1, and Z a standard normal random variable. 1. For any familyH of absolutely continuous functions with bounded rst derivatives, d H (W;Z) sup h2H jjf 0 h jjE 1 E W [(W 0 W ) 2 ] 2a + sup h2H jjf 00 h jj EjW 0 Wj 3 4a : (2.11) 2. 
For all x 0 in R, P(W <x 0 ) 1 p 2 Z x0 1 e x 2 2 dx p Var(E W [(W 0 W ) 2 ]) a + E(W 0 W ) 4 a 1=4 : (2.12) 15 Proof. The rst assertion follows from (2.9) coupled with the Taylor expansion inequality jf h (W 0 )f h (W )f 0 h (W )(W 0 W )jjjf 00 h jj (W 0 W ) 2 2 : (2.13) To prove the second assertion, we apply the rst assertion to \smoothed" indicators, and then pass to the limit at the correct rate. In detail, dene h(x) =I fx<zg ; and h (x) = 8 > > > > > > < > > > > > > : 1 xz; (z +x)= z<xz +; 0 x>z +: The quantity we would like to bound (independent of z) is given byjEh(W )Eh(Z)j. From this point we have Eh(W )Eh(Z) [Eh (W )Eh (Z)] + [Eh (Z)Eh(Z)] (2.14) 2E 1 E W [(W 0 W ) 2 ] 2a + EjW 0 Wj 3 2a + p 2 ; (2.15) where we have used the rst assertion of the theorem, Lemma 2.3, and the fact that a normal random variable has a density bounded by (2) 1=2 . Choosing an appropriate and applying a similar inequality to Eh(Z)Eh(W ), nearly completes the theorem. The nal step is to apply the Cauchy-Schwarz inequality to E 1 E W [(W 0 W) 2 ] 2a andEjW 0 Wj 3 ; and to notice, from the 16 following calculation using exchangeability and the constraints on W in the hypotheses of the theorem, thatE(W 0 W ) 2 = 2a. E(W 0 W ) 2 = 2EW 2 2EW (E W W 0 ) = [2 2(1a)]EW 2 = 2a: (2.16) Remarks. 1. In fact, this argument only yields a constant of p 2(a) 1=4 in the second term of (2.12). The better constant derives from an averaging of the Taylor remainder in (2.13) using the exchangeability of (W;W 0 ); see [78] for the argument. We present the more straightforward approach here for the sake of clarity. 2. From the rst assertion of the theorem, proving the second assertion is only a matter of applying the usual arguments to move from smooth functions to indicators of half lines (as we have done). Unfortunately, this reasoning does not typically give correct rates, as we shall see in the sequel. Next we will present a toy example to clearly illustrate the types of constructions and compu- tations that are needed in order to apply Theorem 2.5. Example 2.6 (Normal Approximation of the Binomial Distribution). Let (X i ) n i=1 a vector of independent Bernoulli random variables each with parameter p, and Y = P n i=1 X i , a binomial random variable with parameters n and p. Dene the random variable W = Y where =np, the mean of Y , and 2 = np(1p), the variance of Y . It is well known that for xed p, as n converges to innity, W converges in distribution to a standard normal random variable. In fact, the Berry-Esseen Theorem states that for some universal constant c, the rate of convergence to normality will be cn 1=2 . We will use Theorem 2.5 to obtain the asymptotically suboptimal rate 17 of n 1=4 for Kolmogorov distance, and the asymptotically optimal rate of n 1=2 in Wasserstein distance. Because W has been dened to have zero mean and unit variance, we can apply the theorem after constructing a random variable W 0 such that (W;W 0 ) is exchangeable and the linearity conditionE W W 0 = (1a)W is satised for some 0<a 1. As discussed in Section 1.1, we dene a reversible Markov chain on the underlying vector of independent Bernoulli random variables which will induce an exchangeable pair. The chain follows the rule of choosing a coordinate at random and resampling independently. More formally, let I be a uniform random variable on the setf1; 2;:::;ng and dene W 0 = Y 0 , where Y 0 = Y X I +X 0 I and (X 0 i ) n i=1 is a vector of independent Bernoulli random variables, each with parameterp. 
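As an aside of my own (not part of the dissertation), the construction just described can be sanity-checked by simulation before doing any analytic work: resampling a uniformly chosen coordinate and regressing W' - W on W recovers the linearity condition, with a slope consistent with a = 1/n as derived analytically below. The sketch assumes numpy, and the parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps = 20, 0.3, 200_000
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))

    X = rng.binomial(1, p, size=(reps, n))             # independent Bernoulli(p) coordinates
    I = rng.integers(0, n, size=reps)                  # coordinate chosen uniformly at random
    Xnew = rng.binomial(1, p, size=reps)               # independent replacement value

    Y = X.sum(axis=1)
    Yprime = Y - X[np.arange(reps), I] + Xnew          # one step of the resampling chain
    W, Wprime = (Y - mu) / sigma, (Yprime - mu) / sigma

    # The linearity condition E^W[W' - W] = -aW should hold with a = 1/n,
    # so regressing W' - W on W should give a slope close to -1/n.
    slope = np.polyfit(W, Wprime - W, 1)[0]
    print("estimated -a:", slope, "   exact -1/n:", -1.0 / n)
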
From the computations presented below (in particular the transition probabilities), it is clear that this chain is reversible with respect to the binomial distribution, so that by Proposition 1.2, (W;W 0 ) is exchangeable. To apply Theorem 2.5, we must also check thatE W W 0 = (1a)W for some 0<a 1. When working with expectations of functions of W and W 0 (such as in the error term from Theorem 2.5 and the linearity condition above), it is easier to work directly with the random variableY and then pass toW by substitution. More specically, in order to obtainE W f(W 0 W ) for some function f (such as f(x) = x or f(x) =jxj k ), we rst compute E Y f( Y 0 Y ), and then substitute Y =W +. The rst step in this program is to obtain the conditional probabilities P Y (Y 0 =Y + 1) = nY n p; P Y (Y 0 =Y 1) = Y n (1p); P Y (Y 0 =Y +k) = 0;jkj 2: 18 To see these equalities, notice, for example, that in order forY 0 =Y +1,X I must be zero (occurring with probability (nY )=n) and X 0 I must be one (occurring independently with probability p). From this point, we have E Y [ 1 (Y 0 Y )] = 1 P Y (Y 0 =Y + 1)P Y (Y 0 =Y 1) = Y n : SubstitutingY =W, we obtainE W [W 0 W ] =W=n, so that Theorem 2.5 can be applied with a = 1=n. Now, to compute the error terms from Theorem 2.5, we follow the same strategy as above. E Y jY 0 Yj k k = 1 k p + (1 2p) Y n (2.17) = 1 k 2p(1p) +(1 2p) W n ; so that Var(E W [(W 0 W ) 2 ] = 1 2p n 2 ; and EjW 0 Wj k = 2p(1p) k : 19 Substituting all of this into Theorem 2.5, we obtain Proposition 2.7. Let W be dened as above and Z a standard normal random variable. 1. For d W the Wasserstein distance, we have d W (W;Z) 2j1 2pj + 1 3 : 2. For d K the Kolmogorov distance, we have d K (W;Z) j1 2pj + 2 2 1=4 : Proof. The bound in the Wasserstein metric follows after applying the Cauchy-Schwarz inequality to the rst error term from the rst assertion of Theorem 2.5, and then using the computations above and the bounds of Lemma 2.3. The bound in the Kolmogorov metric follows directly from Theorem 2.5 and the computations above. Even in this simple case, the Kolmogorov bounds are of the order n 1=4 , which is suboptimal. In order to rectify this situation, one needs to go back to (2.10) in the proof of Theorem 2.5 and rene the estimate of E 2 4 (W 0 W ) R W 0 W 0 f 0 h (W +t)f 0 h (W )dt 2a 3 5 (2.18) using the fact that f 0 h (x) = xf h (x) +h(x)Eh(Z), for Z a standard normal random variable. This yields that (2.18) equals E 2 4 (W 0 W ) R W 0 W 0 (W +t)f h (W +t)Wf 0 h (W )dt 2a 3 5 (2.19) +E 2 4 (W 0 W ) R W 0 W 0 h(W +t)h(W )dt 2a 3 5 : (2.20) 20 For the Kolmogorov metric (h equal to indicators of half lines), the term (2.19) can be bounded directly using properties of f h yielding a term of correct order in classical applications. The term (2.20) is more dicult to handle and can be the main technical diculty in non-trivial applications. 1 We will not go into more detail here as the arguments beyond the terms (2.19) and (2.20) are not pertinent to what follows. However, we end this section with what is currently the best out the door exchangeable pairs normal approximation theorem that gives correct rates. Theorem 2.8. [76] Let (W;W 0 ) an exchangeable pair of real random variables with the property that E W [W 0 ] = (1a)W with 0<a 1. Also, let EW = 0, EW 2 = 1, M any positive number, and Z a standard normal random variable. Then for all x 0 in R, P(W <x 0 ) 1 p 2 Z x0 1 e x 2 2 dx p Var(E W [(W 0 W ) 2 ]) a + M 3 2a + 3M 2 + E[(W 0 W ) 2 I fjW 0 Wj>Mg ] 2a : (2.21) Remark. 
In the case wherejW 0 Wj M almost surely, the last term of (2.21) is zero, so that the theorem simplies neatly. To illustrate, if W is the binomial distribution normalized to have mean zero and variance one, choosingM = (np(1p)) 1=2 , and using the exchangeable pair dened in Example 2.6, Theorem 2.8 yields the correct rate of order n 1=2 . Theorems 2.5 and 2.8 have been used to successfully obtain error bounds in central limit theorems for many applications. The most interesting have been the algebraic examples of [42, 43, 76], and the anti-voter model and U-statistic results of [68]. 1 As the term (2.20) contains expectations of the test function h under study, some concentration inequalities using properties of W can be deduced to handle it in some situations; we discuss this brie y in Subsection 2.6.4. 21 2.4 Poisson Approximation In this section we will implement the program described in Section 2.2 for the Poisson distribution. The rst item of the program is to obtain a characterizing operator for the Poisson distribution. Dene the operatorA by Af(j) =f(j + 1)jf(j): (2.22) Proposition 2.9. Z is a Poisson random variable with mean if and only if EAf(Z) = 0, for all bounded functions f dened on the integers. Proof. We rst prove that for Z a Poisson random variable, we have EAf(Z) = 0. Since P(Z = j) =e j j! for j a non-negative integer, we have Ef(Z + 1) =e 1 X i=0 j j! f(j + 1) = 1 e 1 X i=0 j+1 j + 1! (j + 1)f(j + 1) = 1 EZf(Z): To nish the proof, notice that for a random variable Y such that Ef(Y + 1) =EYf(Y ) for all bounded functions f dened on the integers, we have that P(Y = k 1) = 1 kP(Y = k) (choose f(j) =I fj=kg ). This recursion relation implies, for all k 1, P(Y =k) =P(Y = 0) k k! : The proposition is complete after noting that the probabilities must sum to one. 22 Remark. In general, a good guide for determining a characterizing operatorA p for a discrete distribution X with point mass function p(j) is A p f(j) =f(j + 1) p(j 1) p(j) f(j); as this will implyEA p f(X) = 0. Continuing to Item 2 in the program described in Section 2.2, we must determine the solution of (2.2). We will restrict ourselves to the case where h(j) = I fj2Ag for A some subset of the non-negative integers, that is, to the case of total variation distance, denoted d TV . Proposition 2.10. The solution ofAf A (j) =I fj2Ag P(Z2A) is given by f A (j) = (j 1)! j j1 X k=0 k k! I fk2Ag P(Z2A) ; j 1: (2.23) Proof. With the formula for f A (j) in hand, the proposition is proved by substitution and veri- cation. The next lemma collects some properties of the solution of (2.2) in the case where h(j) = I fj2Ag . The proof is a detailed technical analysis found, for example, in [10]. For bounds of general test functions (not just indicators), see the recent work [31]. Lemma 2.11. [10] For AZ + , f A dened as above, and f A (j) =f A (j + 1)f A (j), jjf A (j)jj 1 (1e ) min(1; 1 ); jjf A (j)jj min(1; 1=2 ): Now that we have handled Items 1 and 2 described in Section 2.2, we can attend to Item 3; using exchangeable pairs to estimatejEAf A (W )j for some random variable W . 23 Analogous to the case of normal approximation, we will simplify EAf A (W ) by subtracting a small (or zero) quantity EA 0 f(W ) forA 0 a null operator of W , closely resemblingA. The next proposition makes this idea more concrete. Lemma 2.12. Let W a random variable supported on the non-negative integers and (W;W 0 ) an exchangeable pair. 
Then for all bounded f dened on the non-negative integers, E I fW 0 =W+1g f(W + 1) =E I fW 0 =W1g f(W ) : Proof. By exchangeability, we have 0 =E I fW 0 =W+1g f(W 0 )I fW 0 =W1g f(W ) (2.24) =E I fW 0 =W+1g f(W + 1)I fW 0 =W1g f(W ) : Dening the null operator of W , A 0 f(j) =c P W=j (W 0 =j + 1)f(j + 1)P W=j (W 0 =j 1)f(j) ; where c is any constant, we have from (2.3), (2.22), and Lemma 2.12, d TV (W;Z) = sup AZ + jEAf A (W )j = sup AZ + jEAf A (W )EA 0 f A (W )j (2.25) = sup AZ + E f A (W + 1)(cP W (W 0 =W + 1)) E f A (W )(WcP W (W 0 =W 1)) : 24 The calculations above lead to the following theorem. Theorem 2.13. [22] Let W a non-negative integer valued random variable such that E(W ) =, and let (W;W 0 ) an exchangeable pair. LetY a random vector such that W = G(Y) for some function G, c be any constant, and Z denote a Poisson random variable with mean . Then for C = min(1; 1=2 ) and the function f A as dened previously, d TV (W;Z) = sup AZ + E f A (W + 1)(cP Y (W 0 =W + 1)) E f A (W )(WcP Y (W 0 =W 1)) C E cP Y (W 0 =W + 1) +E WcP Y (W 0 =W 1) : Proof. The equality is a rewriting (with the conditioning on the more general random vector Y) of the calculations of (2.25), and the inequality follows from Lemma 2.11. Remark. We write the theorem in terms conditioning on the random vector Y to add greater exibility. For example, if W is a function of random variables, it can be convenient to condition on the vector of those variables. We will present a toy example to clearly illustrate the types of constructions and computations that are needed in order to apply Theorem 2.13. Example 2.14 (Poisson Approximation of the binomial Distribution). LetW aBin(n;p) random variable and let =np, the mean ofY . It is well known that for xed, asn converges to innity, W converges in to a Poisson random variable with mean . In fact, using a dierent variation of Stein's method, for Z a Poisson random variable with mean , [10] obtains the optimal (up to constant) bound d TV (W;Z) minfp;np 2 g: (2.26) 25 We will use Theorem 2.13 to prove the inequality (2.26). The exchangeable pair we will use is the same as in Example 2.6 (W here is denotedY there), with the conditional probabilities the same as well. Choosing c to be n and applying the rst error term from Theorem 2.13, we obtain d TV (W;X) = sup AZ + jE [(f A (W + 1)f A (W ))(pW )]j minf1; 1 gnp 2 ; where the inequality is by Lemma 2.11. This is the desired result. 2.5 Discussion In this section, we will discuss the reason that the error terms from Theorem 2.5 dier so sub- stantially from those of Theorem 2.13. In identifying the seminal distinction between the two approaches we will be able to prove an exchangeable pairs Poisson approximation theorem much in the spirit of Theorem 2.5. The two obvious dierences between normal and Poisson approximation using exchangeable pairs as described above are in the characterizing operators of the target distributions and in the null operators chosen for W . These two issues are related in the sense that the null operator is chosen to match terms of the characterizing operator. 
To elaborate, consider the characterizing operator for the Poisson distribution with mean given by (2.22), A P f(j) =f(j + 1)jf(j); (2.27) 26 and compare this to the characterizing operator of the normal distribution with mean and variance 2 given by (2.4), A N f(x) = 2 f 0 (x) (x)f(x): (2.28) Recall from this point, the next step in approximating a random variableW by the target distribu- tion is to deneA 0 , a null operator ofW , such thatEA 0 f(W ) is small or zero. In the case where the normal distribution is the target, the null operator is dened to cancel the term (x)f(x) in (2.28) and for the remaining to have a neat Taylor expansion. When the Poisson distribution is the target, the null operator is dened to be compared to each term of (2.27) individually. In order to use the null operator of W for normal approximation in Poisson approximation, it would be ideal to rewrite the Poisson characterizing operator to resemble the normal character- izing operator. This is in fact possible. Notice that the characterizing operator for the Poisson distribution with mean (2.27) can be rewritten as A P f(j) =f(j) (j)f(j); which contains terms similar to those of (2.28). This observation leads to a Poisson approximation theorem (of which a slightly more technical version will be discussed in greater detail in Section 4.1) using the null operator associated to the normal distribution A 0 f(x) = (x)f(x) E W=x [(W 0 x)(f(W 0 )f(x))] 2a : Theorem 2.15. Let W a random variable with non-negative integer support with E(W ) = , Var(W ) = 2 and (W;W 0 ) an exchangeable pair of real random variables such that E W (W 0 W ) =a(W) 27 with 0<a 1. Then with Z a Poisson random variable with mean and c any constant, d TV (W;Z) p Var(E W [(W 0 W )(W 0 W + 1)]) 2a + 1 2 (2.29) + E[jW 0 Wj 3 jW 0 Wj] 2a : (2.30) Proof. Let f A the Stein solution for the Poisson characterizing operator dened by (2.23). Then by Lemma 2.4, we have d TV (W;Z) = sup AZ + jEA P f A (W )j = sup AZ + jEA P f A (W )EA 0 f A (W )j = sup AZ + E f A (W ) E W [(W 0 W )(f A (W 0 )f A (W ))] 2a : Similar to the inequality (2.9), we write d TV (W;Z) sup AZ + E f A (W ) E W [(W 0 W )(W 0 W + 1)] 2a (2.31) + sup AZ + E E W [(W 0 W )(f A (W 0 )f A (W ) f A (W )(W 0 W + 1))] 2a : (2.32) The term (2.31) corresponds to (2.29) in the error; to see this, apply Lemma 2.11 and the triangle inequality to yield E f A (W ) (W 0 W )(W 0 W + 1) 2a E 2a 2 E W [(W 0 W )(W 0 W + 1)] 2a + 1 2 : (2.33) 28 The fact that (2.33) is bounded above by (2.29) follows from the equality E(W 0 W )(W 0 W + 1) = 2a 2 ; (which can be shown along the lines of (2.16)), and then an application of the Cauchy-Schwarz inequality. In order to simplify the proof that (2.32) is bounded above by (2.30), dene D =W 0 W and =f1; 0; 1g. Then we have E D(f A (W 0 )f A (W ) f A (W )(D + 1)) E D(f A (W 0 )f A (W ) f A (W )(D + 1))I fD2g (2.34) + E D(f A (W 0 )f A (W ) f A (W )(D + 1))I fD62g : (2.35) We will rst show that (2.34) is zero and then show (2.35) is bounded above by (2.30). Thus, E D(f A (W 0 )f A (W ) f A (W )(D + 1))I fD2g =E I fD=1g f A (W 0 )I fD=1g f A (W ) = 0: The nal equality follows by the exchangeability of (W;W 0 ). Finally, we claim for D62 , jf A (W 0 )f A (W ) f A (W )(D + 1)j (D 2 1)=; 29 which will prove the theorem. If D 2, then f A (W 0 )f A (W ) f A (W )(D + 1) =jf A (W 0 )f A (W + 1) f A (W )Dj D1 X j=1 f A (W +j) +jf A (W )Dj (jD 1j +jDj)= (D 2 1)=: The penultimate inequality follows by applying Lemma 2.11. The proof of the claim for D2 is nearly identical. 
Remarks. 1. IfjW 0 Wj 1, then the conclusion of Theorem 2.15 can be restated as d TV (W;Z) p Var[P W (W 0 =W + 1)] a + 1 2 ; (2.36) which implies the main result of [70]. In fact, in this restricted case, (2.36) can be recovered directly from the rst error expression of Theorem 2.13 rewritten slightly. In this way, both Theorems 2.13 and 2.15 can be seen as a generalization of the main result of [70]. 2. The rst term of the error (2.29) from Theorem 2.15 purposefully deviates from the rst term of (2.12) from Theorem 2.5. This dierence not only yields the convenient formulation of (2.30) as described in the previous remark, but is also the critical factor in the utility of Theorem 2.15. That is, it is not hard to see (using the equality (2.17)), that for the Y the binomial distribution and Y 0 as in Example 2.6, p Var(E[(Y 0 Y ) 2 jY ]) 2a = (1 2p)(1p) 1=2 2 1=2 ; which is of constant order in n for xed . 30 2.6 Other Approaches In this section, we will discuss other approaches to Item 3 from the program described in Section 2.2. As each of these approaches have extensive literature, we will be relatively brief as to what we say here. 2.6.1 Size Biasing Size biasing for normal approximation was rst introduced in [8] and expanded to multivariate normal approximation in [52]. Recently, limit theorems have been proved using size biasing in random graph statistics [45], geometric coverage models [49], and urn models [63]. For size biasing in relation to Poisson approximation, the book [10] has much of what is currently known. We highlight here points relevant to the previous discussion. Denition. LetW be a non-negative random variable with mean . CallW s theW -size biased random variable if EWf(W ) =Ef(W s ); (2.37) for all functions f for which the expectations exist. Equivalently, W s is the W -sized biased random variable if dF s (x) =x dF (x) ; where F and F s are distribution functions of W and W s , respectively (this denition implies existence of W s ). From this denition, we build a null operatorA 0 for W in the spirit of (2.27) and (2.28), the characterizing operators for the Poisson and normal distribution, respectively. Dene A 0 f(W ) = (x)f(x)E W=x [f(W s )f(x)]; 31 and notice that from the denition of W s ,EA 0 f(W ) = 0. Thus, we have forZ a normal random variable with mean and variance 2 , d H (W;Z) = sup h2H jEA N f h (W )j = sup h2H jEA N f h (W )EA 0 f h (W )j = sup h2H 2 E h f 0 h (W ) 2 E W [f h (W s )f h (W )] i : This calculation and an analysis similar to that of normal approximation with exchangeable pairs yields a theorem in the spirit of Theorems 2.5 and 2.8 [45]). For X a Poisson random variable with mean , we have the string of equalities analogous to those of Section 2.5. d TV (W;X) = sup AZ + jEA P f A (W )j = sup AZ + jEA P f A (W )EA 0 f A (W )j = sup AZ + E f A (W )E W [f(W s )f(W )] : = sup AZ + jE [f A (W + 1)f A (W s )]j: Invoking Lemma 2.11, this implies the clean result d TV (W;X) minf1;gEjW + 1W s j: There is very much that can be done with this simple inequality; it is one of the central identities of [10]. 32 2.6.2 Zero Biasing Zero biasing was rst introduced in [50] for normal approximation and has been extended to multivariate normal approximation in [51]. Recent applications can be found in [45] and [46], and a neat theoretical result in [48]. We highlight here points relevant to the previous discussion. Denition. Let W a random variable with mean zero and variance one. 
Call W z the W -zero biased random variable if EWf(W ) =Ef 0 (W z ); for all dierentiable functions f for which the expectations exist. It can be shown that the zero bias distribution always exists and has amenable properties [50]. From this denition, we are able to build a null operatorA 0 for W in the spirit of (2.28), the characterizing operator for the normal distribution. Dene A 0 f(x) =xf(x)E W=x f 0 (W z ); and notice that from the denition of W z ,EA 0 f(W ) = 0. Thus, we have forZ a normal random variable with mean zero and variance one, d H (W;Z) = sup h2H jEA N f h (W )j = sup h2H jEA N f h (W ) +EA 0 f h (W )j = sup h2H jE [f 0 h (W )f 0 (W z )]j: A Taylor expansion yields a result in the spirit of the rst assertion of Theorem 2.5. 33 Theorem 2.16. [45] For any familyH of absolutely continuous functions with bounded rst derivatives, d H (W;Z) sup h2H jjf 00 h jjEjWW z j: 2.6.3 Dependency Graphs If W = P n i=1 X i where each X i is independent, then estimates ofEA N f(W ) andEA P f(W ) can be made directly. For Poisson approximation for example, if eachX i is aBer(p i ) random variable, then using the notation from Section 2.4, we have the following. jEA P f A (W )j = E " n X i=1 p i (f A (W + 1)f A (WX i + 1) # n X i=1 p 2 i jjf A jj min 8 < : 1; n X i=1 p i ! 1 9 = ; n X i=1 p 2 i : This result generalizes (2.26) from the example of Poisson approximation of the binomial dis- tribution using exchangeable pairs (the more general result could be obtained there with minor modication). First utilized in [25] for Poisson approximation and [78] for normal approximation, the depen- dency graph approach generalizes the type of computation above to the case where the random variable to be approximated is a sum of locally dependent random variables. That is, for each i, there are small \neighborhoods" of indicesN i such that X i is independent (or \weakly" depen- dent) of X j for all j not inN i . For Poisson approximation, this approach is very powerful due to the many applications that have representations as sums of locally dependent indicator random variables; a thorough in- troduction can be found in [2, 3, 10]. For Normal approximation, dependency graphs are more 34 dicult to use, rst attempts [7, 8] provided suboptimal rates; a modern exposition can be found in [27] and references therein. 2.6.4 Ad Hoc Methods In this section we mention some other approaches to Item 3 of Section 2.2. The criteria for being termed \ad hoc" is that these methods usually involve more thought intensive constructions or have not been rened to a point with nice out the door theorems. These methods require detailed information about the random variable W far beyond a coupling construction and some moment information as in the previous sections. The rst ad hoc method we will mention for handling Item 3 from the program described in Section 2.2, is Bolthausen's inductive method [14] for normal approximation. In [14], the author obtains asymptotically correct error rates for sums of i.i.d. random variables and also in the so called combinatorial central limit theorem. His method was later used to obtain rates in the convergence of random character ratios to the normal distribution [40]. The main idea behind Bolthausen's method is as follows. Let W n a sequence of random variables (for example the sum ofn i.i.d. random variables) and let(n) =d K (W n ;Z) forZ a standard normal random variable. Also, let (n) = sup z2R jEh (W n )Eh (Z)j; where h is dened as in the proof of Theorem 2.5. 
Standard smoothing arguments of the form (2.14) imply (n) (n) + p 2 : (2.38) 35 Bolthausen's method uses an inequality on the Stein solution f h to obtain an inequality of the form (n)O(n 1=2 ) +c (n 1) p n ; (2.39) where c is some constant. After choosing n 1=2 , (2.38) and induction implies (n) = O(n 1=2 ). This argument is essentially the same idea as the inequality giving (2.15) in Theorem 2.5, the exchangeable pairs normal approximation theorem. However, an extra factor of n 1=2 is gained through the induction step. Although this method seems systematic, it is decidedly ad hoc as it can be very dicult to obtain an inequality of the form (2.39). 2 The second ad hoc method we will discuss is Stein's concentration inequality approach as communicated and elaborated on by Chen [54]. The concentration inequality approach has been used to prove multiple Berry-Esseen type normal approximation theorems for sums of independent random variables [26, 54], for sums of locally dependent random variables [27], and for nearly linear statistics of independent random variables [29]. A clear formulation can be found in [28]. Similar to other methods, the main idea is to obtain a null operatorA 0 of W , specically having the form A 0 f(x) =xf(x)E W=x Z 1 1 f 0 (W +t)K(t)dt; whereK(t) is a \concentration" function depending on random quantities (ideally with the prop- erty thatE R 1 1 K(t)dt = 1), and W is some random variable depending on W (usually W +Y 2 There is a connection between Bolthausen's method and zero biasing [47]; the main idea is that the zero bias coupling can be reductive in some sense, which is amenable to obtaining inequalities similar to (2.39). 36 where Y is some small random variable). Then if EA 0 f(W ) = 0 for all appropriate functions f, we obtain d H (W;Z) = sup h2H jEA N f h (W ) +EA 0 f h (W )j sup h2H E f 0 h (W ) 1 Z 1 1 K(t)dt (2.40) + sup h2H E Z 1 1 (f 0 h (W )f 0 h (W +t))K(t)dt : (2.41) This situation is familiar: by choosing an appropriateW andK(t) we can recover (2.10). 3 From this point, (2.41) can be rewritten in terms analogous to (2.19) and (2.20), the latter of which is sup h2H E Z 1 1 [h(W +t)h(W )]K(t)dt : K(t) should be dened in such a way as to be zero outside of a small interval, so that the term above is bounded by quantifying the dierence between W and W and then bounding the concentration of W on the non-zero part of K(t). 4 We conclude the discussion on this approach by mentioning that the same ideas can be exploited in a zero biasing framework [47]. The nal ad hoc method of Stein's method for distributional approximation we will discuss is the recent work of Chatterjee [20, 21]. The basic idea (rst exploited in [17]) is, for a given random variable W , to dene a random variable T such that EWf(W )Tf 0 (W ): (2.42) 3 IfK(t) = (2a) 1 (W 0 W ) I f0tW 0 Wg I fW 0 Wt<0g andW =W , we have R 1 1 K(t)dt = (W 0 W) 2 2a , so that (2.40) and (2.41) are equal to (2.10). 4 This approach will only be useful if h(W +t)h(W ) has a nice form; this is the case for h equal to half line indicators, but not for general Borel sets (that is, the Kolmogorov and total variation distance metrics respectively). 37 These denitions imply that d H (W;Z) sup h2H E f 0 (W )(1E W T ) ; whereZ is a standard normal random variable. The main diculty with this approach is dening T in a useful form. The examples of [17] are restrictive and the denition and properties of T are transferred to other quantities, such as the Fisher information of W . 
Chatterjee's main contribution is to construct T as in (2.42) in non-trivial examples where W is a function of independent random variables. This same idea can also be exploited in Poisson approximation [62]. For a given random variable W , dene the random variable T such that E(W)f(W )T f(W ): (2.43) These denitions imply that d TV (W;Z) sup AZ + E f(W )(E W T ) ; (2.44) where Z is a Poisson random variable with mean . Once again, the main diculty is usefully constructing the random variable T . The examples of [62] are simple (the binomial and negative binomial distributions) and the errors are written in terms of a discrete analog of Fisher informa- tion. Also, as mentioned in [70], if the hypotheses of Theorem 2.15 are satised and additionally jW 0 Wj 1 as in the remark following the theorem, T = a 1 I fW 0 =W+1g satises (2.43) with equality, and applying this to (2.44) recovers the result (2.36). 38 Chapter 3 The Eect of the Exchangeable Pair In this chapter, we will examine how modifying the step size of the underlying Markov chain in a natural way aects the error term acquired through Stein's method of exchangeable pairs for normal and Poisson approximation in some of the Theorems discussed previously. In the case where the underlying Markov chain is ergodic, the step size does not necessarily aect the rate of convergence to stationarity in a monotone way. However, the rate of convergence is related to the eigenvalues of the Markov chain, and in the examples associated to the normal distribution we are able to express the bound on the error in terms of the eigenvalues. It will be obvious in the sequel that modifying the exchangeable pair has a profound eect on the error term. Most notably, in the theorems we use, Markov chains with larger steps require more computational work, and in the case of Poisson approximation, higher moment information. For the examples presented here, the chains that allow for the easiest computation always yield the best bound. For other examples it is dicult to compute the error term for any chain other than the most computationally simple in a form that yields information about the relative sizes of the bounds. Thus, it is dicult to take these examples and make a rigorous statement about the step size of the underlying Markov chain and the bound acquired from it in a general setting. Section 3.1 non-rigorously discusses the eect of step size on the error term in Stein's method of exchangeable pairs for normal approximation (specically Theorem 2.5). Sections 3.2, 3.3, and 39 3.4 each contain one example of Stein's method's approximation of respectively, the binomial (with p = 1=2), Plancherel measure of the Hamming scheme (or the binomial distribution in general), and Plancherel measure of the irreducible representations of a group G by the normal distribution. Section 3.5 is tangent to Sections 3.2, 3.3, and 3.4 in that the bounds on the error from those sections are restated in terms of the eigenvalues of the chain. The nal two sections examine the approximation of the binomial and the negative binomial by the Poisson distribution. 3.1 Normal Approximation For normal approximation, we use Stein's original theorem [78], Theorem 3.1 stated below (a restatement of the second assertion of Theorem 2.5 above), not only because most other ex- changeable pair formulations stem from it, but because the error terms are similar to the error terms that occur in other formulations. Theorem 3.1. 
[78] Let $(W,W')$ be an exchangeable pair of real random variables such that $E^W W' = (1-a)W$ with $0 < a \le 1$. Also, let $E(W) = 0$ and $E(W^2) = 1$. Then for all $x_0$ in $\mathbb{R}$,
$$\left| P(W < x_0) - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x_0} e^{-x^2/2}\,dx \right| \;\le\; \frac{\sqrt{\operatorname{Var}\!\left(E^W[(W'-W)^2]\right)}}{a} + \left(\frac{E(W'-W)^4}{a}\right)^{1/4}. \qquad (3.1)$$
Remark. It has been shown [71] that Theorem 3.1 still holds without exchangeability, assuming instead that $W$ and $W'$ are equally distributed. However, it is unclear how useful this observation is in practice, as exchangeability plays a critical role in defining the pair $(W,W')$ and in computing the error from Theorem 3.1.
It has been casually noted that the error term from Theorem 3.1 should be small when $a$ is small, or equivalently when $W$ and $W'$ are "close." One line of thought [68] that supports this idea is that if $(W,W')$ is bivariate standard normal with covariance $(1-a)$, then $\operatorname{Var}(E^W[(W'-W)^2]) = a^4 \operatorname{Var}(W^2)$. This equation loosely implies that smaller values of $a$ should yield better bounds if $(W,W')$ is nearly bivariate normal. Another argument for the idea that the error term should be small when $W$ and $W'$ are close is that the bound of Theorem 3.1 follows from a Taylor series approximation of $W$ about $W'$; a simple heuristic illustrating this argument can be found in [66] and [69]. However, the Taylor approximation takes place exclusively in the numerator of the error term, so this argument does not take into account the denominator, which also decreases when $W$ and $W'$ are close (the same observations can be made directly from the error term). In accordance with these remarks, we will take the value $a$ generated by the exchangeable pair $(W,W')$ of Theorem 3.1 as a rough quantitative measure of the "step size" referred to in the introduction.
Finally, we remark here that the families of chains from Sections 3.3 and 3.4 have a similar form. Both families are canonically engineered to satisfy the linearity condition $E^W W' = (1-a)W$. To elaborate, if as described in Section 1.1 we have $(W,W') = (W(X_0), W(X_1))$ (that is, $\{X_0, X_1, \dots\}$ is a stationary Markov chain on a state space $\Omega$, reversible with respect to some distribution on $\Omega$, and $W$ is some function from $\Omega$ into $\mathbb{R}$), then
$$E[W' \mid X_0 = i] = \sum_{j \in \Omega} P(X_1 = j \mid X_0 = i)\, W(j) = (PW)_i,$$
where $P$ is the transition matrix of the Markov chain and $W$ is the vector with $W_i = W(i)$. Roughly, the linearity condition implies
$$PW = (1-a)W,$$
so that $W$ must be an eigenvector of the transition matrix $P$. In both Sections 3.3 and 3.4 the random variable $W$ under study is a member of a family of functions on the state space that is orthogonal in the $L^2$ sense with respect to the measure under study. From this point, it is not difficult to canonically define a matrix indexed by $\Omega$ with row sums equal to one which has the orthogonal family as eigenvectors and satisfies the detailed balance equations for that measure (although the entries are not guaranteed to be positive). We omit a more detailed abstract formulation, but the main ideas can be found in [75].
3.2 Binomial Distribution
In this section, we examine the most basic example of Stein's method for normal approximation. It is well known that the binomial distribution with parameters $n$ and $p$ is approximately normal for large $n$, and Stein's method can be used to obtain an error term in this approximation. The setup is as follows: let $(X_1, X_2, \dots, X_n)$ be a random vector where each $X_i$ is independent and equal to $1$ with probability $p$ ($0 < p < 1$) and $0$ with probability $(1-p)$. Then take $X = \sum_{i=1}^n X_i$.
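Before analyzing this example exactly, the quantities entering the bound (3.1) can be estimated by simulation. The following is a minimal Monte Carlo sketch (not part of the original text) using the simplest exchangeable pair for this setup, obtained by resampling one uniformly chosen coordinate; for that pair $E[W' \mid W] = (1 - 1/n)W$, so $a = 1/n$.

```python
import numpy as np

# Monte Carlo sketch: estimate the ingredients of the bound (3.1) for the standardized
# binomial, using the exchangeable pair obtained by resampling one uniformly chosen
# coordinate of (X_1, ..., X_n).  Illustrative only; the analysis below is exact.
rng = np.random.default_rng(1)
n, p, reps = 50, 0.5, 200_000

X = (rng.random((reps, n)) < p).astype(float)
W = (X.sum(axis=1) - n * p) / np.sqrt(n * p * (1 - p))

idx = rng.integers(0, n, size=reps)                 # coordinate to resample
new = (rng.random(reps) < p).astype(float)
delta = (new - X[np.arange(reps), idx]) / np.sqrt(n * p * (1 - p))
W_prime = W + delta                                 # exchangeable partner of W

a = 1.0 / n                                         # E[W'|W] = (1 - 1/n) W for this pair
d2 = (W_prime - W) ** 2
vals = np.unique(W)                                 # W is discrete, so group on its values
cond_means = np.array([d2[W == w].mean() for w in vals])
weights = np.array([(W == w).mean() for w in vals])
# crude estimate of Var(E[(W'-W)^2 | W])
var_term = np.sum(weights * cond_means**2) - np.sum(weights * cond_means) ** 2

bound = np.sqrt(var_term) / a + (np.mean((W_prime - W) ** 4) / a) ** 0.25
print("estimated bound from (3.1):", bound)
```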
In order to clearly illustrate the typical way we will analyze the rest of the examples, we will take p = 1=2 in this section. The rst thing we need to study the error term from Theorem 3.1 is the family of Markov chains that induce the exchangeable pair. Given a conguration of an n dimensional 0 1 vector, the next step in the chain follows the rule of changing any xed i coordinates with probability b i = n i , so that P n i=0 b i = 1 (b i is the probability of moving to a conguration Hamming distance i away). Since the probability of going from a conguration x to a conguration y is the same as going fromy tox, the random vector (X 1 ;:::;X n ) with each coordinate an independent Bernoulli random variable with parameterp = 1=2 is clearly reversible with respect to these chains. Setting X = P n i=1 X i and X 0 = P n i=1 X 0 i where (X 0 1 ;:::;X 0 n ) is a step in the chain described above induces an exchangeable pair. In order to apply Theorem 3.1, the random variable must have mean 0 and variance 1. SetW = XE(X) p Var(X) = Xnp p np(1p) , and deneW 0 asW but withX 0 in place of X. The nal hypothesis from Theorem 3.1 is the following lemma. Lemma 3.2. E W W 0 = 1 2 n X i=1 i n b i ! W: Proof. Let X = P n i=1 X i and X 0 = P n i=1 X 0 i as dened above. 42 E(X 0 XjX) = n X i=1 E(X 0 i X i jX) = X i:Xi=1 (1)P(X 0 i = 0jX i = 1) + X i:Xi=0 P(X 0 i = 1jX i = 0) = (n 2X) n X i=1 i n b i : Substituting X = p n=4W +n=2 and X 0 = p n=4W 0 +n=2 yields that E W (W 0 W ) = 2 P n i=1 i n b i W , which is the lemma. Now Theorem 3.1 can be applied with a = 2 P n i=1 i n b i . In order to apply the theorem, we still need to compute the quantities Var(E W [(W 0 W ) 2 ]) andE(W 0 W ) 4 . Lemma 3.3. Var(E W [(W 0 W ) 2 ]) = 16B 2 Var(W 2 ); where B = P n i=2 i(i1) n(n1) b i : Proof. E[(X 0 X) 2 jX] = n X i=1 E[(X 0 i X i ) 2 jX) + X i:Xi=1 j6=i:Xj =0 E[(X 0 i X i )(X 0 j X j )jX) + X i:Xi=1 j6=i:Xj =1 E[(X 0 i X i )(X 0 j X j )jX) + X i:Xi=0 j6=i:Xj =0 E[(X 0 i X i )(X 0 j X j )jX) =nA + (X(X 1) + (nX)(nX 1) 2X(nX))B; 43 where A = P n i=1 i n b i . Substituting X = p n=4W +n=2 and X 0 = p n=4W 0 +n=2 into the equation and solving appropriately yieldsE W [(W 0 W ) 2 ] = 4BW 2 +C whereC is some constant. Taking variances proves the lemma. Lemma 3.4. E(W 0 W ) 4 = 16 n A + 3B(n 1) + (2 3n +nE(W 4 ))D ; where A = P n i=1 i n b i , B is dened as in Lemma 3.3, and D is a constant. Proof. Let Y i =X 0 i X i . E (X 0 X) 4 jX = n X i=1 E Y 4 i jX + 4 X i j6=i E Y 3 i Y j jX + 3 X i j6=i E Y 2 i Y 2 j jX + 6 X i j6=i l6=i;j E Y 2 i Y j Y l jX + X i j6=i l6=i;j m6=i;j;l E [Y i Y j Y l Y m jX]: Counting the number of each type and computing the expected values in a manner similar to the proof of Lemma 3.3 yields E[(X 0 X) 4 jX] =nA + (4B + 6(n 2)C)(n(n 1) 4X(nX)) + [n(n 1)(n 2)(n 3) 8((nX)X(X 1)(X 2) +X(nX)(nX 1)(nX 2))]D + 3n(n 1)B: 44 HereC andD are constants that vanish in the nal expression, butC is the probability of any xed three coordinates changing, andD is the same probability but with four coordinates. Substituting inW andW 0 as in the previous lemmas and taking the expected value ofE W (W 0 W ) 4 implies the lemma. Lemma 3.5. E[W 4 ] = 3 2 n : Proof. The moment generating function for W is (t) = 1 2 h exp 1 n 1=2 t i + 1 2 h exp 1 n 1=2 t i n ; andE[W 4 ] = (4) (0). Now that we have all the formulas needed to apply Theorem 3.1, we can prove the main result of this section. Theorem 3.6. 
Using the family of reversible Markov chains described previously, the error term given by Theorem 3.1 is minimum for b 0 +b 1 = 1 (andb 0 6= 1). In this case the bound is 8 n 1=4 . Proof. Because a is positive, it is sucient to verify that Var(E W [(W 0 W) 2 ]) a 2 and E(W 0 W) 4 a are minimum for b 0 +b 1 = 1. First we examine Var(E W [(W 0 W) 2 ]) a 2 . By Lemmas 3.2, 3.3, and 3.5 and the fact that a = 2A, this term is equal to 8 n 1 n B 2 A 2 ; which is non-negative and equal to zero when b 0 +b 1 = 1. For the second term we use the second formulation in Theorem 3.1 and nd the minimum of E(W 0 W ) 4 a : (3.2) 45 By Lemmas 3.2, 3.4, and 3.5, (3.2) is equal to 8 n 1 + 3(n 1) B A ; which is also minimized when b 0 +b 1 = 1, and is equal to 8=n in this case. Remarks. 1. It is important to notice that the error term from Theorem 3.1 depends on the Markov chain only through the term B=A, where B = P n i=1 i(i1) n(n1) b i and A = P n i=1 i n b i . We will be using this fact later in Section 3.5. 2. The case b 0 +b 1 = 1 corresponds to the Markov chain that holds with probability b 0 and changes one coordinate chosen uniformly at random with probability b 1 . Therefore, this chain has the smallest maximum step size over all the chains in the family under study. Also, as mentioned in the previous section, a quantitative measure of step size is the associated value of a from Theorem 3.1, which is equal to 2b 1 =n in this optimal case. Restricting to the case where b 0 = 0, the chain yielding the best bound (b 1 = 1) has the smallest value of a. This restriction is not articial; in lieu of the previous remark, the chain generated with P(T = t) = b 0 t , where b 0 0 = 0, and b 0 i = b i =(1b 0 ) for i6= 0 has the same error term from Theorem 3.1 as the chain with P(T = t) = b t . In other words, manipulating the holding probability changes the parameter a, but can not yield bounds better than those obtained by choosing b 0 = 0. 3.3 Plancherel Distribution of the Hamming Scheme In this section we examine the uniform distribution on the eigenvalues of the adjacency matrix for the Hamming graph. The Hamming graph H(n;q) has vertex set X equal to n-tuples of 46 f1; 2;:::;qg (thusjXj = q n ) with an edge between two vertices if they dier in exactly one coordinate. The following information about the adjacency matrix of the hamming scheme can be found in [9] in the more generalized setting of association schemes. Let v i = (q 1) i n i , the number of vertices that dier from a xed vertex by i coordinates. The eigenvalues of H(n;q) are K 1 (i) = n(q1)qi with multiplicityv i fori = 0; 1; 2;:::;n. Choose i with probability vi jXj and designate this the Plancherel distribution of the Hamming Scheme. Let W (i) = K1(i) p v1 a random variable with unit variance. In order to dene the family of Markov chains that induce the exchangeable pairs, we must dene the q-Krawtchouk polynomials: K j (i) = j X l=0 (1) l (q 1) jl i l ni jl : Here and in what follows we freely use the convention m r = 0 forr>m orr< 0. Following [42], dene L T (i;j) = v j jXj n X r=0 K r (i)K r (T )K r (j) v 2 r : (3.3) For a given T inf1;:::;ng, dene a Markov chain onf0;:::;ng by the transition probability of moving from i to j as L T (i;j). Then L T (i;j) is a Markov chain onf0;:::;ng reversible with respect to the Plancherel distribution above P(i) = vi jXj . Following the usual setup, choose i from the Plancherel distribution and then j with probability L T (i;j) and set the exchangeable pair (W;W 0 ) = (W (i);W (j)). 
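The definitions just given can be checked numerically. The following sketch (not part of the original text) builds the Krawtchouk polynomials and the matrix $L_T(i,j)$ of (3.3) for small $n$ and $q$, and verifies that each row sums to one, that detailed balance holds with respect to $P(i) = v_i/|X|$, and that the entries are nonnegative for $q \ge 2$ (which the combinatorial description in Theorem 3.13 below makes transparent).

```python
import numpy as np
from math import comb

# Numerical sanity check of the chain L_T defined in (3.3): build K_j(i) and L_T(i, j)
# for small n, q and verify it is a stochastic matrix in detailed balance with the
# Plancherel weights P(i) = v_i / q^n.
n, q, T = 8, 3, 2

def K(j, i):
    """q-Krawtchouk polynomial K_j(i) for the Hamming scheme H(n, q)."""
    return sum((-1) ** l * (q - 1) ** (j - l) * comb(i, l) * comb(n - i, j - l)
               for l in range(j + 1))

v = np.array([(q - 1) ** i * comb(n, i) for i in range(n + 1)], dtype=float)
P = v / q ** n                                    # Plancherel distribution on {0, ..., n}

L = np.zeros((n + 1, n + 1))
for i in range(n + 1):
    for j in range(n + 1):
        L[i, j] = (v[j] / q ** n) * sum(K(r, i) * K(r, T) * K(r, j) / v[r] ** 2
                                        for r in range(n + 1))

assert np.allclose(L.sum(axis=1), 1.0)                    # rows sum to one
assert np.allclose(P[:, None] * L, (P[:, None] * L).T)    # detailed balance / reversibility
assert (L > -1e-12).all()                                 # nonnegative entries when q >= 2
print(np.round(L, 4))
```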
In [42], the author uses Theorem 3.1 with the exchangeable pair induced by the chainL 1 (dened by (3.3) withT = 1) to obtain a bound on the dierence between the normal distribution and the Plancherel measure. We will show that over a large family of Markov chains, L 1 is the most local chain and the error term obtained in Theorem 3.1 using L 1 (as was done in [42]) is optimal over this family. 47 Another (somewhat more motivating) way of viewing the Hamming scheme is given in the following easily veried proposition. Proposition 3.7. [42] For W dened as above, W is equal in distribution to a binomial distri- bution with parameters n and p = 1 q normalized to have mean 0 and variance 1. Thus the Plancherel distribution of the Hamming scheme can be dened as a binomial distri- bution. We will redene the Markov chain L T (i;j) in terms of this characterization, but rst we list some well known properties of Krawtchouk polynomials found in [59]. Lemma 3.8. For j;l2f0;:::;ng, n X i=0 K i (j)K i (l) v i = jXj v j j;l : Lemma 3.9. For j;l2f0;:::;ng, n X i=0 v i K j (i)K l (i) =jXjv j j;l : Lemma 3.10. For j2f0;:::;ng, (j + 1)K j+1 (i) = ((nj)(q 1) +jqi)K j (i) (q 1)(nj + 1)K j1 (i): Lemma 3.11. For i;j2f0;:::;ng; K j (i) = (q 1) ji n j n i K i (j): Finally, we need one more tool that equates the product of two Krawtchouk polynomials with a linear combination of Krawtchouk polynomials. 48 Lemma 3.12. For i;j;r2f0;:::;ng; K i (r)K j (r) = j X l=j A j;i+l (i)K i+l (r); where A j;i+l (i) = n i n i+l j X k=0 jk kl ni k i jk (q 2) j2k+l (q 1) lk : Proof. The rst thing to note is that the Krawtchouk polynomialsK i (r) fori non-negative integers form a basis for all polynomials in r, so that such a decomposition exists. Now, x i and write A j;i+l (i) =A j;i+l . For j = 0, A 0;i = 1 and A 0;i+l = 0 for l6= 0, which agrees with the lemma. Also, since K 1 (r) = (q 1)(nr)r = (q 1)nqr, Lemma 3.10 implies K i (r)K 1 (r) =i(q 2)K i (r) + (i + 1)K i+1 (r) + (q 1)(ni + 1)K i1 (r); (3.4) which is also consistent with the lemma (the equality (3.4) was shown in [42]). For j 1, we use Lemma 3.10 and strong induction to obtain (j + 1)K j+1 (r)K i (r) = ((nj)(q 1) +jqr)K j (r)K i (r) (q 1)(nj + 1)K j1 (r)K i (r) = ((nj)(q 1) +jqr) j X l=j A j;i+l K i+l (r) (q 1)(nj + 1) j1 X l=(j1) A j1;i+l K i+l (r): 49 After simplifying the terms above, we have (j + 1)K j+1 (r)K i (r) = ((nj)(q 1) +j) j X l=j A j;i+l K i+l (r) (q 1)(nj + 1) j1 X l=(j1) A j1;i+l K i+l (r) (3.5) + j X l=j [(l +i + 1)K l+i+1 (r) + (q 1)(nil + 1)K i+l1 (r) ((nil)(q 1) +l +i)K i+l (r)]A j;i+l ; where the nal equality is by Lemma 3.10. For each l, the coecient of K i+l (r) in (3.5) is (i +lj)(q 2)A j;i+l + (q 1)(nil)A j;i+(l+1) + (i +l)A j;i+(l1) (q 1)(nj + 1)A j1;i+l : (3.6) The lemma will follow if the expression above is equal to (j +1)A j+1;i+l . To see this fact, re-index the sums in the denition of A j;i in (3.6) to begin at one, and equate summands. The fact from the previous lemma that the coecients in the linear expansion are positive for q 2 has been shown without explicit computation in [37] and restated in the monograph [6]. We use that the coecients are positive in the next theorem which shows they can be used to dene a probability distribution. Theorem 3.13. For q 2, the Markov chain L T (i;j) dened onf0;:::;ng has the same transi- tion probabilities as the following chain: Given a 0 1 n-tuple with i ones, choose T coordinates at random. 
Replace every zero coordinate chosen to a one and for each one coordinate chosen, replace it with a zero with probability 1 q1 and let the coordinate remain as a one with probability q2 q1 . The probability of going from i ones to j ones is L T (i;j). 50 Proof. By summing over k equal to the number of zeros in the T coordinates chosen, the proba- bility of going from i to i +l in the chain described is T X k=0 Tk kl (q 2) T2k+l (q 1) Tk ni k i Tk n T ! : Also, L T (i;j) = v j jXj n X r=0 K r (i)K r (T )K r (j) v 2 r = v j jXj n i n j n T (q 1) i+j+T n X r=0 (q 1) r n r K i (r)K T (r)K j (r) = v j jXj n i n j n T (q 1) i+j+T n X r=0 v r K j (r) T X l=T A T;i+l K i+l (r) = v j jXj n i n j n T (q 1) i+j+T T X l=T A T;i+l n X r=0 v r K i+l (r)K j (r): The rst equality is by Lemma 3.11 and the second is by Lemma 3.12. Applying Lemma 3.9 implies L T (i;i +l) = v i+l n i n T (q 1) i+T A T;i+l : By Lemma 3.12, this equals the desired quantity. According to Proposition 3.7, the restrictionq 2 corresponds to a binomial distribution with parameters n and p with 1=2 p < 1. However, after normalizing to have mean equal to zero and unit variance, a binomial random variable with parameters n and (1p) is the negative of a binomial random variable with parameters n and p. Therefore, because the normal distribution is symmetric about zero, the following analysis can be applied to any binomial random variable. 51 The chains dened by (3:3) have now been fully described in terms of the binomial distribution. Now we will start to examine the error term from Theorem 3.1. The next lemma shows the quantity a from Theorem 3.1 is equal to qT n(q1) so that a is increasing as a function of T . Also, Theorem 3.13 implies that the maximum step size of L T is T ; smaller values of T make W and W 0 \closer" in both senses described in Section 3.1. Lemma 3.14. [42] E W W 0 = K1(T) v1 W: Proof. E[W 0 ji] = 1 p v 1 n X j=0 L T (i;j)K 1 (j) = 1 p v 1 jXj n X r=0 K r (T )K r (i) v 2 r n X j=0 v j K r (j)K 1 (j) = K 1 (T ) v 1 W (i): The rst equality is by denition and the second uses Lemma 3.9 directly. Since conditioning on i only depends on i through W , the lemma is proven. For this example, the error term (in the more general setting of association schemes) has already been computed using Theorem 3.1 in [42], so we only state the result we need. First we must dene a function p 2 onf0;:::;ng using the following description. Start from a xed X 0 in X and choose a coordinate uniformly at random. Replace the chosen coordinate by one of the remaining q 1 options dierent from the original value uniformly at random to obtain X 1 . Perform the same operation on X 1 to obtain X 2 and dene p 2 (j) to be the probability that X 2 has j coordinates dierent then X 0 . Thus p 2 (0) = 1 n(q1) , p 2 (1) = (q2) n(q1) , p 2 (2) = n1 n , and p 2 (j) = 0 for j 3. 52 Theorem 3.15. [42] Let W and a dened as above and let T be xed inf1;:::;ng. Then for all real x 0 , P(W <x 0 ) 1 p 2 Z x0 1 e x 2 2 dx v 1 a v u u t n X j=1 p 2 (j) 2 v j K j (T ) v j + 1 2K 1 (T ) v 1 2 + p v 1 1=4 2 4 n X j=0 8 6 a 1 K j (T ) v j p 2 (j) 2 v j 3 5 1=4 : Armed with this theorem, we can examine how varying the value of T aects the error term. However, rather than examining the error for a xed value of T , we dene T to be a random variable onf0;:::;ng with P (T = t) = b t (and b 0 6= 1). Note that this modication does not aect the stationary distribution of the chain. There are two reasons for using a random variable instead of a xed value ofT . 
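The probabilities $p_2(j)$ just defined enter the bound of Theorem 3.15 below, and they are easy to confirm by simulation. A short sketch (illustrative only, not part of the original text):

```python
import numpy as np

# Simulation check of the two-step probabilities p_2(j): start at a fixed word, take two
# steps of the walk that replaces one uniformly chosen coordinate by a uniformly chosen
# *different* value, and record the Hamming distance from the starting word.
rng = np.random.default_rng(2)
n, q, reps = 10, 4, 500_000

start = np.zeros((reps, n), dtype=int)
word = start.copy()
for _ in range(2):                                   # two steps of the walk
    pos = rng.integers(0, n, size=reps)
    shift = rng.integers(1, q, size=reps)            # nonzero shift = a different value
    word[np.arange(reps), pos] = (word[np.arange(reps), pos] + shift) % q

dist = (word != start).sum(axis=1)
emp = np.bincount(dist, minlength=4)[:4] / reps
exact = [1 / (n * (q - 1)), (q - 2) / (n * (q - 1)), (n - 1) / n, 0.0]
print(np.round(emp, 4), np.round(exact, 4))          # empirical vs. stated p_2(0), ..., p_2(3)
```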
The rst reason is that using a random variable yields a larger family of Markov chains. The second reason is that whenq = 2, the Plancherel distribution is a binomial distribution with parametersn andp = 1=2. In this case, Theorem 3.13 implies the chainL t (i;j) choosest coordinates with probabilityb t and changes them all. That is, L t (i;j) follows the rule of moving a (Hamming) distance t away with probability b t and we recover the chain from Section 3.2. We will show that the random variable T that minimizes the error term from Theorem 3.1 is induced by the Markov chain used in [42] which has b 1 = 1 (or alternatively b 0 +b 1 = 1 and b 0 6= 1) and all other b t = 0. The computation of the error term with T as a random variable closely follows the proof of Theorem 3.15 from [42]. To apply Theorem 3.1, we need to compute the terms in the bound on the error. It will be helpful to dene (W t ;W 0 t ) to be the exchangeabe pair dened in the usual way from the Markov chain L t (i;j). Note that W t = W since using a dierent Markov chain in the family does not alter the stationary distribution. 53 The next two lemmas will generate all the terms needed to apply Theorem 3.1. Lemma 3.16. [42] E[(W 0 t W t ) 2 ji] =v 1 n X r=0 K r (i)K r (t) v 2 r p 2 (r) + 1 2K 1 (t) v 1 W 2 t : Var(E[(W 0 t W t ) 2 ji]) =v 2 1 n X r=1 p 2 (r) 2 v r 1 + K r (t) v r 2K 1 (t) v 1 2 : E[(W 0 t W t ) 4 ] =v 2 1 " n X r=0 8 1 K 1 (t) v 1 6 1 K r (t) v r p 2 (r) 2 v r # : Lemma 3.17. E W [W 0 ] = n X t=0 b t K 1 (t) v 1 ! W: E[(W 0 W ) 2 ji] =v 1 n X r=0 K r (i) ( P n t=0 b t K r (t)) v 2 r p 2 (r) + 1 2 ( P n t=0 b t K 1 (t)) v 1 W 2 : Var(E[(W 0 W ) 2 ji]) =v 2 1 n X r=1 p 2 (r) 2 v r 1 + n X t=0 b t K r (t) v r 2K 1 (t) v 1 ! 2 : E[(W 0 W ) 4 ] =v 2 1 n X r=0 8 1 P n t=0 b t K 1 (t) v 1 6 1 P n t=0 b t K r (t) v r p 2 (r) 2 v r : Proof. We prove only the rst equality; the proofs of the remaining are similar. By summing over t equal to the value of T chosen, we have E[W 0 ji] = 1 p v 1 n X t=0 b t n X j=0 L t (i;j)K 1 (j): The equality now follows from the proof of Lemma 3.14. Applying Lemma 3.17 to Theorem 3.1 proves the following theorem. 54 Theorem 3.18. If W andfb t g n t=0 are dened as above, a = 1 P n t=0 b t K1(t) v1 , and all other variables are as in Theorem 3.15, then for all real x 0 , P(W <x 0 ) 1 p 2 Z x0 1 e x 2 2 dx v 1 a v u u t n X j=1 p 2 (j) 2 v j P n t=0 b t K j (t) v j + 1 2 ( P n t=0 b t K 1 (t)) v 1 2 + p v 1 1=4 2 4 n X j=0 8 6 a 1 P n t=0 b t K j (t) v j p 2 (j) 2 v j 3 5 1=4 : We can now analyze the error term of Theorem 3.18 overL T (i;j), the family of Markov chains previously dened. Theorem 3.19. The Markov chain L T (i;j) that minimizes the error term from Theorem 3.18 is the chain with P(T = 1) =b 1 = 1. Proof. The right hand side of the inequality of Theorem 3.18 can be rewritten as v 1 v u u u t n X j=1 p 2 (j) 2 v j 0 @ P n t=0 b t Kj (t) vj 1 a + 2 1 A 2 (3.7) + p v 1 1=4 2 4 n X j=0 0 @ 8 + 6 0 @ P n t=0 b t Kj (t) vj 1 a 1 A 1 A p 2 (j) 2 v j 3 5 1=4 : (3.8) We will show that P n t=0 b t Kj (t) vj 1 a (3.9) is minimum for each j under some setB =fb t g n t=0 , which implies (3.8) is minimum underB (notice v j 0 for all j). We will then show that the minimum value of (3.9) is no less than2 for each j, so that (3.7) is also minimum underB. 55 Since p 2 (j) = 0 for j 3, we only consider j = 0; 1; 2. For j = 0, K 0 (t) = v 0 = 1, so that (3.9) does not depend on t. For j = 1, the numerator of (3.9) is equal toa so that (3.9) is independent of t. 
For j = 2, a straightforward calculation (using P n t=0 b t = 1) yields P n t=0 b t K2(t) v2 1 a + 2 = q q 1 P n t=2 b t t(t1) n(n1) P n t=1 b t t n : (3.10) Since the summand in the numerator is 0 when t = 1 and positive for all other t values (whereb t is positive), the nal term in (3.10) is minimum for b 1 +b 0 = 1 and b t = 0 for t> 1. Remarks. 1. Similar to Section 3.2, Theorem 3.13 implies that the case b 0 +b 1 = 1 corresponds to the Markov chainL T having step size at most one and with associated value ofa from Theorem 3.1 equal to qb1 (q1)n . Among chains with b 0 = 0 the chain yielding the best bound has the smallest maximum step size and the smallest value of a. We refer to the remarks following Theorem 3.6 on this restriction. 2. It is interesting to note that the size of the bound on the error in this section and in Section 3.2 both depend on the underlying chain through the same term (B=A from Section 3.2). This is obvious from the case q = 2, since this case is the same in both sections, but it is not clear why this should carry over to other values of q. 3.4 Plancherel Distribution on a Group In this section we examine the Plancherel measure of the random walk generated by the conjugacy class of transpositions on S n (the symmetric group on n symbols). First we describe the setup for any group in order to state the theorems we will use in the utmost generality. Let G be a group and C be a nontrivial conjugacy class of G such that C = C 1 . Dene the random walk on G generated by C, a Markov chain with state space G, as follows: given g in G, the next 56 step in the chain is gh where h is chosen uniformly at random from C. Now, denote the set of irreducible representations ofG byIrr(G). From [34], for each character of inIrr(G), there is an eigenvalue of the random walk on G generated by C given by (C) dim() (termed a character ratio) occurring with multiplicity dim() 2 . A well known fact from representation theory is (see for example [74]) X 2Irr(G) dim() 2 =jGj; so that we dene the Plancherel measure of G to choose in Irr(G) with probability dim() 2 jGj . Then the unit variance random variable W () = jCj 1=2 (C) dim() is essentially an eigenvalue of the random walk above chosen uniformly at random. In [42] the author denes a Markov chain reversible with respect to the Plancherel measure of a group G with the probability of transitioning from to given by L (;) = dim() dim()dim() 1 jGj X g2G (g) (g) (g): Here is some representation ofG. Following the usual setup, dene an exchangeable pair (W;W 0 ) byW =W () where is chosen from the Plancherel measure ofG, and deneW 0 =W () where is given by one step in the above chain. For the sake of continuity, we will postpone discussing properties of this Markov chain family until later in this section. Using facts from representation theory, [42] proves the following lemma which allows for the application of Theorem 3.1. Lemma 3.20. [42] E W W 0 = (C) dim() W: Theorem 3.21. [42] Let C be a conjugacy class of a nite group G such that C =C 1 and x a nontrivial irreducible representation of G whose character is real valued. Let be a random 57 irreducible representation chosen from the Plancherel measure of G. Let W = jCj 1=2 (C) dim() . 
Then for all real x 0 , P(Wx 0 ) 1 p 2 Z x0 1 e x 2 2 dx jCj v u u u t X K6=id p 2 (K) 2 jKj 0 @ (K) dim() 1 a + 2 1 A 2 + 2 4 jCj 2 X K p 2 (K) 2 jKj 0 @ 8 + 6 0 @ (K) dim() 1 a 1 A 1 A 3 5 1=4 : Here the sums are over conjugacy classes K of G, a = 1 (C) dim() , and p 2 (K) is the probability that the random walk onG generated byC started at the identity is at the conjugacy class K after two steps. After proving this theorem, the author applies it with the choice of the irreducible represen- tation corresponding to the partition (n 1; 1) (more on this notation later) to obtain a central limit theorem with an error term in the case where G = S n and C is the conjugacy class of i-cycles. We will show that this choice of minimizes the error term from Theorem 3.21 in the special case where C is the conjugacy class of transpositions (2-cycles). Before going further we will explain the notation above and state some facts about the irre- ducible representations of S n found in [74]. For S n , the irreducible representations are indexed by partitions of the integer n where the partition (n) corresponds to the trivial representation. Another way to index the irreducible representations (which can be more useful for combinatorial reasons) is to associate to each partition a \tableau" which is a left justied array of equally sized, aligned boxes (we will typically abuse notation and refer to as both the partition and the diagram). For a partition = ( 1 ; 2 ;:::; k ) of n, where 1 ::: k > 0, the associated tableau has 1 boxes in the rst row, 2 boxes in the second row, and so on. Now, numbering 58 the boxes one through n left to right, top to bottom (with this indexing, this is technically the largest Standard Young Tableau), we make the following denition. Denition. Let = ( 1 ; 2 ;:::; k ) a partition ofn. Dene the content of boxi (in the labeling above) of to be c (i) =column (i)row (i). In order to clarify these denitions, Table 3.1 is the labeled tableau for = (5; 3; 3; 1), a partition of n = 12. 1 2 3 4 5 6 7 8 9 10 11 12 Table 3.1: Labeled tableau for = (5; 3; 3; 1) Table 3.2 below is the same tableau, but the box labeled i above now has the value of the content c (i). 0 1 2 3 4 -1 0 1 -2 -1 0 -3 Table 3.2: = (5; 3; 3; 1) with box i labeled c (i) Because we are specializing to the conjugacy class C of transpositions and p 2 (K) is the prob- ability of being in the conjugacy class K after two steps in the random walk on S n generated by C starting at the identity element, p 2 (K) = 0 for many conjugacy classes. The following lemma formulates (K) dim() in terms of the contents of for conjugacy classes K where p 2 (K)6= 0 (the conjugacy classes that contribute to the error term in Theorem 3.21). The lemma is proved in the form shown here using Murphy's elements in [33], but can also be found in terms of the i in [55]. Lemma 3.22. [33] Let (j) and (2; 2) be the character of the irreducible representation at a the conjugacy class of a j-cycle and two 2-cycles, respectively. Then 59 (id) dim() = 1; (2) dim() = 2 n(n 1) n X i=1 c (i); (3) dim() = 3 n(n 1)(n 2) n X i=1 c (i) 2 n 2 ! ; (2; 2) dim() = 1 6 n 4 0 @ n X i=1 c (i) ! 2 3 n X i=1 c (i) 2 +n(n 1) 1 A : For the example from Tables 3.1 and 3.2 where = (5; 3; 3; 1), the sum of the contents is equal to four and n is equal to twelve. Thus Lemma 3.22 implies, for example, (2) dim() = 2=33. We pause here to discuss some relevant properties of the chain L . 
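Before turning to those properties, the content formulas of Lemma 3.22, and the value 2/33 computed above, can be spot-checked numerically. A minimal sketch (not part of the original argument):

```python
from math import comb

# Numerical check of the content formulas in Lemma 3.22 on the worked example
# lam = (5, 3, 3, 1) with n = 12 from Tables 3.1 and 3.2.
def contents(lam):
    """Contents c(box) = column - row (0-indexed) of the Young diagram lam."""
    return [col - row for row, length in enumerate(lam) for col in range(length)]

lam = (5, 3, 3, 1)
n = sum(lam)
c = contents(lam)
s1, s2 = sum(c), sum(x * x for x in c)

ratio_2 = 2 * s1 / (n * (n - 1))                                  # chi(2) / dim
ratio_3 = 3 * (s2 - comb(n, 2)) / (n * (n - 1) * (n - 2))         # chi(3) / dim
ratio_22 = (s1**2 - 3 * s2 + n * (n - 1)) / (6 * comb(n, 4))      # chi(2,2) / dim

assert abs(ratio_2 - 2 / 33) < 1e-12                              # the value quoted above
print(ratio_2, ratio_3, ratio_22)
```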
First, we reiterate the remarks at the end of Section 3.1; the denition of the family of Markov chains stems from the orthogonality relations of irreducible characters (see [41] and the references therein). Second, because it is not known how to combinatorially describe L in general, it is not obvious how to relate the choice of to step size. However, some representations do have nice descriptions and we can use these to gain some intuition. From [39], if is the dening representation onS n , then given an irreducible representation,L follows the rule of removing a corner box of uniformly at random (so that the resulting diagram of n 1 boxes remains a tableau) and moving it to a uniformly chosen concave open position of the altered diagram (to obtain a new tableau of size n). For the trivial representation, the dening representation, and := (n 1; 1), we have [74] = + , dim() = 1, dim() =n, and dim() =n 1 so that L = nL L n 1 : 60 Now, using the orthogonality relations of characters and the fact that 1 it is easy to see the chain L holds with probability one. Thus, the previous remarks imply that from an irreducible representation , the chain L moves at most one box of . Also, again using orthogonality relations, it follows that for any irreducible representations and , we have L (;) = . That is, from the trivial representation the chain L moves to with probability one, which will move more than one box if 6=. Lemma 3.22 implies that W () = n 2 1=2 P n i=1 c (i), so that moving more boxes of a partition to obtain W 0 fromW corresponds to a larger step size. This is one sense in which has the smallest step size among nontrivial irreducible representations. As discussed previously, a more quantitative measure of step size of the chain L is the value of a from Theorem 3.21. From Lemma 3.20, we have E W W 0 = (2) dim() W = 2 n(n 1) n X i=1 c (i) ! W: By the denition of c (i), it is easy to see that a is minimum over nontrivial irreducible repre- sentations for =, and strictly increases as boxes of a partition are moved from higher to lower rows (this observation motivates Lemma 3.23 below). This is another sense in which has the smallest step size among nontrivial irreducible representations. Now, analogous to Section 3.3, in order to show that the error term from Theorem 3.21 is minimum over nontrivial irreducible representations at = (n 1; 1), we will show that for each conjugacy classK from Lemma 3.22, the termT K () = (K) dim() 1 . a is minimum and at least negative two for = . Moreover, we will dene an ordering on all irreducible representations such that the largest nontrivial irreducible representation in the ordering is (n 1; 1) and then show that T K is a decreasing function with respect to the ordering. This non-standard ordering is a coarsening of the usual dominance ordering found in [74]. 61 Denition. Let = ( 1 ;:::; k ) and = ( 1 ;:::; m ) be two irreducible representations of S n . We say succeeds , denoted , if 1 1 , i = i for i = 2;:::;k 1, and k k . For i>m (i>k), take i = 0 ( i = 0). It is obvious that (n 1; 1) = for all nontrivial irreducible representations , so that to prove the claims above it is enough to show that implies T K ()T K () for each K from Lemma 3.22. The following lemma is a simpler characterization of the succession relation. Lemma 3.23. For two irreducible representations and of S n , if and only if can be transformed from only by moving blocks from the associated tableau of from the top row to either the bottom row of or to start a new row. Proof. 
If the dening relations of succession hold, and the two representations are not equal, then either k < k or k+1 6= 0. In the former case move a block from 1 to k , in the latter start a new row. Inductively, continuing in this way will eventually lead to since the new tableau still succeeds . Conversely, if can be made from with the described method, then clearly the dening relations of succession must hold. Lemma 3.23 states that in order to prove impliesT K ()T K () (thus that = (n1; 1) minimizes the error term), it is enough to prove the statement in the case = ( 1 1; 2 ;:::; k + 1); where in order to cover both actions described in Lemma 3.23 we break convention and permit k = 0 (only if k1 6= 0). We make one nal (non-standard) denition before we begin the proof of the claims above. Denition. Let = ( 1 ;:::; k ) and = ( 1 1; 2 ;:::; k +1). Dene a new tableau, the joint of and by j = ( 1 1; 2 ;:::; k ): 62 Now that we have suitable denitions, we can begin to prove the lemmas that will be used to obtain the main result. Lemma 3.24. For any tableau , T id () = 0 and T (2) () =1. Proof. The rst assertion follows from the fact that for any representation, the identity element is mapped to the identity matrix so that (id) =dim(). The second is obtained directly from the denition of a . The next lemma states a simpler criterion for determining which of T (3) () and T (3) () is larger. Lemma 3.25. With and as above,T (3) ()T (3) () is non-negative if and only iff() (dened below) is also non-negative. f() = ( 1 + k k) n1 X i=1 c j (i) n 2 ! + ( 1 1)( k + 1k) n1 X i=1 c 2 j (i) + n 2 + 2 n 3 : Proof. T (3) ()T (3) () = 1 1 2( n 3 ) P i c 2 (i) n 2 a 1 1 2( n 3 ) P i c 2 (i) n 2 a a a : (3.11) 63 Notice that P i c (i) is maximized for = (n) in which case it is n 2 and strictly less for all other tableau. Thus, for all tableau under consideration, a > 0 so that the non-negativity of (3.11) is determined by the numerator. The numerator of (3.11) is equal to P i (c 2 (i)c 2 (i)) 2 n 3 + P i (c (i)c (i)) n 2 + P i c 2 (i) P j c (j) P i c (i) P j c 2 (j) + n 2 P i (c (i)c (i)) 2 n 3 n 2 : (3.12) Rewriting the sums P i c t (i) = P i c t j (i)+( 1 1) t and P i c t (i) = P i c t j (i)+( k +1k) t for t = 1; 2 and then simplifying implies (3.12) is equal to ( k + 1k) 2 ( 1 1) 2 2 n 3 + ( 1 1) ( k + 1k) n 2 + ( 1 1) 2 ( k + 1k) 2 P i c j (i) 2 n 3 n 2 + ( 1 1)( k + 1k) (( 1 1) ( k + 1k)) 2 n 3 n 2 (( 1 1) ( k + 1k)) P i c 2 j (i) 2 n 3 n 2 + n 2 (( 1 1) ( k + 1k)) 2 n 3 n 2 : Multiplying by 2 n 3 n 2 =(( 1 1) ( k + 1k)) proves the lemma (since 1 > k and k 2, this multiplication does not aect the non-negativity of the term). Now we can prove the following lemma. Lemma 3.26. Let and be two irreducible representations on the symmetric group. If , then T (3) ()T (3) (): Proof. Lemma 3.25 implies that in order to show the monotonicity of T (3) with respect to the succession relation, we only need to show that f() 0 for all tableau = ( 1 ;:::; k ) with 64 1 > 2 and k < k1 (where k may possibly be zero). In order to prove the lemma, we use induction on the n, the number being partitioned. In each of the three cases below, we associate to each tableau of size n + 1 a tableau of size n where f() 0 by the induction hypothesis. It can be easily veried that in the case of n = 3, the only nontrivial tableau satisfying 1 > 2 and k < k1 is = (2; 1; 0), and f() 0. Case 1: k 6= 0. For this case, let = ( 1 ;:::; k1 ; k 1) be a partition of n and let be equal to the partition ( 1 1;:::; k1 ; k ). 
Then we have f() = ( 1 + k k) n X i=1 c j (i) n + 1 2 ! + ( 1 1)( k + 1k) n X i=1 c 2 j (i) + n + 1 2 + 2 n + 1 3 ; f() = ( 1 + ( k 1)k) n1 X i=1 c j (i) n 2 ! + ( 1 1)(( k 1) + 1k) n1 X i=1 c 2 j (i) + n 2 + 2 n 3 : Since f() is non-negative by the induction hypothesis, we will show that f()f() 0 which will prove the lemma for this case. Simplifying this expression using the identity n+1 k = n k + n k1 and the fact that n1 X i=1 (c j (i)) t = n X i=1 (c j (i)) t ( k k) t t = 1; 2; we obtain f()f() = (n 1)( 1 1) + (n 1 + 1)( k k) n X i=1 c j (i) n 2 : 65 Now we bound the sum of the contents; n X i=1 c j (i) 12 X i=1 i + k X i=1 (ik) + (n ( 1 1) k )(1 (k 1)) = ( 1 1)( 1 2) 2 + k ( k 3) 2 + (2k)(n 1 + 1): The inequality follows because the rst term on the right hand side is the sum of the contents of row one, the second term is the sum of the contents of row k, and the third term is the number of boxes not in rows one or k times the minimum content of those boxes. The inequality above shows that f()f()g( 1 ; k ) where the function g is dened by: g( 1 ; k ) = (n 1)( 1 1) + (n 1 + 1)( k 2) ( 1 1)( 1 2) 2 k ( k 3) 2 n 2 : LetD =f( 1 ; k ) : k n + 1 1 ; 1 k 1 2g and notice that the allowable values of ( 1 ; k ) are a subset ofD. A straightforward (but tedious) analysis shows the maximum of g over the domainD is non-positive, which proves the lemma for this case. Case 2: k = 0, k1 6= 1. In this case, let = ( 1 ;:::; k1 1; 0) and then let be equal to the partition ( 1 1;:::; k1 1; 1). Then let f() = ( 1 k) n1 X i=1 c j (i) n 2 ! + ( 1 1)(1k) n1 X i=1 c 2 j (i) + n 2 + 2 n 3 ; 66 and notice f() 0 by the induction hypothesis. After simplifying using the fact that n1 X i=1 (c j (i)) t = n X i=1 (c j (i)) t ( k1 k + 1) t t = 1; 2; we obtain f()f() =n( 1 k) + ( k1 1 + 1)( k1 k + 1)n 2 : (3.13) The right hand side of equation (3.13) is increasing in 1 , so that the maximum is attained for 1 =n. In this case it is easy to see (3.13) is non-positive. Case 3: = ( 1 ;:::; k2 ; 1; 0). In this case, let = ( 1 ;:::; k2 ; 0) and let be equal to the partition ( 1 1;:::; k2 ; 1). Analogously to the previous two cases, dene f() = ( 1 (k 1)) n1 X i=1 c j (i) n 2 ! + ( 1 1)(1 (k 1)) n1 X i=1 c 2 j (i) + n 2 + 2 n 3 ; which implies f()f() = (n +k)( 1 n) + n1 X i=1 c j (i) n 2 + 3 2k 1 : Using that 1 n and thus P n1 i=1 c j (i) n1 2 , it is easy to see the term is negative. Now we move on to the nal conjugacy class needed to prove the result. As before we have the following lemma which states a simpler criterion for determining which of T (2;2) () and T (2;2) () is larger. 67 Lemma 3.27. With and as above, T (2;2) ()T (2;2) () is non-negative if and only if h() (dened below) is also non-negative. h() = 2 ( 1 + k k) n 2 n1 X i=1 c j (i) ! 2( 1 1)( k + 1k) 2 n 2 n1 X i=1 c j (i) + n1 X i=1 c j (i) ! 2 + 3 n1 X i=1 c 2 j (i) + 6 n 4 2 n 2 : Proof. T (2;2) ()T (2;2) () = 1 (2;2) dim() a 1 (2;2) dim() a a a : (3.14) As in Lemma 3.25, the non-negativity of (3.14) is determined by the numerator. Using the formulas from Lemma 3.22 and rearranging, the numerator of (3.14) is equal to ( P i c (i)) 2 ( P i c (i)) 2 3 P i [c 2 (i)c 2 (i)] 6 n 4 + P i (c (i)c (i)) n 2 + h ( P i c (i)) 2 3 P i c 2 (i) + 2 n 2 i P j c (j) 6 n 4 n 2 h ( P i c (i)) 2 3 P i c 2 (i) + 2 n 2 i P j c (j) 6 n 4 n 2 : Analogous to Lemma 3.25 for 3-cycles, simplifying the term by rewriting the sums P i c t (i) = P i c t j (i)+( 1 1) t and P i c t (i) = P i c t j (i)+( k +1k) t fort = 1; 2 yields a term proportional toh(). 
In this case, the constant of proportionality is 6 n 4 n 2 =(( 1 1) ( k + 1k)), which is positive and so does not aect the non-negativity. 68 Now we can prove the following lemma. Lemma 3.28. Let and be two irreducible representations on the symmetric group. If , then T (2;2) ()T (2;2) (). Proof. We will exactly follow the strategy of the proof of Lemma 3.26 with the functionh replacing f. Case 1: k 6= 0. For this case, let = ( 1 ;:::; k1 ; k 1), a partition of n, and be equal to the partition ( 1 1;:::; k1 ; k ). Then following the proof of Lemma 3.26, (h()h())=2 = ( 1 1)( k k + 1) + k n 2 (k + 1) n 2 + (n k +k + 1) n X i=1 c j (i)n( 1 + k k 1) 3 n 3 ( 1 1)( k k + 1) + n 2 ( k k 1) + (n k +k + 1) n 1 2 n( 1 + k k 1) 3 n 3 = ( 1 2)( k k) 1 (n 1) 0: Here the rst inequality follows because n k +k + 1 0 and P i c j (i) n1 2 , and the nal inequality from k kn 1 and 1 2. Case 2: k = 0, k1 6= 1. In this case, let = ( 1 ;:::; k1 1; 0) and then let be equal to the partition ( 1 1;:::; k1 1; 1). Then we have 69 (h()h())=2 = ( 1 k)( k1 k + 1n) + n 2 ( k1 k + 1) + (n k1 +k 1) n X i=1 c j (i) ( k1 k + 1) 2 3 n 3 +n ( 1 k)( k1 k + 1n) + n 2 ( k1 k + 1) + (n k1 +k 1) n 1 2 ( k1 k + 1) 2 3 n 3 +n = ( 1 k1 2)( k1 k + 1n): The inequality follows becausen k1 +k 1 0 and P i c j (i) n1 2 . If 1 k1 2 0, then this case is shown as k1 k+1n 0. If not, then 1 = k1 +1 which implies j = k1 for j = 2:::k 1 (since 1 2 + 1). In this case, (k 1) k1 =n and X i c j (i) = k1 X i=1 k1 X j=1 ji = n 2 ( k1 + 1k): Then we have (h()h())=2 = ( k1 + 1k)( k1 k + 1n) + n 2 ( k1 k + 1) + n 2 ( k1 + 1k)(n k1 +k 1) ( k1 k + 1) 2 3 n 3 +n: This is a downward facing parabola in k1 with roots equal to n +k 1 and n +k 4, neither of which is a possible value of k1 since k must be at least three. Case 3: = ( 1 ;:::; k2 ; 1; 0). In this case, let = ( 1 ;:::; k2 ; 0) and be the partition ( 1 1;:::; k2 ; 1). Then we have 70 (h()h())=2 = (n +k 3) n X i=1 c j (i) +k 1 n 1 2 ! +n(4k) 2 1 +k (2k) 2 (3.15) (n 1)(4 1 ) +k(2 1 ): (3.16) To see the inequality, notice that n +k 3 0 and P i c j (i) n1 2 . The term (3.16) is clearly negative for 1 4, and the smaller cases can be individually handled starting from (3.15) (for example 1 = 2 implies P i c j (i) = n 2 and k =n + 1). The main result of this section is stated in the following. Theorem 3.29. If for nontrivial irreducible representations and , then the error term from Theorem 3.21 associated to is no larger than the error term associated to . In particular, the error term is minimum for = (n 1; 1). Proof. By the previous lemmas and remarks, the theorem will be proved if T (3) () and T (2;2) () are no less than negative two (for K equal to the identity or transposition conjugacy class, T K is constant). Using the formulas from Lemma 3.22, or the fact [74] that for any permutation g, (g) is equal to the number of xed points of g minus one and dim() =n 1, we have T (3) () =3=2; T (2;2) () =2: 71 3.5 Eigenvalue Characterization Heuristically, increasing the step size of a Markov chain has the eect of making the chain become \random" faster. This suggests that for an ergodic chain, the step size may be related to the rate of convergence to stationarity. From basic facts about reversible Markov chains on a nite state space [15], all the chains previously dened are ergodic if and only if they are irreducible and aperiodic. In this case, the rate of convergence to stationarity is determined by the eigenvalues, where the eigenvalues with the largest moduli make the largest asymptotic contributions. 
Because of this relationship, we will express the error terms from the past sections in terms of the eigenvalues of the Markov chains used to induce the exchangeable pairs. We rst examine the chains from Section 3.3 (and hence also Section 3.2 as previously men- tioned). For a xed value of t, the eigenvalues for L t (i;j) have been computed in [42], but we include the proof since it is illustrative. Recall the denitions and notation from Section 3.3. Lemma 3.30. [42] For xed t2f0; 1;:::;ng, the eigenvalues of L t (i;j) are Ks(t) vs for s = 0:::n. Proof. By denitions and Lemma 3.9, n X j=0 L t (i;j)K s (j) = 1 q n n X r=0 K r (i)K r (t) v 2 r n X j=0 v j K s (j)K r (j) = K s (t) v s K s (i): In other words, ifM is the transition matrix ofL t (i;j) and v is the vector with coordinatej equal to K s (j), then Mv = Ks(t) vs v. A proof of the next lemma follows along the lines of the proof of Lemma 3.17. Lemma 3.31. The eigenvalues of L T (i;j) as dened in Section 3.3 are for s = 0;:::;n, s = n X t=0 b t K s (t) v s : 72 Recall from the proof of Theorem 3.19 the bound on the error depends only on the following term for s = 0; 1; 2, P n t=0 b t Ks(t) vs 1 1 P n t=0 b t K1(t) v1 : (3.17) By Lemma 3.31, (3.17) can be rewritten as ( s 1)=(1 1 ) for s = 0; 1; 2. For s = 0 and s = 1, (3.17) does not depend on values b t , so that we have the following theorem. Theorem 3.32. The size of the error term given by Stein's method via the family of chains from Section 3.3 is a monotone increasing function of 2 1 1 1 ; where 1 and 2 are dened as in Lemma 3.31. A natural problem that arises is to ascertain under what setting the eigenvalues that determine the error term will be the eigenvalues with largest moduli (not including 0 = 1), as these are the eigenvalues with the largest contribution to the rate of convergence. It follows immediately from denitions that in the caseb 0 +b 1 = 1 where the error term is minimum, s = 1 b1qs n(q1) , so that the ordering of the eigenvalues corresponds to the subscript notation. In this case the eigenvalues that aect the error term will be the largest in moduli if and only if b 1 2n(q1) q(n+2) . For the chains L from Section 3.4, the eigenvalues have already been computed in [42] using orthogonality relations. Recall the denitions and notation from Section 3.4. Lemma 3.33. [42] The eigenvalues of L as dened in section 3.4 are for conjugacy classes C of the group G, C = (C) dim : 73 Also, from Theorem 3.21 and the remarks following it, the bound on the error term depends only on the following term for the conjugacy classes K = (id); (2); (3), and (2; 2), (K) dim 1 1 (2) dim : (3.18) By Lemma 3.33 (3.18) can be rewritten as ( K 1)=(1 (2) ) forK = (id); (2); (3), and (2; 2). For K = (id) and K = (2), (3.18) does not depend on the irreducible representation used to generate L , so that we have the following theorem. Theorem 3.34. The size of the error term given by Stein's method via the family of chains from Section 3.4 is a monotone increasing function of K 1 1 (2) ; where K is dened as in Lemma 3.33 and K = (3) or K = (2)(2). Once again it is natural to ask for a given irreducible representation , whether the three eigenvalues that aect the error term are those with the largest moduli (not including (id) = 1). For = (n 1; 1), the representation where the error term is minimum, it is well known [74] that (K) dim = F K 1 n 1 ; where F K is dened to be the number of xed points of K. 
In this case, the eigenvalues that aect the error term are the eigenvalues with the second, third, and fourth largest moduli. 3.6 Poisson Approximation For the nal two sections of this chapter we change focus from approximation by the normal distribution to approximation by the Poisson distribution. We will examine how the step size of 74 the Markov chain that induces an exchangeable pair aects the error term in Theorem 2.13 (the slightly modied version below). Theorem 2.13 [22] Let W = P i X i , a sum of non-negative integer valued random variables, such that E(W ) = . Let (W;W 0 ) an exchangeable pair, c any constant, and Z denote the Poisson random variable with mean . Then for C = min(1; 1=2 ), d TV (W;Z) C E cP(W 0 =W + 1jfX i g) +E WcP(W 0 =W 1jfX i g) : (3.19) Remarks. 1. It has been shown [71] that a result similar to Theorem 2.13 but with additional error terms still holds assuming only that W and W 0 are equally distributed. 2. Ideally, c should be chosen so that we have the approximate equalities P(W 0 =W + 1jfX i g) c ; (3.20) P(W 0 =W 1jfX i g) W c : (3.21) It is shown in [22] that intuitively the existence of such a constant is likely, a heuristic that is reinforced in the examples presented there. In fact, if (W 0 W )2f1; 0; 1g and E W (W 0 W ) =a(WE(W )), it is easy to see that for the choice of c = 1=a we have the same error in the approximate equalities (3.20) and (3.21). This is in general a useful guide for the choice of the constant c (for more on this line of thought see [70]). One of the main technical details in analyzing (3.19) for dierent exchangeable pairs, is the choice of the constant c. It would be preferable to have a systematic method of choosing this constant based on the exchangeable pair so that the results here are not contrived. Ideally, we 75 would choose the constant c to minimize the error terms from Theorem 2.13, or more feasibly, their Cauchy-Schwarz bound (choosec to make the expectation of the terms in the absolute value signs zero). However, in the examples presented here, we will choose the constant c to yield the best possible bound under the constraint that the terms in the absolute value signs are positive. Admittedly, part of the reason for this restriction is technical convenience, but in practice choosing the constant in this way is typical (see the examples of [22]). In the next section we will compare the error terms using both of these strategies in a small example. Both of the examples presented here are sums of i.i.d. random variables (Bernoulli and ge- ometric). Even the simplest introduction of dependence (e.g. the hypergeometric distribution) yield results that make the type of analysis in this paper dicult. Because we are in the setting of independence, we can use the same family of Markov chains for both examples. Given a vector (X 1 ;:::;X n ) of non-negative integer valued i.i.d. random variables, the next step in the chain follows the rule of choosing k coordinates uniformly at random and replacing them with k new i.i.d. random variables (with the same distribution as the original). It is not hard to see that this chain is reversible with respect to vectors of i.i.d. random variables and hence generates an exchangeable pair. Extending this exchangeable pair to the sum of the components of the vector allows for the application of Theorem 2.13. Finally, notice that under this chain it is not clear how modifying the number of coordinates chosen to be selected (i.e. varying k) will aect the error term. 
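To make the question concrete, the bound (3.19) can be computed numerically for this chain in a small Bernoulli example, with $c$ chosen to make the first error term have mean zero (one of the strategies discussed above); the next two sections instead fix explicit constants $c_k$ and work analytically. A sketch (illustrative only, not part of the original text):

```python
import numpy as np
from scipy.stats import binom, hypergeom

# Exact numerical sketch: compute the two error terms of (3.19) for the chain that
# resamples k coordinates of an i.i.d. Bernoulli(p) vector.  Given {X_i}, the chance
# that one step raises (lowers) W by one depends only on the current value w of W;
# here c is chosen so that the first term has mean zero.
n, p = 100, 0.02
lam = n * p
w_vals = np.arange(n + 1)
pw = binom.pmf(w_vals, n, p)                       # distribution of W

for k in (1, 2, 5):
    i_vals = np.arange(k + 1)
    # number of ones among the k resampled coordinates is hypergeometric given W = w
    hyp = hypergeom.pmf(i_vals[None, :], n, w_vals[:, None], k)
    p_up = (hyp * binom.pmf(i_vals[None, :] + 1, k, p)).sum(axis=1)    # P(W'=W+1 | W=w)
    p_down = (hyp * binom.pmf(i_vals[None, :] - 1, k, p)).sum(axis=1)  # P(W'=W-1 | W=w)
    c = lam / np.sum(pw * p_up)                    # makes E[lam - c P(W'=W+1|.)] vanish
    err = np.sum(pw * np.abs(lam - c * p_up)) + np.sum(pw * np.abs(w_vals - c * p_down))
    print(k, round(min(1.0, 1.0 / np.sqrt(lam)) * err, 5))
```

The printed values give a feel for how the choice of $k$ enters the bound; the sections that follow make the comparison precise with explicit constants.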
3.7 Binomial Distribution It is well known that the binomial distribution with parameters n and p converges to a Poisson distribution with mean as n tends to innity if np tends to . For simplicity, in this section we consider the case where p = 1=n, so that = 1. In this case we will show that among the exchangeable pairs associated with the family of Markov chains described in Section 3.6 the term 76 from Theorem 2.13 is minimized when k = 1. First we will prove some lemmas that will be used to compute the error term from the theorem. Lemma 3.35. LetP k denote probability under the chain that substitutesk coordinates as described in Section 3.6. Then P k (W 0 =W + 1jfX i g) = k1 X i=0 k i + 1 (n 1) ki1 n k W i nW ki n k ; P k (W 0 =W 1jfX i g) = k X i=1 k i 1 (n 1) ki+1 n k W i nW ki n k : Proof. Let the random variable Y be the number of ones in the k coordinates chosen. Then P k (W 0 =W +1jY =i) is the probability ofi+1 ones in the binomial distribution with parameters k and p = 1=n, which implies P k (W 0 =W + 1jfX i g) = k1 X i=0 P k (W 0 =W + 1jY =i)P k (Y =i) = k1 X i=0 k i + 1 (n 1) ki1 n k W i nW ki n k : Conditioning and summing over Y also yields the expression for P k (W 0 = W 1jfX i g) in the lemma. For the remainder of the section dene c k = n k n n 1 k1 : (3.22) The next two lemmas prove a useful property of the constant c k . 77 Lemma 3.36. 1c k P k (W 0 =W + 1jfX i g) 0. Proof. Conditioning and summing over the random variableY equal to the number of ones chosen in the k coordinates, P k (W 0 =W + 1jfX i g) = k1 X i=0 k i + 1 n 1 n ki1 1 n i+1 P k (Y =i) = k1 X i=0 k n(i + 1) n 1 n k1 k 1 i 1 n 1 i P k (Y =i) k1 X i=0 k n n 1 n k1 P k (Y =i) k n n 1 n k1 : Here the rst inequality follows from the fact that kn. Lemma 3.37. Wc k P k (W 0 =W 1jfX i g) 0. Proof. ForW = 0, the lemma is trivially true. ForW6= 0, we condition and sum over the random variable Y equal to the number of ones chosen in the k coordinates to obtain P k (W 0 =W 1jfX i g) = k X i=1 P k (W 0 =W 1jY =i) " W i nW ki n k # = k X i=1 k i 1 n 1 n ki+1 1 n i1 Wk n " W1 i1 nW ki i n1 k1 # Wk n n 1 n k1 k X i=1 k i 1 1 n(n 1) i2 " W1 i1 (n1)(W1) (k1)(i1) n1 k1 # Wk n n 1 n k1 : To see the nal inequality, notice that for each summand, the second part of the product is a probability of the hypergeometric distribution and the rst part of the product is at most one. 78 The previous two lemmas show that forc =c k dened by (3.22), the terms within the absolute values in (3.19) are positive. Under this constraint, note that the error terms are decreasing in c k and that 1c k P k (W 0 =W + 1jW = 0) = 0. These observations imply that among constants satisfying Lemmas 3.36 and 3.37, the error from Theorem 2.13 is minimized for each k when c k is dened as (3.22). As discussed in the previous section, this is a natural way to choose the constant in the approximation that allows for the analysis done here. We pause here to show in a simple example the dierence in the error terms from Theorem 2.13 when choosing the constantc according to the two approaches outlined in Section 3.6. First, we will determine the error terms using the strategy we take here in the case k = 1. For the following computations, recall that p = 1=n. We have 1c 1 P 1 (W 0 =W + 1jfX i g) = 1 (nW )p =pW; Wc 1 P 1 (W 0 =W 1jfX i g) =WW (1p) =pW; which implies the error from Theorem 2.13 is equal to 2p (from [22],C = 1 for 1). 
Choosing instead c 0 1 = n=(1p), so that c 0 1 P 1 (W 0 = W + 1) = 1 (this was discussed as the alternative system of choosing c), we obtain 1c 0 1 P 1 (W 0 =W + 1jfX i g) = 1 (nW )p 1p = p(W 1) 1p ; Wc 0 1 P 1 (W 0 =W 1jfX i g) =WW = 0; which implies the error from Theorem 2.13 is equal to pEjW 1j 1p p p 1p : 79 In the limit, the two error terms dier in quality only by a constant andc 1 is asymptotically equal to c 0 1 . Although the Cauchy-Schwarz approach using c 0 1 asymptotically yields a better constant, computing the appropriate moment information for general k using this scheme is much more dicult than the strategy we choose. Also, this small example suggests that the Cauchy-Schwarz approach will yield superior asymptotic rates only in the constant, so that it is not worth the extra eort of computing the more complicated (and higher) moment information needed in order to undertake the type of analysis presented in this paper. Finally, we note that using the chain here (with k = 1), it is possible to use intermediate terms in the proof of Theorem 2.13 with the constant c 1 to obtain the superior upper bound of p (as we did in Section 2.4), however this approach does not carry over to the chains with larger step size. Moving forward, in order to apply the theorem, we need to take the expected value of the terms in Lemma 3.35. The next lemma has a nice expression for the expectation we need. Lemma 3.38. E W i nW ki i!(ki)! =n (nk + 1) (n 1) ki n k : Proof. E(s W r nW ) =r n E s r W = s n +r n 1 n n : (3.23) Takingi derivatives with respect tos andki derivatives with respect tor of (3.23) and evaluating at r =s = 1 implies the lemma. The nal lemma in this section establishes bounds on the error from Theorem 2.13. 80 Lemma 3.39. Both of E[c k P k (W 0 =W + 1jfX i g)] and E[c k P k (W 0 =W 1jfX i g)] are bounded above by n 1 n : Proof. From Lemmas 3.35 and 3.38 we have E[c k P k (W 0 =W + 1jfX i g)] =c k k1 X i=0 k i + 1 k i (n 1) 2k2i1 n 2k =c k n 1 n k k n k1 X i=0 k 1 i 2 (n 1) ki1 n k1 (n 1) i k (i + 1)(ki) = n 1 n k1 X i=0 k 1 i (n 1) ki1 n k1 k 1 i 1 (n 1) i k (i + 1)(ki) n 1 n : The inequality follows from the fact that each summand is the product of two terms no larger than one and a probability. For the remaining term, exchangeability implies E[c k P k (W 0 =W 1jfX i g)] =E[c k P k (W 0 = W + 1jfX i g)], which proves the lemma. Theorem 3.40. For the values of c k dened previously, the error term from Theorem 2.13 is minimized for k = 1 and is equal to 2p. Proof. The nal bound was computed previously in this section and the fact that it is minimum follows directly from Lemma 3.39 and the easily veried fact E[c 1 P 1 (W 0 =W + 1jfX i g)] =E[c 1 P 1 (W 0 =W 1jfX i g)] = n 1 n : 81 3.8 Negative Binomial Distribution The nal example presented in this paper is the approximation of the negative binomial distri- bution by the Poisson. A random variable X has the geometric distribution with parameter p if P(X = i) = (1p) i p for all non-negative integers i. Classically, the random variable X is viewed as the number of failures before the rst success in a sequence of independent Bernoulli trials each with probability of success equal to p. The random variable W is negative binomial with parameters r and p if W = P r i=1 X i , where the X i are independent geometric random variables with parameter p. By viewing W as the number of failures before r successes have occurred in a sequence of Bernoulli trials, it is easy to see that for all non-negative integers i, P(W =i) = r+i1 i (1p) i p r . 
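As a numerical companion to Theorem 3.40 (not part of the original text), the exact total variation distance between the binomial distribution with $p = 1/n$ and the Poisson distribution with mean one can be compared with the bound $2p = 2/n$:

```python
import numpy as np
from scipy.stats import binom, poisson

# The exact total variation distance between Bin(n, 1/n) and Poi(1) is indeed below
# the bound 2p = 2/n obtained from Theorem 2.13 with the one-coordinate chain.
for n in (10, 50, 250, 1000):
    k = np.arange(5 * n)                           # support large enough for both masses
    dtv = 0.5 * np.abs(binom.pmf(k, n, 1.0 / n) - poisson.pmf(k, 1.0)).sum()
    print(n, round(dtv, 6), "<=", round(2.0 / n, 6))
```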
We will use Theorem 2.13 to approximate W by Poi() where is the mean of W equal to r(1p)=p (the mean of a geometric random variable is (1p)=p). For xed , p =r=( +r) so that P(W =i) = i i! (r +i 1)! (r 1)!( +r) i 1 + r r : (3.24) As r goes to innity, the distribution converges to a Poisson distribution with mean . However, for xed , p approaches one as r goes to innity, so that when p is small the negative binomial will not be approximately Poisson. Because of this fact, in this example we will not obtain a result as straightforward as Theorem 3.40. For some values of p, the optimal error term does not occur with the smallest step size. We will prove all supporting lemmas for general p, but the nal theorem will have a natural restriction on the value of p. For this case we will show that among the exchangeable pairs associated with the family of Markov chains described in Section 3.6 the term from Theorem 2.13 is minimized when k = 1. First we will prove some lemmas that will be used to compute the error term from the theorem. 82 Lemma 3.41. LetP k denote probability under the chain that substitutesk coordinates as described in Section 3.6. Then P k (W 0 =W + 1jfX i g) = r k 1 X fi1;:::;i k g f1;:::;rg k + P j X ij k 1 (1p) ( P j Xi j +1) p k P k (W 0 =W 1jfX i g) = r k 1 X fi1;:::;i k g f1;:::;rg k + P j X ij 2 k 1 (1p) ( P j Xi j 1) p k : Proof. This follows immediately from conditioning and summing over the subset of (X 1 ;:::;X r ) chosen. For the remainder of the section dene c k =a k k k +a k 2 k 1 (1p) a k p k1 1 (3.25) where a k is the maximum of one and integer part of (k2)(1p) p . The next lemma states a useful property of a k ; the proof can be found in [56], but it is elementary so we will include it. Lemma 3.42. Let r be a positive integer, x any non-negative integer, and a be the integer part of (r1)(1p) p . If f(y) = r+y1 r1 (1p) y p r , then f(x)f(x 1) for xa and strictly decreasing otherwise. In particular, a is the mode of a negative binomial random variable with parameters r and p. Proof. The ratio of f evaluated at consecutive integers is given by f(x + 1) f(x) = x +r x + 1 (1p): 83 Comparing this ratio to one implies the lemma. The next two lemmas prove a useful property of the constant c k as dened by (3.25). Lemma 3.43. c k P k (W 0 =W + 1jfX i g) 0: Proof. c k P k (W 0 =W + 1jfX i g) = = a k k r k 1 X fi1;:::;i k g f1;:::;rg k+ P j Xi j k1 (1p) ( P j Xi j +1) p k k+a k 2 k1 (1p) a k p k1 : Deneb to be the maximum of one and the integer part of (k1)(1p) p and note thatb = 1 implies a k = 1. Lemma 3.42 implies c k P k (W 0 =W + 1jfX i g) a k k r k 1 X fi1;:::;i k g f1;:::;rg k+b1 k1 (1p) b p k k+a k 2 k1 (1p) a k p k1 = (k +b 1)p k k+b2 k2 (1p) b k+a k 2 k2 (1p) a k : The nal inequality follows from the denition ofb and by applying Lemma 3.42 withr =k1. 84 Lemma 3.44. Wc k P k (W 0 =W 1jfX i g) 0. Proof. The cases W = 0 or k = 1 are simple to verify, so assume otherwise. By Lemma 3.41 and the denition of c k , c k P k (W 0 =W 1jfX i g) = X fi1;:::;i k g f1;:::;rg P j X ij r 1 k 1 k + P j X ij 2 k 2 (1p) ( P j Xi j ) p k1 k +a k 2 k 2 (1p) a k p k1 r 1 k 1 1 X fi1;:::;i k g f1;:::;rg 0 @ X j X ij 1 A = r X i=1 X i =W: An application of Lemma 3.42 with r =k 1 implies the inequality. The previous two lemmas show that forc =c k dened by (3.25), the terms within the absolute values in (3.19) are positive. 
Also note that c k P k (W 0 =W 1jX 1 =a k ;X i = 0;i> 1) =a k =W6= 0; so that among constants satisfying Lemmas 3.43 and 3.44, the error from Theorem 2.13 is mini- mized for each k when c k is dened as (3.25). To apply the theorem, we need to take the expected value of the term in Lemma 3.41. The next lemma has a nice expression for the expectation we need. Lemma 3.45. If Y is a random variable distributed as negative binomial with parameters p and k, then E k +Y k 1 (1p) Y = P k1 l=0 k l+1 k+l1 l (1p) 2 p(2p) l (2p) k : 85 Proof. By the denition of expected value, E k +Y k 1 (1p) Y = X i0 (1p) i k +i k 1 (1p) i k +i 1 k 1 p k = P i k+i k1 k+i1 k1 ((1p) 2 ) i (p(2p)) k (2p) k : (3.26) If Z is a random variable distributed as negative binomial with parameters and q = p(2p) and k, then (3.26) can be written as E h k+Z k1 i =(2p) k . Using the fact that Z is the sum of independent geometric random variables we have E(s k+Z ) =s k q 1 (1q)s k =q k s k 1 1 (1q)s k : Taking k 1 derivatives with respect to s and dividing by (k 1)! implies E k +Z k 1 s Z+1 = q k (k 1)! k1 X l=0 k 1 l k! (l + 1)! s l+1 (1q) l (k +l 1)! (k 1)! 1 1 (1q)s k+l : (3.27) Finally, substituting s = 1 into (3.27) implies the lemma. The nal results of this section will be stated in two cases. The rst case will pertain to \small" values of k where (k 1)(1p)=p< 1, and the \large" case to all other values of k. For xed k and , the small case is in some sense the typical case as p should be near one in order for W to be approximately Poisson. In this case, there is no need for further restrictions on the value of p in order to prove results analogous to the previous section. However, in the large case additional assumptions will be made. We will rst state and prove results for the small case, then discuss the additional assumptions and prove results for the large case. 86 Lemma 3.46. For (k1)(1p) p < 1, both of E[c k P k (W 0 = W + 1jfX i g)] and E[c k P k (W 0 = W 1jfX i g)] are bounded above by r(1p) 2p : Proof. For k = 1; 2, the lemma is easy to verify, so assume k 3. Let Y be a random variable distributed as negative binomial with parameters p and k. Then we have c k E[P k (W 0 =W + 1jfX i g)] =c k (1p)p k E k +Y k 1 (1p) Y =c k (1p)p k P k1 l=0 k l+1 k+l1 l (1p) 2 p(2p) l (2p) k (3.28) = r(1p) 2p 1 k(2p) k1 k1 X l=0 k l + 1 1p p l k +l 1 k 1 1p 2p l : Noting rst that (k 1)(1p)< 1, an application of Lemma 3.42 implies k+l1 k1 1p 2p l is at most one, yielding the following inequality. c k E[P k (W 0 =W + 1jfX i g)] r(1p) 2p 1 k(2p) k1 k1 X l=0 k l + 1 1p p l = r(1p) 2p 1p k k(1p)(p (2p)) k1 : From the previous lines, it is enough to show for all 3kr, the following term is at most one: 1p k k(1p)(p (2p)) k1 : (3.29) 87 The dierence of (3.29) applied at k + 1 and k is positively proportional to k k X i=0 p i ! (k + 1)p (2p) k1 X i=0 p i ! : (3.30) We will show that this dierence is at most zero which implies (3.29) is decreasing in k so that it is enough to show the lemma holds in the case where k = 3. Notice that (k1)(1p) p < 1 implies k< 1=(1p), so that k k X i=0 p i ! (k + 1)p (2p) k1 X i=0 p i ! = k1 X i=0 p i ! (k(1p) 2 p (2p)) +kp k < k1 X i=0 p i ! (1 3p +p 2 ) +kp k : (3.31) The smallk condition fork 3 implies in particular that 2=3<p< 1 so that 1 3p +p 2 < 0 which, starting from (3.31), yields k k X i=0 p i ! (k + 1)p (2p) k1 X i=0 p i ! 
< (1 3p +p 2 ) + (1 3p +p 2 )(k 1)p k1 +kp k = (1 3p +p 2 ) +p k1 (k(1p) 2 (1 3p +p 2 )) 1 3p +p 2 +p k (2p) 1 3p +p 2 + 2p 3 p 4 : (3.32) The penultimate inequality follows from the fact noted above that k < 1=(1p), and the nal inequality since k 3. From this point it is a straightforward calculus exercise to show the nal term in (3.32) is negative for 2=3<p< 1. 88 For the remaining term, exchangeability implies E[c k P k (W 0 =W 1jfX i g)] =E[c k P k (W 0 =W + 1jfX i g)]; (3.33) which proves the lemma. In order to prove a result analogous to Lemma 3.46 for the case of k 1=(1p), the value of p will be restricted. Ideally, the values ofp under consideration should coincide with the values of p where the Poisson distribution is a good approximation to the negative binomial distribution. First note thatVar(W ) = + 2 =r so that it is not unreasonable to assume that 2 r. We will instead use a stronger restriction; for the remainder of the section take 3 2 e r. This may seem like a demanding constraint, but from (3.24) it is clear that in order for the negative binomial distribution to resemble a Poisson distribution, e should be close to (1 +=r) r . The next lemma shows that the assumption on r is not unreasonable in lieu of the previous statement; it can be proved by standard analysis using the Taylor expansion of the appropriate functions. Lemma 3.47. For r> 2 > 0, e 2 7r e 1 + r r e 2 2r : Remark. It has been shown [16] thatjjWPoi jj TV =r, which implies that for some values ofp andr the restriction 3 2 e r is an overly demanding constraint. It is an interesting problem to consider what is the minimum constraint that will yield results analogous to Section 3.7 and how it relates to the proximity of the negative binomial to the Poisson distribution. Lemma 3.48. For 3 2 e r and k 1=(1p), both of E[c k P k (W 0 = W + 1jfX i g)] and E[c k P k (W 0 =W 1jfX i g)] are bounded above by r(1p) 2p : 89 Proof. Let Y be a random variable distributed as negative binomial with parameters p and k. Then continuing from (3.28), for k 2, c k E[P k (W 0 =W + 1jfX i g)] = r(1p) 2p (k 1)(1p) k(2p) k1 k1 X l=0 k l + 1 1p p(2p) l k+l1 k1 (1p) l k+a k 2 k2 (1p) a k : Using the denition of a k , Lemma 3.42, and an argument similar to the use of the constant b in the proof of Lemma 3.43 (using the fact that k 1=(1p)), we obtain an upper bound of 1=p on the appropriate fraction in each summand, which yields the following inequality. c k E[P k (W 0 =W + 1jfX i g)] r(1p) 2p (k 1) k(2p) k2 k1 X l=0 k l + 1 1p p(2p) l+1 = r(1p) 2p (k 1) k(2p) k2 (1p) (2p)p + 1 k 1 ! : From the previous lines, it is enough to show for all 2kr, (k 1) k(2p) k2 (1p) (2p)p + 1 k 1 ! 1: (3.34) The ratio of successive terms is equal to k 2 k 2 1 1+pp 2 p(2p) 2 1+pp 2 p(2p) k (2p) 1 1+pp 2 p(2p) k 1 ; 90 which is at least one. Thus it is enough to show the inequality (3.34) fork =r. Now, substituting p =r=( +r) into (3.34) yields r 1 r 1 2 +r r2 1 + ( +r) r(2 +r) r 1 e 2 +r +r 2+2 e 1 : (3.35) For the inequality we use the fact that (1 +x=n) n (1 +x=(n + 1)) n+1 e x if n +x is positive and n 1. 
The term (3.35) is clearly decreasing in r; by using the restriction on the value of r and then the inequality log(1 +x)x, we have r 1 r 1 2 +r r2 1 + ( +r) r(2 +r) r 1 exp (2 + 2) log 1 + 1 1 + 3e 1e exp 2 + 2 1 + 3e 1e : (3.36) Taking the natural logarithm of (3.36), we have 2 + 2 1 + 3e + log 1e = 2 + 2 1 + 3e X i1 e i i : (3.37) The nal expression is smaller than any partial sum, and it is easy to see by only taking one term in the sum (3.37) is negative for 2, and taking three terms yields the proper inequality for 1. For the remaining term, the equation (3.33) continues to hold in this case, which proves the lemma. Theorem 3.49. For the values of c k dened previously, and the set of k where either 3 2 e r and k 1=(1p) or k< 1=(1p), the error term from Theorem 2.13 is minimized for k = 1. 91 Proof. This follows directly from Lemmas 3.46 and 3.48 and the easily veried fact E[c 1 P 1 (W 0 =W + 1jfX i g)] =E[c 1 P 1 (W 0 =W 1jfX i g)] = r(1p) 2p : 92 Chapter 4 Theorems and Applications This chapter contains new Poisson and translated Poisson approximation theorems using ex- changeable pairs and also discusses the utility and applications of Theorem 4.8 proved in Section 1.1. In Section 4.1, we apply a novel exchangeable pairs Poisson approximation theorem similar to Theorem 2.15 to the Poisson approximation of the number of i-cycles in a random permutation. In that section we also prove a lemma that will be used in Section 4.2 to prove a new exchange- able pairs translated Poisson approximation theorem. In Section 4.3, we apply the \smoothing" inequality of Theorem 4.8 to a technical term that arises in many Stein's method discrete normal approximation theorems including the main result of Section 4.2. Admittedly, a deciency of this chapter is that the the examples are mostly simple or non- existent. The potential scope of these theorems appears to be much larger than the examples presented here; we hope to address this in future research. 4.1 Poisson Approximation The rst result we state is a more general version of Theorem 2.15 proved previously. The added generality will be used in an example below. 93 Theorem 4.1. Let W a random variable with non-negative integer support andY a random vector such that W = G(Y) for some function G. Also, let E(W ) = , Var(W ) = 2 and (W;W 0 ) an exchangeable pair of real random variables such that E Y (W 0 W ) =a(W) +R(Y); with 0<a 1. Then with Z a Poisson random variable with mean and c any constant, d TV (W;Z) p Var(E W [(W 0 W )(W 0 W + 1)]) 2a + 1 2 + EjR(Y)j a 1=2 + Ej(Wc)R(Y)j a (4.1) + E[jW 0 Wj 3 jW 0 Wj] 2a : Proof. First, recall the denitions and notation of the proof of Theorem 2.15. A minor modica- tion of Lemma 2.4 implies, for any bounded f, E [(W 0 W )(f(W 0 )f(W ))] 2aE [(W)f(W )] + 2E[f(W )R(Y)] = 0; (4.2) which leads to d TV (W;Z) = sup AZ + jEA P f A (W )j sup AZ + jEA P f A (W )EA 0 f A (W )j + sup AZ + E f A (W )R(Y) a : (4.3) 94 By Lemma 2.11, (4.3) is bounded above by EjR(Y)j a 1=2 , and the remaining term of 4.1 not present in the error of Theorem 2.15 arises from the fact that the allowance of the additional remainder implies E(W 0 W ) 2 = 2a 2 2E[(Wc)R(Y)]; where c is any constant (note (4.2) impliesER(Y) = 0). Remarks. 1. The hypotheses of Theorem 4.1 dier from those of Theorem 2.15 only in relaxing the linearity condition to allow the remainder R(Y), which aects the error by contributing extra terms. 
Moreover, as the proof indicates, any of the theorems stated requiring the linearity condition (Theorems 2.5, 2.15, and 3.1 above and Theorems 4.6 and 4.7 below) can be relaxed to the hypotheses here at the cost of adding terms similar to the nal two terms of (4.1). 2. For Poisson approximation, the mean and the variance of W should be approximately and not growing with n, some natural parameter. In most applications, a is approximately n 1 , which implies that if the bound is to be order n 1 (the usual bound for Poisson approximation)EjR(Y)j must be of order n 2 . In order to apply Theorem 4.1, it is crucial to have good upper bounds on the variances in the error term. The following proposition can be useful in obtaining these bounds. Proposition 4.2. For random variables Y 1 ;:::;Y k , we have Var 0 @ k X j=1 Y j 1 A (2k 1) k X j=1 Var(Y j ); (4.4) 95 Proof. Without loss of generality, assume Var(Y 1 )Var(Y 2 ):::Var(Y k ). We have Var 0 @ k X j=1 Y j 1 A = k X j=1 Var(Y j ) + 2 X i<j Cov(Y i ;Y j ) k X j=1 Var(Y j ) + 2 X i<j q Var(Y i )Var(Y j ) k X j=1 Var(Y j ) + 2 X i<j Var(Y j ); where the rst inequality is by the Cauchy-Schwarz inequality, and the second by the ordering of the variances. Noting that P i<j Var(Y j ) (k 1) P k j=1 Var(Y j ) proves the lemma. The next result will use Theorem 4.1 to derive an error term in the approximation of the number of i-cycles in a random permutation by the Poisson distribution. 4.1.1 i-cycle Example Let be chosen uniformly at random fromS n , the symmetric group onn symbols, and letW i () be the number of i-cycles of . It is well known that W i is approximately Poisson: for Z i a Poisson random variable with mean 1=i, d TV (W i ;Z i ) tends to zero with i=n super exponentially [4]. In fact, the comparison of the cycle structure of permutations to Poisson limits has been well studied and generalized, see [1, 5]. We will use Theorem 4.1 to show a sub-optimal bound of linear order ini=n; in the case of xed points this is the typical order that has been found using Stein's method [10, 22]. First we dene the exchangeable pair used to apply Theorem 4.1. First note that the Markov chain dened by the rule of multiplying by a random transposition is reversible with respect to the uniform distribution onS n . Thus, by Proposition 1.2, if is a random transposition and 0 = where is chosen uniformly from S n , then the pair (W i ;W 0 i ) = (W i ();W i ( 0 )) is exchangeable. 96 The next few results collect some facts about (W i ;W 0 i ), the rst contains expressions for expectations needed in order to apply Theorem 4.1; it is proved using the cycle index generating function. Lemma 4.3. [35] Let b 1 ;:::;b k non-negative integers such that P j jb j n, and X 1 ;:::;X k independent Poisson random variables with means 1; 1=2;:::; 1=k, respectively. Then E 2 4 k Y j=1 W bj j 3 5 = k Y j=1 E h X bj j i : Now, let W = (W 1 ;:::;W n ) and notice thatW i is a function of W. The next lemma expresses the probabilities that will be used in the remainder of the example in terms of W. Lemma 4.4. Let P k W =P(W 0 i =W i +kjW). Then P 1 W = iW i (niW i ) +W i i 2 n 2 ; P +1 W = i1 X j=1 j(ij)W j W ij 2 n 2 W i=2 i 2 2 2 n 2 + n 2iW 2i iW i n 2 i1 X j=1 jW j n 2 ; (4.5) P +2 W = iW 2i n 2 ; P 2 W = i 2 W i (W i 1) n(n 1) ; P k W = 0;jkj> 2 Proof. Multiplication of a permutation with a transposition aects the cycle structure of by either splitting one cycle into two, or joining two cycles. 
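As an aside from the proof, the splitting and joining behaviour just described is easy to observe computationally. The short sketch below is ours and only an illustration; it fixes one composition convention (the split/merge dichotomy is the same under either convention) and records the cycle type of a random permutation before and after multiplication by a transposition.

# A transposition either splits one cycle into two or merges two cycles, so the total
# number of cycles always changes by exactly one.  Permutations are stored as dicts
# i -> pi(i) on {0, ..., n-1}.
import random
from collections import Counter

def cycle_type(perm):
    seen, counts = set(), Counter()
    for start in perm:
        if start in seen:
            continue
        length, j = 0, start
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        counts[length] += 1
    return counts

def multiply_by_transposition(perm, a, b):
    # Apply perm first, then swap the two values a and b.
    return {i: (b if perm[i] == a else a if perm[i] == b else perm[i]) for i in perm}

random.seed(1)
n = 10
images = list(range(n)); random.shuffle(images)
pi = dict(enumerate(images))
a, b = random.sample(range(n), 2)
print("cycle type before:", sorted(cycle_type(pi).items()))
print("cycle type after :", sorted(cycle_type(multiply_by_transposition(pi, a, b)).items()))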
The former operation occurs when the letters in are contained in the same cycle in , while the latter occurs when the letters in are in dierent cycles of . For example, multiplying right to left, (a 1 ;:::;a k )(b 1 ;:::;b j )(a s ;b t ) = (a 1 ;:::;a s ;b t+1 ;:::;b j ;b 1 ;:::;b t ;a s+1 ;:::a k ): 97 To compute P +1 W , note that in order for the number of i-cycles of to increase by one with the multiplication, the transposition must either join two smaller cycles or split a larger cycle of length not equal to 2i (or else the number ofi-cycles would increase by two). The rst term of (4.5) corresponds to joining two cycles, and the second term compensates in the case of joining two i=2 cycles (if i is even). The third and fourth term of (4.5) together represent the splitting of cycles larger than i, but have been rewritten using the fact that P j jW j =n. The remaining equalities follow along similar lines of thought. Now, Lemma 4.4 implies E [W 0 i W i jW] = 2i n 1 W i 1 i + n 2 1 2 4 W i i 2 + i1 X j=1 jW j (ij)W ij 2 1 W i=2 2 i 2 2 3 5 : (4.6) By Lemma 4.3, E(W i ) = 1=i =, so that Theorem 4.1 can now be applied with a = 2i=(n 1), andR(W) given by (4.6). However, in this example,E(W i ) =Var(W i ) and 1 1=2 1, so that the bounds on the solution of the Stein equation given by Lemma 2.11 imply Theorem 4.1 can be modied to d TV (W i ;Z i ) p Var(E[(W 0 i W i )(W 0 i W i + 1)jW i ]) 2a (4.7) + EjR(W)j a + Ej(Wc)R(W)j a (4.8) + E[jW 0 i W i j 3 jW 0 i W i j] 2a ; (4.9) where Z i is a Poisson random variable with mean 1=i. 98 Using Lemma 4.4 and the fact [36] that if 1 and 2 are sigma algebras with the property that 2 1 , then Var(E[Yj 2 ])Var(E[Yj 1 ]), we have Var(E[(W 0 i W i )(W 0 i W i + 1)jW i ])Var(E[(W 0 i W i )(W 0 i W i + 1)jW]) =Var(2P +1 W + 6P +2 W + 2P 2 W ) = n 2 2 Var i1 X j=1 j6=i=2 j(ij)W j W ij +W i=2 (W i=2 1) i 2 2 2 i X j=1 jW j + 2iW 2i +i 2 W i (W i 1) : Proposition 4.2 yields Var(E[(W 0 i W i )(W 0 i W i + 1)jW i ]) (4i + 1) n 2 2 i1 X j=1 j6=i=2 Var(j(ij)W j W ij ) +Var W i=2 (W i=2 1) i 2 2 ! (4.10) + i X j=1 Var(2jW j ) +Var (2iW 2i ) +Var(i 2 W i (W i 1)) : Applying Lemma 4.3 to (4.10) yields that if 4i n (so that Lemma 4.3 can be applied to Var(2iW 2i )), the term corresponding to (4.7) in the error from Theorem 4.1 is on the order of i=n. Next, notice E[jW 0 Wj 3 jW 0 Wj] = 6E[P +2 W +P 2 W ] = 12 n(n 1) ; 99 so that (4.9) contributes 3 ni to the error. Finally, choosing c = 0 in the second term of (4.8), an application of Lemma 4.3 coupled with the triangle inequality yields, for 4in, d TV (W i ;Z i ) O(i) n : Remarks. 1. The constraint 4in is probably not necessary to obtain an error term using Theorem 4.1, but in order to remove it, we would need to undertake a more careful analysis. Also, we have not computed the error term precisely here, but it would not be too dicult to do so from what has been presented. In either case, the result does not warrant the extra eort; the main point here is to illustrate in a non-trivial example that Theorem 4.1 does indeed give reasonable bounds, which gives hope that it can be applied to obtain new results. 2. An error term of the same order can be obtained by Theorem 2.13. 4.1.2 Poisson Lemma The next result is another Poisson approximation theorem with more complicated terms than those of Theorem 4.1 (in fact, it is a generalization of that result). It will lead to our main translated Poisson approximation theorem in the next Section. Lemma 4.5. 
Let W a random variable with non-negative integer support, E(W ) = , and Var(W ) = 2 . Let (W;W 0 ) an exchangeable pair of real random variables such that E W (W 0 100 W ) =a(W) with 0<a 1. Then with Z a Poisson random variable with mean and M a natural number, d TV (W;Z) p Var(E W [(W 0 W )(W 0 W + 1)]) 2a + 1 2 (4.11) + E I fjW 0 Wj>Mg jW 0 Wj 3 jW 0 Wj 2a + M 2 (M 1)d TV (W;W + 1) a (4.12) + M(M 1) 2 2a 3=2 1 + 1=2 + M(M 1) a M X jjj=2 q Var(P W (W 0 W =j)) Proof. Following the proof of Theorem 2.15 (where for the sake of brevity denegf A ), we have the bound below along the lines of (2.31) and (2.32), jEAg(W )j = E g(W ) (W 0 W )(g(W 0 )g(W )) 2a E g(W ) (W 0 W )(W 0 W + 1)) 2a (4.13) + 1 2a E [(W 0 W )(g(W 0 )g(W ) g(W )(W 0 W + 1))] : (4.14) The rst summand (4.13) can be bounded above by (4.11) exactly as the upper bound of (2.29) over the rst term in (2.31) was obtained. In order to simplify the proof that (4.14) is bounded above by the remaining terms in the error, let D =W 0 W . Then we obtain E[D(g(W 0 )g(W ) g(W )(D + 1))] =E[DI fjDj>Mg (g(W 0 )g(W ) g(W )(D + 1))] (4.15) +E[DI fjDjMg (g(W 0 )g(W ) g(W )(D + 1))] (4.16) =J 1 +J 2 ; where J 1 is (4.15) and J 2 is (4.16). 101 Following the proof of Theorem 2.15, it is not dicult to see thatjJ 1 j is no more than the rst term of (4.12). In order to show jJ 2 j is bounded above by the unaccounted terms in the error from the Theorem, notice thatAg(j) =I fj2Ag P(Z2A) implies g(j) = (j)g(j) +I fj2Ag ; and exchangeability implies E[DI fMD<0g (g(W 0 ) g(W ))] =E[Dg(W )]: These observations yield E [D(g(W 0 )g(W ) g(W )(D + 1))] =E 2 4 D 8 < : W 0 1 X i=W+1 I f0<DMg [g(i) g(W )] W1 X i=W 0 +1 I fMD<0g [g(i) g(W )] 9 = ; 3 5 =E 2 4 D W 0 1 X i=W+1 I f0<DMg (i)g(i) (W)g(W ) 3 5 (4.17) E " D W1 X i=W 0 +1 I fMD<0g (i)g(i) (W)g(W ) # +E 2 4 D W 0 1 X i=W+1 I f0<DMg I fi2Ag I fW2Ag 3 5 (4.18) E " W1 X i=W 0 +1 I fMD<0g I fi2Ag I fW2Ag # =J 2;1 +J 2;2 ; 102 where J 2;1 is (4.17) and J 2;2 is (4.18). NotejJ 2 jjJ 2;1 j +jJ 2;2 j, and that jJ 2;1 jE DI f0<DMg W 0 1 X i=W+1 (iW )g(i) (W)(g(W )g(i)) +E DI f0>DMg W1 X i=W 0 +1 (iW )g(i) (W)(g(W )g(i)) =J + 2;1 +J 2;1 : Next, we have J + 2;1 E 2 4 jDjI f0<DMg W 0 1 X i=W+1 0 @ j(iW )g(i)j +j(W) iW1 X j=0 g(W +j)j 1 A 3 5 E 2 4 jDjI f0<DMg W 0 1 X i=W+1 jD 1j 1=2 + j(D 1)(W)j 3 5 (4.19) M(M 1) 2 3=2 E 1 + jWj 1=2 I f0<DMg : The second inequality follows by Lemma 2.11. Nearly identical computations can be used to obtain J 2;1 M(M 1) 2 3=2 E 1 + jWj 1=2 I f0>DMg : (4.20) Combining (4.19) and (4.20) and an application of the Cauchy-Schwarz inequality yields jJ 2;1 jJ + 2;1 +J 2;1 M(M 1) 2 3=2 1 + 1=2 : 103 For the nal term, we have jJ 2;2 j E 2 4 DI f0<DMg W 0 1 X i=W+1 I fi2Ag I fW2Ag 3 5 + E " DI f0>DMg W1 X i=W 0 +1 I fi2Ag I fW2Ag # = M1 X i=1 E DI fi+1DMg I fW+i2Ag I fW2Ag (4.21) + M1 X i=1 E DI fMDi1g I fWi2Ag I fW2Ag : (4.22) To bound (4.21), notice that for 1i (M 1) and C + i =E[DI fi+1DMg ], jE[DI fi+1DMg [I fW+i2Ag I fW2Ag ]]j C + i jP(W +i2A)P(W2A)j + E (DI fi+1DMg C + i ) I fW+i2Ag I fW2Ag iC + i d TV (W + 1;W ) +E E W DI fi+1DMg C + i iC + i d TV (W + 1;W ) + M X j=i+1 j q Var (P W (D =j)): Similarly, For 1i (M 1) and C i =E DI fMDi1g , jE[DI fMDi1g [I fWi2Ag I fW2Ag ]]j ijC i jd TV (W + 1;W ) + M X j=i+1 j q Var (P W (D =j)): The theorem now follows from the fact that both C + j and C j are bounded above by M. 104 4.2 Translated Poisson Distribution The translated Poisson distribution has recently been used as a target distribution for approxi- mation of integer valued random variables. 
It is dened as a Poisson distribution shifted to better match rst and second moments and, by considering characteristic functions (see the discussion of [30]), is a candidate for a discrete version of the normal distribution. Typically, integer valued random variables converging to the normal distribution have variances approaching innity, so that convergence to the translated Poisson distribution will not occur. However, the utility of the distribution is that it can be used for approximation of integer valued random variables in total variation distance, where the normal distribution is useless. Other candidates for a discrete analog of the normal distribution have been considered; notable examples include the signed, translated, compound Poisson distribution [12], the discrete zero bias distribution [53], and the symmetric, centered binomial distribution [72]. The rst two apply mainly to sums of independent integer valued random variables, while the last applies to sums of integer valued random variables with local dependency structures. Bearing this in mind, our aim is to prove an exchangeable pairs translated Poisson approxima- tion theorem as similar as possible to the preexisting exchangeable pairs normal approximation theorems. Ideally, such a theorem would provide nearly out the door error in approximation to the translated Poisson distribution in many examples where the error in approximation to the normal distribution has already been computed using exchangeable pairs. This program has previously been developed [70], but the main theorem there has restrictions that will be lifted here at the cost of additional error terms. We now make these ideas more precise. In order to rigorously dene the translated Poisson distribution, let s =b 2 c, and = 2 s. We say the random variableZ has the translated Poisson distribution with parameters and 2 if Zs is a Poisson random variable with mean 2 + . Notice that E(Z) = and 2 Var(Z)< 2 + 1, where equality holds if 2 is integer valued. 105 From this denition, the most obvious strategy for approximating a random variable W by a translated Poisson distribution in total variation distance is to approximate Ws by the Poisson Zs, as the metric is translation invariant. Ideally, existing Poisson approximation results could easily be applied in this way. However, most Stein's method Poisson approximation theorems rely on decomposingW as a sum of indicator random variables, from which it is unclear how to proceed with this strategy (the main technical issue being how to distribute the translation s among the summands while retaining a lattice structure). Refreshingly, the method of exchangeable pairs is well suited to translated Poisson approximation, as the translation does not aect terms of the form W 0 W (that is, if (W;W 0 ) is an exchangeable pair, then so is (Ws;W 0 s)). Any of the previous exchangeable pairs Poisson approximation results (Theorems 2.13, 2.15, and 4.1 and Lemma 4.5) can easily be adapted to translated Poisson approximation. The next two theorems illustrate this point. We state the rst mainly to illustrate the method of proof to convert exchangeable pairs Poisson approximation theorems into translated Poisson approximation theorems. Theorem 4.6. Let W a random variable with integer support, E(W ) = , and Var(W ) = 2 . Let (W;W 0 ) an exchangeable pair of real random variables such that E W (W 0 W ) =a(W) with 0<a 1. 
Then with Z a translated Poisson random variable with parameters and 2 , d TV (W;Z) p Var(E W [(W 0 W )(W 0 W + 1)]) 2a 2 + E[jW 0 Wj 3 jW 0 Wj] 2a 2 + 2 2 : (4.23) Proof. By Theorem 2.15, we have d TV (Ws;Zs) p Var(E W [(W 0 W )(W 0 W + 1)]) 2a( 2 + ) + 1 2 2 + + E[jW 0 Wj 3 jW 0 Wj] 2a( 2 + ) +P(Ws< 0): 106 The theorem now follows because total variation distance is translation invariant, 0 < 1, and as shown in [70], Chebyshev's inequality yields P(Ws< 0) 1= 2 . Remarks. 1. If (W 0 W )2f1; 0; 1g then Theorem 4.6 reduces to the main result of [70]. 2. Theorem 4.6 bears a strong resemblance to Theorem 2.5 for normal approximation. The main dierence is that that the rst term of (4.23) has a factor of 2 in the denominator rather than the implicit factor of 3 in the denominator ofjW 0 Wj 3 in the second term of (2.11). The extra factor of is generally crucial in obtaining useful bounds in Theorem 2.5. Theorem 4.6 will be useful only in the case where the exchangeable pair changes by more than one with a small probability. The next result can be proved by applying Lemma 4.5 in a manner similar to the proof of Theorem 4.6. It generalizes the previous theorem and contains error terms of the correct order in typical situations (that is, the second remark above does not apply). Theorem 4.7. Let W a random variable with integer support, E(W ) = , and Var(W ) = 2 . Let (W;W 0 ) an exchangeable pair of real random variables such that E W (W 0 W ) =a(W) with 0 < a 1. Then with Z a translated Poisson random variable with parameters and 2 , and M a natural number, d TV (W;Z) p Var(E W [(W 0 W )(W 0 W + 1)]) 2a 2 + 2 2 (4.24) + E I fjW 0 Wj>Mg jW 0 Wj 3 jW 0 Wj 2a 2 + M 2 (M 1)d TV (W;W + 1) a 2 (4.25) + M(M 1) 2 a 3 + M(M 1) a 2 M X jjj=2 q Var(P W (W 0 W =j)): (4.26) Remarks. 1. Choosing M = 1 in the theorem recovers Theorem 4.6. 107 2. If (W 0 W )2f1; 0; 1g then Theorem 4.7 reduces to the main result of [70]. As described in the introduction, the program here is to derive an exchangeable pairs translated Poisson approximation theorem similar to exchangeable pairs normal approximation theorems in the literature. To wit, we compare Theorem 4.7 to Theorem 2.8, the normal approximation theorem stated at the end of Section 2.3. Let W be an integer valued random variable depending on some parameter n with mean n and variance 2 n . Assume Theorem 2.8 can be applied to yield a bound on the Kolmogorov distance between (W n )= n and a standard normal random variable, where theM of Theorem 2.8 is equal to M 0 = (so that the M of Theorem 4.7 will be M 0 ). Typically 2 n is of order n, a is of order n 1 , andjW 0 Wj is bounded by a constant with high probability. In this typical situation, the bound from Theorem 2.8 will be of ordern 1=2 . What can we say about the bound in Theorem 4.7 based only on these assumptions? The rst term of (4.24) diers slightly from the rst term of (2.21). However, using Proposition 4.2, we obtain Var[E W [(W 0 W )(W 0 W + 1)]] 3Var[E W (W 0 W ) 2 ] + 3Var[E W (W 0 W )]; and the hypotheses of the theorem imply Var[E W (W 0 W )] =a 2 2 n , a term of order n 1 . Thus this slight dierence between the two theorems will not change the order of the error. The nal term of (2.21) can be rewritten in our setup as (2a 2 n ) 1 E (W 0 W ) 2 I fjW 0 Wj>M 0 g , which should not dier greatly from the rst term of (4.25), as we assumejW 0 Wj is bounded by a constant with high probability. Also, in this setup, the rst term of (4.26) and the second term of (4.24) are of order n 1=2 and n 1 , respectively. 
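To keep the approximating distribution concrete throughout this comparison, the following small helper (ours, for illustration only) constructs the translated Poisson law exactly as defined at the beginning of this section and checks the stated mean and variance relations numerically.

# Z = s + Poisson(sigma^2 + gamma), where s = floor(mu - sigma^2) and
# gamma = mu - sigma^2 - s, so that E(Z) = mu and sigma^2 <= Var(Z) < sigma^2 + 1.
import math, random

def translated_poisson_sample(mu, sigma2, rng=random):
    s = math.floor(mu - sigma2)
    lam = sigma2 + (mu - sigma2 - s)
    # Poisson sampler by inversion of the cumulative distribution (adequate for moderate lam).
    u, k, term, cdf = rng.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > cdf:
        k += 1
        term *= lam / k
        cdf += term
    return s + k

random.seed(2)
mu, sigma2 = 7.3, 4.0
draws = [translated_poisson_sample(mu, sigma2) for _ in range(50000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print("mean:", round(mean, 3), "(target", mu, ")")
print("variance:", round(var, 3), "(should lie in [", sigma2, ",", sigma2 + 1, "))")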
The main dierence between the two theorems is the addition of the last term of (4.25) and (4.26). These terms quantify the \smoothness" of the distribution ofW . The termd TV (W;W +1) appears in other approximation theorems for discrete analogs of the normal distribution including 108 those of [12, 53, 72] mentioned in the introduction of this section. Heuristically, this term accounts for the fact that the total variation distance between W and a \nice" integer valued random variable should not be small ifW has a distribution not supported properly on the integer lattice. For example, if the support of W is a subset of the even integers, then we have d TV (W;Z) = 1 2 X j2Z jP(W =j)P(Z =j)j 1 2 X j2f2Z+1g jP(Z =j)j 1 4 ; and also d TV (W;W + 1) = 1. In the typical situation under study, we aim for d TV (W;W + 1) to be of order n 1=2 ; similar requirements are found in [12, 53, 72]. Finally, note that Theorem 4.8, which was proved in the Section 1.1, d TV (W;W + 1) can be bounded above in terms of p Var(P W (W 0 =W+1)) P(W 0 =W+1) and p Var(P W (W 0 =W1)) P(W 0 =W+1) , terms that t in well with the theorem. The second term of (4.26) does not appear in other formulations for normal approximation using exchangeable pairs, and thus is the main technical obstruction in applying Theorem 4.7. It can be dicult to bound in concrete settings, and in the typical situation under study should be of order n 1=2 . 4.3 Smoothing Inequality As mentioned in the previous section, many discrete distributional approximation theorems con- tain the error termd TV (W;W +1) (Theorem 4.7 above, and the main results of [12, 53, 72]). This term can be dicult to compute in concrete examples, especially if dependence plays a role. Here we present some examples where Theorem 4.8, proved in Section 1.1, can be be used to easily handle d TV (W;W + 1). We will usually use Theorem 4.8 in a more general form than stated in the introduction. 109 Theorem 4.8. LetW a random variable with integer support, (W;W 0 ) an exchangeable pair, and Y a random vector such that (W )(Y). Then d TV (W;W + 1) p Var[P Y (W 0 =W + 1)] + p Var[P Y (W 0 =W 1)] P(W 0 =W + 1) : Proof. The theorem follows from the proof presented in Section 1.1 and the fact [36] that if 1 and 2 are sigma algebras with the property that 2 1 , then Var(E[Yj 2 ])Var(E[Yj 1 ]). Remark. In the case thatjW 0 Wj 1 andE W (W 0 W ) =a(WE(W )), we have, using a similar argument starting from (1.1), the estimate d TV (W;W + 1) p Var[P W (W 0 =W + 1)] a 2 + 1 ; where 2 = Var[W ]. From Theorem 4.6, this same term (with some additional smaller order terms) is an upper bound for the total variation distance ofW to a translated Poisson distribution. The novel utility of Theorem 4.8 is that it can give results outside of the usual case of sums of independent integer valued random variables. However, something more specic can be stated for this case. If W = P n i=1 X i , with the X i independent and integer valued, exchangeable pairs are typically constructed by dening W 0 = WX I +X 0 I , where I is chosen independently and uniformly fromf1:::ng, and for eachi,X 0 i is an independent copy ofX i . Because the bounds in the theorem are usually small whenWW 0 is small (as in the analysis of Chapter 3), we will relax independence ofX 0 i in the construction above to the condition that (X i ;X 0 i ) are exchangeable. In this setting we have the following. 110 Corollary 4.9. 
If W = P n i=1 X i , a sum of independent integer valued random variables, and for each i, (X i ;X 0 i ) is an exchangeable pair, then d TV (W;W + 1) p P n i=1 Var[P(X 0 i =X i + 1jX i )] + p P n i=1 Var[P(X 0 i =X i 1jX i )] P n i=1 P(X 0 i =X i + 1) 2 p P n i=1 P(X 0 i =X i + 1) : Proof. After noting (W ) (fX i g n i=1 ), the rst inequality follows from Theorem 4.8 and the remarks preceding the statement of the corollary. The second inequality follows after noting that for a random variable Y such that 0Y 1,EY 2 EY which implies Var(Y )EY . Remark. The usual method of bounding d TV (W;W + 1), where W = P n i=1 X i , a sum of in- dependent integer valued random variables, uses the Mineka coupling [58] and yields the result found in [12]. However, a recent improvement has been developed [60] which uses concentration inequalities to obtain d TV (W;W + 1) r 2 " 1 4 + n X i=1 (1d TV (X i ;X i + 1)) # 1=2 : (4.27) In the next following, we treat the examples of the sum of Bernoulli trials, the coupon collector problem, and two runs in a sequence of Bernoulli trials. 4.3.1 The Sum of Bernoulli Trials In this section we will bound d TV (W;W + 1), where W = P n i=1 X i , and each X i is independent andP(X i = 1) = 1P(X i = 0) =p i . DeneW 0 =WX I +X 0 I , whereI is chosen independently 111 and uniformly fromf1:::ng, and for each i, X 0 i is an independent copy of X i . It is easy to see that we have P(W 0 =W + 1jfX i g) = 1 n n X i=1 (1X i )p i ; P(W 0 =W 1jfX i g) = 1 n n X i=1 X i (1p i ); so that E W (W 0 W ) = (WE(W )) n : By the remark following Theorem 4.8, d TV (W;W + 1) p P n i=1 p 3 i (1p i ) 2 + 1 2 : Although this example seems trivial, it is in fact far reaching because the bound is expressed entirely in terms of the variance of W . Thus, if it is known that a random variable has a representation as a sum of Bernoulli random variables, its aperiodicity can be deduced directly from the variance. In fact, the existence of such a representation for a random variable W is equivalent to the probability generating function of W having only real roots [64]. Moreover, many well studied random variables such as the number of cycles of a random permutation, the number of components of a random partition, and the hypergeometric distribution possess this property (for a much more complete list, see [64]). Now, to compare this result to (4.27), note that in this case d TV (X i ;X i + 1) = maxfp i ; (1p i )g 1p i (1p i ); 112 so that (4.27) yields d TV (W;W + 1) r 2 1 4 + 2 1=2 r 2 ! 1 ; an improvement of the result above only in constant. We mention here that in this case, Theorem 4.6 implies that it is not necessary to compute d TV (W;W +1) to obtain a translated Poisson approximation theorem. Also, as the computations above show, the main result here can be derived using existing (although more technical) theory. However, we have included this example to illustrate the facility by which Theorem 4.8 can obtain neat results. 4.3.2 Coupon Collector's Problem In the classical coupon collector's model, labeled coupons are chosen uniformly at random (with replacement) from n distinct coupons. For m2f0; 1;:::;n 1g, dene the random variable W to be the number of samples until the appearance of nm distinct coupons. If we dene the geometric random variable X with success rate p to have distribution P(X = k) = (1p) k1 p for k 1, then it is not dicult to see that W = P n i=m+1 X i , where each X i is an independent geometric random variable with success rate (i=n). 
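This representation is easy to confirm by simulation; the sketch below is again only an illustration with names of our own choosing. It draws coupons until n - m distinct ones have appeared and compares the average waiting time with the sum of the geometric means n/i.

# W is the number of draws needed to see n - m distinct coupons; in distribution it is
# the sum of independent geometrics with success rates (m+1)/n, ..., n/n, so that E(W)
# equals the sum of n/i over i = m+1, ..., n.
import random

def draws_until_distinct(n, m, rng=random):
    seen, draws = set(), 0
    while len(seen) < n - m:
        draws += 1
        seen.add(rng.randrange(n))
    return draws

random.seed(3)
n, m, reps = 60, 10, 20000
simulated = sum(draws_until_distinct(n, m) for _ in range(reps)) / reps
exact = sum(n / i for i in range(m + 1, n + 1))
print("simulated E(W):", round(simulated, 3), "   sum of n/i for i = m+1..n:", round(exact, 3))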
The asymptotic distribution of W as n tends to innity is well studied [13, 65]; the particular case of convergence to the normal distribution relates to Theorem 4.8 through translated Poisson approximation. We have the following result. Theorem 4.10. Form depending onn, if (nm)n 1=2 !1 andm!1 asn tends to innity, then (W)= converges in distribution to a standard normal random variable, where and 2 are the mean and variance of W , respectively. Using the method of [12], [65] proves that under the hypotheses of Theorem 4.10, the total variation distance betweenW and a compound translated Poisson random variable is smaller than an explicit term tending to zero. A key step in the analysis there (and in the method of [12] in 113 general) is to obtain rates on the convergence of d TV (W;W + 1) to zero (under the hypotheses of Theorem 4.10). In this section, we will use Corollary 4.9 to bound d TV (W;W + 1), and compare the bounds to those obtained in [65]. Theorem 4.11. For W = P n i=m+1 X i , where each X i is an independent geometric random variable with success rate (i=n), d TV (W;W + 1) 4 (nm)(nm 1) n 1=2 : Proof. In order to apply Corollary 4.9, we rst need to dene the exchangeable pairs (X i ;X 0 i ). One powerful method of creating reversible Markov chains (seemingly out of thin air) is to use the Metropolis algorithm [32], which takes any Markov chain dened on the state space of interest and transforms it into a Markov chain reversible with respect to a given distribution. We omit the details here, but starting from a Markov chain with state space the natural numbers (the support of a geometric random variable), which follows the rule of increasing by one with probability 1=2 and decreasing by one with probability 1=2 (or holding with probability 1=2 if currently at 1), the Metropolis algorithm yields the following probabilities. P(X 0 i X i = 1jX i ) = ni 2n ; X i 1; P(X 0 i X i =1jX i ) = 1 2 ; X i > 1; P(X 0 i = 1jX i = 1) = n +i 2n ; P(X 0 i X i = 0) = i 2n ; X i > 2: Regardless of construction, it is easy to check that the transition probabilities above are indeed reversible with respect to a geometric distribution with parameter i=n. Now, using the probabilities above in applying Corollary 4.9, we obtain 114 d TV (W;W + 1) 2 n1 X i=m+1 ni 2n ! 1=2 ; which is the theorem. In [65], the authors obtain that d TV (W;W + 1) is asymptotically of order (Var[W ]) 1=2 under the hypotheses of Theorem 4.10. The following proposition (the proof of which is elemen- tary) shows that for m=n bounded away from zero, Theorem 4.11 implies that d TV (W;W + 1) is asymptotically of order (Var[W ]) 1=2 . Using a dierent exchangeable pair might possibly yield the superior bound in the case that m=n tends to zero. 1 Proposition 4.12. [13] For W = P n i=m+1 X i , where each X i is an independent geometric ran- dom variable with success rate (i=n), we have the following asymptotic estimates of variance. 1. If m=n! 0, then Var(W )n 2 =m, 2. If m=n!c2 (0; 1), then Var(W ) 1c+c log(c) c n, 3. If m=n! 1, then Var(W ) (nm) 2 2n . 4.3.3 Two Runs in a Bernoulli Sequence LetW = P n i=1 X i X i1 , where eachX i is an independent Bernoulli random variable with success ratep i andp 0 = 0 (that isX 0 = 0). The example of two runs in a Bernoulli sequence was treated in [12] in the context of signed, translated, compound Poisson approximation. The method employed requires a bound ond TV (W;W +1), achieved there by a clever coupling construction; their results are the following. 
1 Preliminary experimentation suggests that starting from a nearest neighbor walk and applying the Metropolis algorithm in concert with Corollary 4.9 will not yield improved rates. A dierent strategy would be to create a reversible chain using orthogonal polynomials, similar to the construction of the chains of Chapter 3. 115 Lemma 4.13. [12] For W dened as above, we have d TV (W;W + 1) 4:6 " n X i=2 (1p i2 ) 2 p i1 (1p i1 )p i # 1=2 Here we obtain bounds of comparable order with less work using Theorem 4.8 in the following lemma. After proving the lemma, we will make a comparison to Lemma 4.13. Lemma 4.14. For W dened as above, we have d TV (W;W + 1) 2 p 3 " n X i=2 p i1 (1p i1 ) p i2 (1p i ) +p i (1p i2 ) # 1=2 : Proof. Dene W 0 by choosing an index i uniformly at random and replacing X i in W with an independent copy X 0 i . Then we have P(W 0 =W + 1jfX i g) = 1 n n X i=2 p i1 X i2 (1X i1 )(1X i ) + 1 n n X i=2 p i1 (1X i2 )(1X i1 )X i ; P(W 0 =W 1jfX i g) = 1 n n X i=2 (1p i1 )(1X i2 )X i1 X i + 1 n n X i=2 (1p i1 )X i2 X i1 (1X i ): For example, the rst equality follows because the only way to increase the number of two runs by one, is to choose the middle zero in a sequence 1; 0; 0 or 0; 0; 1 and change it to a one. From this 116 point we must compute the terms from Theorem 4.8; the rst term we compute is the denominator of the error. P(W 0 W = 1) =E [P(W 0 =W + 1jfX i g)] = 1 n n X i=2 p i1 p i2 (1p i1 )(1p i ) + 1 n n X i=2 p i1 (1p i2 )(1p i1 )p i : (4.28) For the next term, apply Proposition 4.2 to obtain Var[P(W 0 =W + 1jfX i g)] 3 n 2 Var " n X i=2 p i1 X i2 (1X i1 )(1X i ) # (4.29) +Var " n X i=2 p i1 (1X i2 )(1X i1 )X i # : (4.30) For (4.29), we have Var n X i=2 p i1 X i2 (1X i1 )(1X i ) = n X i=2 p 2 i1 Var [X i2 (1X i1 )(1X i )] (4.31) + 2 n X i=2 n X j=i+1 p i1 p j1 Cov [X i2 (1X i1 )(1X i );X j2 (1X j1 )(1X j )]: (4.32) Each summand of (4.31) contains a variance of an indicator function which can be evaluated as Var[X i2 (1X i1 )(1X i )] =p i2 (1p i1 )(1p i ) [1p i2 (1p i1 )(1p i )] p i2 (1p i1 )(1p i ): 117 For the term (4.32), note that for j >i + 2, the independence of the Bernoulli indicators implies the covariance in the summand will be zero. For j =i + 1;i + 2, we have E [X i2 (1X i1 )(1X i );X j2 (1X j1 )(1X j )] = 0; so that these covariance terms are negative (and thus can be ignored). Similar calculations on (4.30) yield Var[P(W 0 =W + 1jfX i g)] 3 n 2 n X i=2 p 2 i1 (1p i1 )[p i2 (1p i ) +p i (1p i2 )]; and a nearly identical straightforward analysis yields Var[P(W 0 =W 1jfX i g)] 3 n 2 n X i=2 p i1 (1p i1 ) 2 [p i2 (1p i ) +p i (1p i2 )]: Finally, noticing from the computation (4.28) that both of Var[P(W 0 = W 1jfX i g)] and Var[P(W 0 =W + 1jfX i g)] are bounded above by 3 n P(W 0 W = 1); and then applying Theorem 4.8 yields the lemma. The bound of Lemma 4.14 was obtained rather crudely so that the nal expression would have a form comparable to the bound of Lemma 4.13. Going back into the proof of the lemma, one could obtain a better bound by not throwing out as many terms as we did. That being said, by comparing individual summands (and the constants), it is easy to see that the upper bound 118 proved here in Lemma 4.14 is smaller than the bound of Lemma 4.13, and was achieved in a technically easier and conceptually more straightforward manner. 4.3.4 Isolated Vertices of a Random Graph In this subsection, we will use Theorem 4.8 to obtain a result that has not yet been treated elsewhere. 
DeneG(n;p) to be a random graph with n vertices where each edge appears with probability p, independent of all other edges. Let W be the number of isolated vertices (that is, vertices with no edges) ofG(n;p). From this point, we have the following theorem. Theorem 4.15. [11] For Z a normal random variable with mean and variance equal to those of W , lim n!1 d K (W;Z) = 0 if and only if lim n!1 n 2 p = lim n!1 (log(n)np) =1, where d K denotes Kolmogorov distance. In [11], the authors obtain bounds on d K (W;Z) which are of suboptimal order; bounds of the (correct) order [Var(W )] 1=2 are obtained in [57]. As the previous discussion indicates, it is likely that in order to extend this result to translated Poisson approximation in the stronger variation distance, a bound on d TV (W;W + 1) will be needed. In fact, Theorem 4.8 can be applied to obtain the following. Lemma 4.16. Let W be the number of isolated vertices ofG(n;p). Then we have d TV (W;W + 1)(n;p); where (n;p) is an explicit error term such that 1. (n;p) =O([Var(W )] 1=2 ) if lim n!1 np = lim n!1 (log(n)np) =1. 2. (n;p) =O([Var(W )] 1=2 ) if lim n!1 np =c> 0. 3. (n;p) =O([npVar(W )] 1=2 ) if lim n!1 np = 0 and lim n!1 n 2 p =1. 119 Remark. The error term given in the lemma is of the desired order in the rst two cases, but not in the third. In this case, we shall see thatVar(W ) =n 2 p so that(n;p) tends to zero ifn 3 p 2 tends to innity. This requirement is stronger than what is needed for convergence to normality as dictated by Theorem 4.15 (that is, p =O(n ) yields convergence to normality for 1< 2 but (n;p) converges to zero only for 1< 3=2). Proof. In order to apply Theorem 4.8, we construct a suitable exchangeable pair. GivenG(n;p), deneG 0 (n;p) by choosing an edge ofG(n;p) uniformly at random and resampling. LettingW be the number of isolated vertices ofG(n;p), and W 0 be the number of isolated vertices ofG 0 (n;p), it is easy to see that (W;W 0 ) is an exchangeable pair. In order to compute the terms needed to apply Theorem 4.8, we need some auxiliary random variables. Let W k be the number of vertices of degree k inG(n;p) (so that W 0 =W ), and E 2 be the number of connected pairs vertices each having degree one (that is, the number of isolated edges). Then we have P(W 0 =W + 1jG) = W 1 2E 2 n 2 ! (1p); (4.33) P(W 0 =W 1jG) = W (nW ) n 2 ! p: (4.34) To see these equalities, notice that in order to increase the number of isolated vertices by one, an edge must be chosen that has exactly one end vertex of degree one, and then must be removed upon resampling. To decrease the number of isolated vertices by one, an isolated vertex must be connected to a vertex with positive degree. Now we can begin to calculate the terms needed apply Theorem 4.8. From (4.33), in order to calculateP(W 0 =W + 1), we need the rst moments ofW 1 andE 2 . WriteW 1 = P n i=1 X i , where X i is the indicator that vertex i has degree one, and E 2 = P i6=j Y ij , where Y ij is the indicator 120 that there is an edge between vertices i and j and each vertex has degree one. 
Then it follows that EW 1 = 2 n 2 p(1p) n2 ; EE 2 = n 2 p(1p) 2n4 ; so that P(W 0 =W + 1) = 2p(1p) n1 1 (1p) n2 : Using the decompositions ofW 1 andE 2 above, it is a straightforward counting exercise to obtain the remaining moment information needed to compute Var[P(W 0 =W + 1jG)]: EW 2 1 = 2 n 2 p(1p) n2 + 2p n 2 (1p) 2n4 +p(n 2) 2 (1p) 2n5 ; EE 2 2 = n 2 p(1p) 2n4 + 6 n 4 p 2 (1p) 4n12 ; EW 1 E 2 = n 2 p(1p) 2n4 (n 2)(n 3)p(1p) n4 + 2 : Using these equalities we have p Var[P(W 0 =W + 1jG)] P(W 0 =W + 1) " n 2 f 2 (n;p) +nf 1 (n;p) +f 0 (n;p) 2 n 2 p(1p) n2 (1 (1p) n2 ) 2 # 1=2 ; where f 2 (n;p) =p(1p) n3 h 1 (1p) n3 2 (1p) 1 (1p) n2 2 +p(1p) 2n7 i ; f 1 (n;p) =p(1p) n3 h 4 5(1p) 2n7 + 10(1p) n3 + (1p) 1 (1p) n2 2 i ; f 0 (n;p) = 1 (1p) n2 +p(1p) n3 4 + 6(1p) 2n7 12(1p) n3 : 121 WritingW as a sum of indicators, some more involved (but still straightforward) calculations yield the moment information needed to obtain Var[P(W 0 =W 1jG)]: EW =n(1p) n1 ; EW 2 =n(1p) n1 + 2 n 2 (1p) 2n3 ; EW 3 =n(1p) n1 + 6 n 2 (1p) 2n3 + 6 n 3 (1p) 3n6 ; EW 4 =n(1p) n1 + 14 n 2 (1p) 2n3 + 36 n 3 (1p) 3n6 + 24 n 4 (1p) 4n10 : Using these equalities we have p Var[P(W 0 =W 1jG)] P(W 0 =W + 1) " n 2 g 2 (n;p) +ng 1 (n;p) +g 0 (n;p) 2 n 2 (1p) n1 (1 (1p) n2 ) 2 # 1=2 ; where g 2 (n;p) = (1p) n2 h 1 (1p) n3 2 (1p) 1 (1p) n2 2 +p(1p) 2n7 i ; g 1 (n;p) = 1 6(1p) n2 + 10(1p) 2n5 5(1p) 3n9 + (1p) n1 1 (1p) n2 2 ; g 0 (n;p) =1 + 7(1p) n2 12(1p) 2n5 + 6(1p) 3n9 : The upper bound of the lemma follows from Theorem 4.8 by dening (n;p) = " n 2 f 2 (n;p) +nf 1 (n;p) +f 0 (n;p) 2 n 2 p(1p) n2 (1 (1p) n2 ) 2 # 1=2 + " n 2 g 2 (n;p) +ng 1 (n;p) +g 0 (n;p) 2 n 2 (1p) n1 (1 (1p) n2 ) 2 # 1=2 : 122 To show the asymptotic assertions, rst note thatVar(W ) =n(1p) n1 [1+(1p) n1 (np1)], (1p) n e pn , and also that (1p) n2 (1p) n3 implies f 2 (n;p)p 2 (1p) n3 [ 1 (1p) n2 2 + (1p) 2n7 ]; g 2 (n;p)p(1p) n2 [ 1 (1p) n2 2 + (1p) 2n7 ]; which yields a necessary extra factor of p in these terms. As a nal preliminary, for k = 0; 1; 2; dene F k (n;p) = n k f k (n;p) 2 n 2 p(1p) n2 (1 (1p) n2 ) 2 ; G k (n;p) = n k g k (n;p) 2 n 2 (1p) n1 (1 (1p) n2 ) 2 ; so that (n;p) = [F 2 (n;p) +F 1 (n;p) +F 0 (n;p)] 1=2 + [G 2 (n;p) +G 1 (n;p) +G 0 (n;p)] 1=2 : Case 1: lim n!1 np = lim n!1 (log(n)np) =1. In this case, lim n!1 (np) k (1p) n = 0 for all natural numbersk, so thatVar(W )n(1p) n . It is immediate from the previous remarks that we have F 2 (n;p) =O(p); G 2 (n;p) =O(p); F 1 (n;p) =O(n 1 ); G 1 (n;p) =O([n(1p)] 1 ); F 0 (n;p) =O([n 2 (1p)] 1 ); G 0 (n;p) =O([n 2 (1p)] 1 ); which are all terms of order at most Var(W ) 1 . Thus (n;p) =O([Var(W )] 1=2 ) for this case. 123 Case 2: lim n!1 np =c> 0. In this case, lim n!1 (np) k (1p) n = c k e c for all natural numbers k, so that Var(W ) n. It is immediate from the previous remarks that we have F 2 (n;p) =O(p); G 2 (n;p) =O(p); F 1 (n;p) =O(n 1 ); G 1 (n;p) =O(n 1 ); F 0 (n;p) =O(n 2 ); G 0 (n;p) =O(n 2 ); which are all terms of order at most Var(W ) 1 . Thus (n;p) =O([Var(W )] 1=2 ) for this case. Case 3: lim n!1 np = 0 and lim n!1 n 2 p =1. In this case, lim n!1 (1p) n = 1, so thatVar(W )n 2 p and (1 (1p) n ) 1e np np. It follows from the previous remarks that for we have F 2 (n;p) =O([n 2 p] 1 ); G 2 (n;p) =O([n 2 p] 1 ); F 1 (n;p) =O([n 3 p 2 ] 1 ); G 1 (n;p) =O([n 3 p 2 ] 1 ), F 0 (n;p) =O([n 4 p 2 ] 1 ); G 0 (n;p) =O([n 4 p 2 ] 1 ). This bound is suboptimal in order because both F 1 and G 1 are of order [npVar(W )] 1 , and np is tending to zero. 
Actually, a more careful analysis yields G 1 (n;p) =O(n 2 p), but the order of F 1 is correct as stated above. 124 Bibliography [1] Richard Arratia, Andrew D. Barbour, and Simon Tavar e. Logarithmic combinatorial struc- tures: a probabilistic approach. EMS Monographs in Mathematics. European Mathematical Society (EMS), Z urich, 2003. [2] Richard Arratia, Larry Goldstein, and Louis Gordon. Two moments suce for Poisson approximations: the Chen-Stein method. Ann. Probab., 17(1):9{25, 1989. [3] Richard Arratia, Larry Goldstein, and Louis Gordon. Poisson approximation and the Chen- Stein method. Statist. Sci., 5(4):403{434, 1990. With comments and a rejoinder by the authors. [4] Richard Arratia and Simon Tavar e. The cycle structure of random permutations. Ann. Probab., 20(3):1567{1591, 1992. [5] Richard Arratia and Simon Tavar e. Limit theorems for combinatorial structures via discrete process approximations. Random Structures Algorithms, 3(3):321{345, 1992. [6] Richard Askey. Orthogonal polynomials and special functions. Society for Industrial and Applied Mathematics, Philadelphia, Pa., 1975. [7] Pierre Baldi and Yosef Rinott. On normal approximations of distributions in terms of de- pendency graphs. Ann. Probab., 17(4):1646{1650, 1989. [8] Pierre Baldi, Yosef Rinott, and Charles Stein. A normal approximation for the number of local maxima of a random function on a graph. In Probability, statistics, and mathematics, pages 59{81. Academic Press, Boston, MA, 1989. [9] Eiichi Bannai and Tatsuro Ito. Algebraic combinatorics. I. The Benjamin/Cummings Pub- lishing Co. Inc., Menlo Park, CA, 1984. Association schemes. [10] Andrew D. Barbour, Lars Holst, and Svante Janson. Poisson approximation, volume 2 of Oxford Studies in Probability. The Clarendon Press Oxford University Press, New York, 1992. Oxford Science Publications. [11] Andrew D. Barbour, Micha l Karo nski, and Andrzej Ruci nski. A central limit theorem for decomposable random variables with applications to random graphs. J. Combin. Theory Ser. B, 47(2):125{145, 1989. [12] Andrew D. Barbour and Aihua Xia. Poisson perturbations. ESAIM Probab. Statist., 3:131{ 150 (electronic), 1999. [13] Leonard E. Baum and Patrick Billingsley. Asymptotic distributions for the coupon collector's problem. Ann. Math. Statist., 36:1835{1839, 1965. [14] Erwin Bolthausen. An estimate of the remainder in a combinatorial central limit theorem. Z. Wahrsch. Verw. Gebiete, 66(3):379{386, 1984. 125 [15] Pierre Br emaud. Markov chains, volume 31 of Texts in Applied Mathematics. Springer- Verlag, New York, 1999. Gibbs elds, Monte Carlo simulation, and queues. [16] Timothy C. Brown and Michael J. Phillips. Negative binomial approximation with Stein's method. Methodol. Comput. Appl. Probab., 1(4):407{421, 1999. [17] Theolos Cacoullos, Vassilis Papathanasiou, and Sergey A. Utev. Variational inequalities with examples and an application to the central limit theorem. Ann. Probab., 22(3):1607{ 1618, 1994. [18] Sourav Chatterjee. Concentration inequalities with exchangeable pairs. http://arxiv.org/ math.PR/0507526, 2005. Ph.D. dissertation, Stanford University. [19] Sourav Chatterjee. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138(1-2):305{321, 2007. [20] Sourav Chatterjee. A new method of normal approximation. Ann. Probab., 36(4):1584{1610, 2008. [21] Sourav Chatterjee. Fluctuations of eigenvalues and second order Poincar e inequalities. Probab. Theory Related Fields, 143(1-2):1{40, 2009. 
Abstract
Stein's method is a powerful tool for obtaining error terms in distributional approximations and limit theorems. In one formulation, an auxiliary stochastic object termed an exchangeable pair must be constructed in order to apply the method. This dissertation has two main purposes: first, to examine the effect that the choice of exchangeable pair has on the computation and quality of the resulting error term; and second, to derive new tools using exchangeable pairs and the main principles of Stein's method.
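As a concrete illustration (added here for the reader; it does not appear in the dissertation itself), the standard exchangeable pair for a sum W of independent random variables is built by choosing an index uniformly at random and resampling that summand from its own distribution, which yields W' such that (W, W') and (W', W) have the same joint distribution. A minimal Python sketch of this construction, under those assumptions:

import random

def exchangeable_pair(xs, resample):
    # xs: realized values of independent summands; resample(i) draws a fresh,
    # independent copy of the i-th summand from its original distribution.
    w = sum(xs)
    i = random.randrange(len(xs))       # pick a coordinate uniformly at random
    w_prime = w - xs[i] + resample(i)   # replace that summand by an independent copy
    return w, w_prime

# Example: W is the number of heads in 10 fair coin flips.
xs = [random.randint(0, 1) for _ in range(10)]
print(exchangeable_pair(xs, lambda i: random.randint(0, 1)))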
Asset Metadata
Creator
Ross, Nathan Forrest (author)
Core Title
Exchangeable pairs in Stein's method of distributional approximation
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Mathematics
Publication Date
08/06/2009
Defense Date
06/18/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
exchangeable pairs, OAI-PMH Harvest, Stein's method
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Fulman, Jason (committee chair), Goldstein, Larry M. (committee member), Kempe, David (committee member)
Creator Email
nathanfr@usc.edu,ross@stat.berkeley.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2509
Unique identifier
UC1227600
Identifier
etd-Ross-3031 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-181948 (legacy record id),usctheses-m2509 (legacy record id)
Legacy Identifier
etd-Ross-3031.pdf
Dmrecord
181948
Document Type
Dissertation
Rights
Ross, Nathan Forrest
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu