NOISE BENEFITS IN MARKOV CHAIN MONTE CARLO COMPUTATION

by

Brandon Franzke

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2015

Copyright 2015 Brandon Franzke

Contents

Contents
List of Tables
List of Figures
Abstract
1 Noise Benefits in Markov Chain Monte Carlo
  1.1 Noise Boosting Markov Chain Monte Carlo
  1.2 Noise Enhances Simulated Annealing Optimization
  1.3 Noisy Quantum Annealing
  1.4 Markov Chain Noise Benefits
  1.5 Estimating Bell Curve Tail Thickness in Symmetric α-Stable Samples
2 Noise Can Speed Convergence in Markov Chains
  2.1 Review of Markov Chains
  2.2 Noise Benefits and Stochastic Resonance
  2.3 Noise Benefits in Markov Chain Density Estimation
  2.4 Markov Chain Noise Benefit Theorem
  2.5 Markov Chain Noise Benefit Algorithms
  2.6 Markov Chain Experimental Results
    2.6.1 Noise Benefits in the Ehrenfest Diffusion Model
    2.6.2 Noise Benefits in a Population Genetics Model
    2.6.3 Noise Benefits in a Chemical Reaction Model
  2.7 Markov Chain Noise Benefit Theorem Simulation
    2.7.1 One-step Markov Chain Simulation
    2.7.2 Two-step Markov Chain Simulation
  2.8 Conclusion
3 Noisy Markov Chain Monte Carlo (N-MCMC)
  3.1 Markov Chain Monte Carlo
    3.1.1 Introduction
    3.1.2 Monte Carlo
    3.1.3 Markov Chains Govern Transitions Between States
    3.1.4 MCMC in Bayesian inference
  3.2 MCMC Convergence
    3.2.1 Aperiodic-irreducible MCMC Converge to a Unique Stationary Density
    3.2.2 Coupling Constructions for Convergence Proofs
    3.2.3 MCMC Exhibits a Uniform Ergodicity
    3.2.4 MCMC Sometimes Exhibit Geometric Ergodicity
    3.2.5 Central Limit Theorems for Aperiodic-irreducible MCMC
    3.2.6 Noise Benefits in MCMC Algorithms
  3.3 Noisy MCMC Theorems
  3.4 Noisy Markov Chain Monte Carlo
    3.4.1 Generalized Noise Benefits Extend the Additive N-MCMC Result
    3.4.2 Noisy Metropolis-Hastings with Gaussian or Cauchy Jump Densities
    3.4.3 MCMC with Gaussian Jump Densities
    3.4.4 MCMC with Cauchy Jump Densities
  3.5 The Noisy Metropolis-Hastings Algorithm
4 Noisy Markov Simulated Annealing (N-SA)
  4.1 Classical Simulated Annealing
  4.2 Noisy Simulated Annealing Theorems
    4.2.1 Noisy Simulated Annealing with Convex Increasing Cost-Probability Maps
  4.3 The Noisy Simulated Annealing Algorithm
  4.4 Applications
    4.4.1 Noise improves complex optimization
    4.4.2 Noise benefits in molecular dynamics simulations
5 Noisy Simulated Quantum Annealing
  5.1 Quantum Annealing
  5.2 Quantum Annealing to Solve NP problems
  5.3 The Noisy Simulated Quantum Annealing Algorithm
    5.3.1 Noise improves quantum MCMC
6 Future directions
  6.1 α-stable Noisy Simulated Annealing
  6.2 Controlled Diffusions for Biochemical Optimization
  6.3 Noisy Genetic Algorithms to explore NP-complete Problems
References
A Bootstrap-based Estimation of the Bell-Curve Tail Thickness of Symmetric Alpha-Stable Random Variables
  A.1 Estimating Symmetric α-Stable Tail Thickness
  A.2 The α-Stable Map Theorem
  A.3 The BEAST Algorithm
    A.3.1 Stage 1: Construct the α-map
    A.3.2 Stage 2: Estimate α from an unknown SαS noise source
  A.4 Experimental Results for α̂
    A.4.1 α̂ compared to other estimators
  A.5 Conclusion
B Carbon Nanotubes for Increasing Soil Cation Exchange Capacity
  B.0.1 Carbon nanotubes should increase soil CEC
  B.0.2 Increased CEC benefits plants
  B.0.3 Corollary Hypothesis: CNTs should benefit industrial clays

List of Tables

2.1 Number of molecules (N = 12) per compartment in simulation state i
2.2 Noise benefits in one-step Markov chain simulation
2.3 Noise benefits in one-step Markov chain simulation – unknown error sign
4.1 Argon Lennard-Jones 12-6 parameters
A.1 Performance of the BEAST algorithm
A.2 Comparison of the BEAST algorithm by α
A.3 Comparison of the BEAST algorithm by N
B.1 Ionic properties. Ca2+ and Mg2+ are similar to Li+

List of Figures

1.1 The Schwefel function is a classical high-dimensional non-linear optimization benchmark
1.2 Evolution of the 2-dimensional histogram of MCMC samples from the 2-D Schwefel function
1.3 Noise can make the next choice of location x + n more probable
1.4 Noise improves simulated annealing by forcing the algorithm out of local minima
1.5 Simulated annealing (SA) uses thermal energy to drive the MCMC random walk
1.6 The traveling salesman problem is a classic benchmark test for optimization algorithms
1.7 Quantum annealing (QA) uses tunneling to go through energy peaks
1.8 The noisy quantum annealing algorithm propagates noise along the Trotter ring
1.9 Simulated quantum annealing noise benefit in 1024 Ising spin simulation
1.10 The two compartment Ehrenfest diffusion model is an example of a Markov chain
1.11 The Wright-Fisher population genetics model shows how Markov chain methods can lead to predictive statistics
1.12 Symmetric α-stable probability density functions and sample realizations
2.1 A schematic representation of a finite state Markov chain
2.2 A reducible Markov chain
2.3 A periodic Markov chain
2.4 Stochastic resonance on faint images (mandrill and Lena test images) using white Gaussian pixel noise
2.5 The non-monotonic signature of stochastic resonance
2.6 Noise benefits in the 2-parameter (Krafft-Schaefer) Ehrenfest diffusion model
2.7 Two compartment Ehrenfest diffusion model
2.8 Two compartment Krafft-Schaefer asymmetric diffusion model
2.9 Noise benefits in Wright-Fisher population genetics
2.10 Wright-Fisher mating
2.11 Markov dynamics of a Wright-Fisher genotype
2.12 Noise benefits in an empirical chemical network Markov model
2.13 Zeolite reaction scheme
2.14 Noise benefits in Markov chain density estimation
2.15 Multi-cycle noise benefits in Markov chain density estimation
3.1 Noise boosts sampling by forcing the sample pdf to more closely resemble the target pdf
3.2 MCMC noise benefits as a function of noise parameters
4.1 The Schwefel function f(x) = 419.9829d − Σ_{i=1}^{d} x_i sin(√|x_i|) is a d-dimensional optimization benchmark on the hypercube −512 ≤ x_i ≤ 512
4.2 Simulated annealing sample sequences from 5-dimensional (projected to 2-D) Schwefel with log cooling schedule show how noise increases breadth of search
4.3 Simulated annealing noise benefits with 5-dimension Schwefel energy surface and log cooling schedule
4.4 Noise benefits decrease convergence time under accelerated cooling schedules
4.5 Sample RMSD measurements used during scoring
4.6 QXP force field model and GRID energy calculation
4.7 The FlexX algorithm partitions major reactive substructures called fragments from the ligand
4.8 Six representative compounds that Makhija and Kulkarni use in the HIV-1 integrase inhibitor case study
4.9 Maps indicating spatial sensitivity of the interaction to electrostatic (left) and steric (right) factors
4.10 Superposition of CoMFA contour plots on active site of HIV-1
4.11 The Lennard-Jones 12-6 potential well approximates pairwise interactions between two neutral atoms
4.12 MCMC noise benefit for an MCMC molecular dynamics simulation
5.1 The noisy quantum annealing algorithm propagates noise along the Trotter ring
5.2 Simulated quantum annealing noise benefit in 1024 Ising spin simulation
6.1 Impulsive α-stable noise improves simulated annealing optimization
6.2 Brownian motion and Cauchy flights
6.3 Kinesin walking vs. loading
6.4 Cauchy noise dispersion and load force plot
6.5 Kinesin moves at 8.3 nm steps along a microtubule scaffold
6.6 Genetic algorithm prototype test function
6.7 The rate of convergence of the genetic algorithm to the cost function
A.1 Symmetric α-stable probability density functions
A.2 Impulsive samples from SαS random variables with unit dispersion
A.3 The α-map takes α to the test statistic and its inverse takes the test statistic to α̂
A.4 g_p(X) for p = 2
A.5 g_p(X) for p = −1
A.6 Calculated value of the test statistic for three realizations of an i.i.d. SαS random sequence having α = 0.5, 1.0, and 2
A.7 Stage 2 of the BEAST algorithm
A.8 The BEAST algorithm can use a linear map to estimate α
A.9 α̂ estimated from i.i.d. SαS random observations with α = 2 (Gaussian), 1 (Cauchy), 0.5, and 0.2
A.10 Estimated value of α for two non-i.i.d. SαS random sequences with time-dependent α
A.11 Small sample (N = 10) comparison of α̂_BEAST and four standard α-estimators
A.12 A comparison by varying sample size (5 ≤ N ≤ 80) of α̂_BEAST and four standard α-estimators
A.13 Small sample (N = 10) comparison of α̂_BEAST and four standard α-estimators
B.1 Carbon nanotubes are covalently linked through an ester bond to the Al3+-exchanged clay substrate under heat

Abstract

Noise can improve Markov chain Monte Carlo (MCMC) estimation. This thesis shows that MCMC can exploit noise benefits to improve estimator performance and speed up convergence. Modern computation predicates itself on solutions to the problem: how does one efficiently search a complex high-dimensional space?
MCMC proposes a statistical answer by considering the reverse question: assuming the solution, how can one reach it given any starting point? The success of MCMC turns on the thermodynamic principle called detailed balance. Detailed balance allows the chain to act interchangeably in the forward or backward direction. The Metropolis-Hastings algorithm, the Gibbs sampler, and simulated annealing are special cases of random walk MCMC algorithms. MCMC promises some of the most efficient solutions to NP problems. This thesis presents three major theoretical results. The Noisy Markov Chain Monte Carlo Theorem shows that injected noise can lead to better MCMC sampling and reduce burn-in time. The noise gives the system access to a statistically richer set of otherwise improbable states. The related Markov Chain Noise Benefit Theorem describes a noise benefit criterion that applies generally to all Markov chains. The Noisy Simulated Annealing Theorem shows that noise can also boost MCMC optimization. Simulated annealing (SA) introduces a notion of temperature to constrain estimates within thermally optimal regions of the potential energy surface. This thesis describes noisy versions of the Metropolis-Hastings algorithm and classical simulated annealing. This thesis also presents the noise-boosted quantum annealing algorithm. Quantum annealing (QA) replaces the temperature in classical simulated annealing with probabilistic quantum tunneling. QA uses tunneling to escape local minima by burrowing through high energy peaks. The noisy quantum annealing algorithm
This thesis shows that symmetric alpha-stable noise can lead to substantial MCMC performance benefits. This suggests MCMC algorithms could tune noise tail thickness to further enhance the noise benefit. xii Chapter 1 Noise Benefits in Markov Chain Monte Carlo This thesis shows that noise can improve Markov Chain Monte Carlo (MCMC) estima- tion. MCMC algorithms form a cornerstone of modern statistics and computing. These methods find wide application in the fields of statistical physics, chemistry, pharmacol- ogy, optimization, decision theory, Bayesian inference, particle systems, and finance. MCMC also promises some of the most ecient solutions to NP problems which rede- fine the definition of hard. Modern computation predicates itself on solutions to the problem: how does one eciently search a complex high dimensional space. MCMC proposes a statistical answer by considering the reverse question: assuming the solution how can one reach it given any starting point. The success of MCMC turns on the ther- modynamic principle called detailed balance. Detailed balance allows the chain to act interchangeably in the forward or backward direction. Markov Chain Monte Carlo prescribes structure to draw sample solutions in an e- cient manner. But complex problems lead to large solution spaces. This leads to algo- rithms that need a long time to compute optimal solutions. Careful Noise injection can ameliorate this concern. The Noisy Markov Chain Monte Carlo Theorem (Theorem 3.15) describes a condition that leads to better MCMC sampling and reduces burn-in time. Noise can give the Markov chain system access to a statistically richer set of otherwise improbable states. The Noisy Simulated Annealing Theorem (Theorem 4.1) shows that noise can also boost MCMC optimization. Simulated annealing (SA) estimates the minimum of a potential energy surface. SA constructs a Markov chain confined to thermally optimal regions of the potential energy surface. 
Quantum annealing (QA) is the modern successor to classical annealing. QA replaces thermal jumping with quantum tunneling. QA uses tunneling to escape local minima by burrowing through high energy peaks. QA can outperform classical annealing in certain types of problems. The thesis presents a noise-boosted quantum annealing algorithm. The algorithm shows how quantum noise can further improve QA.

These noise benefit results sit apart from the very general Markov Chain Noise Benefit Theorem (Theorem 2.1). Markov chain models appear extensively in chemistry, statistical mechanics, genetics, speech recognition, and finance. All MCMC algorithms rely on the theory underlying Markov chains to provide a mechanism to generate new estimates given the current state. The Markov Chain Noise Benefit Theorem describes a noise benefit criterion that applies generally to all Markov chains. But the form of the noise benefit condition makes it difficult to apply beyond direct applications of Markov chain density evolution.

The thesis closes with a small-sample bootstrap algorithm that can estimate the tail thickness parameter for alpha-stable bell curves. The α-stable densities generalize the Gaussian density and they appear in a large number of practical settings through the generalized central limit theorems. This thesis shows that symmetric alpha-stable noise can lead to substantial MCMC performance benefits. This suggests MCMC algorithms could tune noise tail thickness to further enhance the noise benefit.

1.1 Noise Boosting Markov Chain Monte Carlo

The Noisy Markov Chain Monte Carlo (N-MCMC) theorem (Theorem 3.15) is the major result in this thesis. The theorem gives a sufficient condition under which additive noise N leads to a more efficient sampling of the target density. This is the first MCMC noise benefit result and it applies to any random walk Markov chain Monte Carlo algorithm.
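The kind of random walk MCMC sampler that the result applies to is easy to sketch. The following is a minimal random walk Metropolis-Hastings sampler in Python; it is an illustration rather than code from the thesis, and the bimodal target, jump scale, and burn-in length are arbitrary choices:

```python
import math
import random

def target(x):
    """Unnormalized target density pi(x): a two-mode example (arbitrary choice)."""
    return math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples, jump_scale=1.0, x0=0.0, seed=1):
    """Random walk Metropolis-Hastings: propose y = x + jump and accept with
    probability min(1, pi(y)/pi(x)). The symmetric Gaussian jump density
    cancels in the acceptance ratio, and the accept rule enforces detailed
    balance with respect to pi."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        y = x + rng.gauss(0.0, jump_scale)
        if rng.random() < min(1.0, target(y) / target(x)):
            x = y          # accept the jump
        samples.append(x)  # a rejected jump repeats the current state
    return samples

samples = metropolis_hastings(20000)
kept = samples[2000:]  # discard burn-in samples still tied to the start state
```

The histogram of `kept` approximates the normalized target density; only ratios of the unnormalized density ever enter the algorithm, which is why MCMC needs no normalizing constant.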
MCMC methods pose a statistical solution to the following problem: how does one efficiently search a complex high-dimensional space? The solution subsumes domain-specific questions like "what is the risk that I will develop pancreatic cancer before age 70?" [354, 263], "what did that person most likely say?" in a speech recognition context [163, 158, 83], or "how long until the mortgage-backed financial bubble bursts?" [157, 104, 159, 97]. MCMC proposes a statistical answer to these questions by considering the reverse problem: assuming the solution, how can one reach it given any starting point? The success of MCMC turns on the thermodynamic principle called detailed balance. Detailed balance allows the chain to act interchangeably in the forward or backward direction.

MCMC algorithms proceed by relying on local properties of the space to determine their dynamical behavior. MCMC algorithms generalize the concept of search by assuming that not all points in the sample space are equal. The space contains low probability valleys punctuated by high probability peaks. MCMC navigates the search space in a probabilistic way. MCMC comes equipped with the promise that "all roads lead to Rome". In high dimensions exact solutions to these problems become intractable because it is often not possible to explicitly state the shape of the probability terrain.

Monte Carlo methods tackle problems by sampling from a space to construct iterative improvements to a probabilistic estimate. Monte Carlo usually employs one of two sampling methods: i.i.d. sampling (i.i.d. MC) or dependent sampling (MCMC). I.i.d. sampling uses a naive approach of sampling uniformly from the entire space. MCMC sampling methods use carefully constructed random walks to explore the probability surface. MCMC algorithms introduce random jumps to ensure that they fuse local knowledge with novel information from new regions of the space.

MCMC often works well for problems of high dimension.
It is a natural algorithm to apply to many big-data problems since big data can both create and exaggerate "big dimension." MCMC also complements Expectation Maximization (EM) algorithms as a means to work within the posterior density on the left-hand side of the Bayesian inference master equation

f(θ | x) ∝ g(x | θ) h(θ).    (1.1)

Bayesian methods often employ MCMC internally to compute complex expectations and integrals of functions that arise during estimation. MCMC simulation itself arose in the early 1950s when physicists modeled the intense energies and high particle dimensions involved in the design of thermonuclear bombs. Many refer to this algorithm as the Metropolis algorithm or the Metropolis-Hastings algorithm after Hastings' modification to it in 1970. The original 1953 paper computed thermal averages for 224 hard spheres that collided in the plane. Its high-dimensional state space was R^448. So even standard random-sample Monte Carlo techniques were not feasible. These simulations ran on the first ENIAC and MANIAC computers. Special cases of MCMC include the Metropolis-Hastings algorithm and Gibbs sampling in Bayesian statistical inference.

Figure 1.1: The Schwefel function is a classical high-dimensional non-linear optimization benchmark. (a) noise speeds convergence; (b) noise reduces error. It has a single global minimum. The surface contains irregular troughs separated by energy peaks. This punishes search algorithms that emphasize local search. Markov chain Monte Carlo (MCMC) methods work by random walking along the surface. MCMC usually only knows the local behavior of a function.

Figure 1.2: The three panels show evolution of the 2-dimensional histogram of MCMC samples from the 2-D Schwefel function (Figure 1.1). (a) After 1000 samples the simulation explores only a small region of the space. The simulation has not sufficiently burned in. The samples remain close to the initial state because the MCMC random walk proposes new samples near the current state. This early histogram does not match the Schwefel density. (b) The 10,000 sample histogram better matches the target density but there are still large unexplored regions. (c) The 100,000 sample histogram shows that the simulation explored most of the search space. The tallest (red) peak indicates that the simulation found the global minimum. Note that the histogram peaks correspond to energy minima on the Schwefel surface.

MCMC simulations resemble the behavior of a traveler walking along an energy surface. An energy surface formalizes the concept of inverse probability where low energy valleys correspond to regions of high probability and high energy mountains correspond to regions of low probability. In this analogy the traveler wanders until they become stuck in a valley. They remain in the valley for a time proportional to the height of the energy peaks. Eventually they muster the drive to scale the surrounding hills and then continue their walk until they encounter the next valley.

MCMC methods tend to suffer from slow convergence for large problems. This is because local minima tend to attract the estimate. Algorithms in a local high probability region (an energy valley) take many attempts to escape since the MCMC random walk biases the algorithm to sample only nearby states. After each failed attempt the algorithm resets back into the valley. Serial correlation between samples also describes why MCMC simulations often show strong sensitivity to the initial conditions. Algorithms that start near (or in) a valley may require a large number of samples before they can escape and start to explore beyond the initial condition. The term burn-in describes the time where a user discards samples deemed too dependent on the initial conditions.

The N-MCMC theorem stems from a simple intuition: find a noise sample n that makes the next choice of location x + n more probable.
The N-MCMC algorithm uses noise to bring each Markov step closer on average to the equilibrium pdf. It relies on the idea that an MCMC algorithm can sometimes increase the similarity between an individual sample x_t and the target density.

Figure 1.3: Noise can make the next choice of location x + n more probable. The N-MCMC algorithm uses noise to bring each Markov step closer on average to the equilibrium pdf. The estimate is trapped in a local minimum at time t. The lower pink curve represents the jump distribution centered on the current state. The shaded area indicates the most likely next state candidate. (a) Regular MCMC, P_escape ≈ 0: the jump distribution is not likely to propose a state outside of the local minimum. Thus it will take several consecutive disfavored jumps to escape the trough. (b) Noisy MCMC, P_escape > 0: noisy MCMC uses injected noise to perturb the current state prior to jumping. Thus the noisy simulation is more likely to escape the minimum because it jumps from a normally disfavored position. If the N-MCMC simulation does not accept the new candidate the state remains at x_t and not x_t + n.

Theorem 1.1 (Noisy Markov Chain Monte Carlo Theorem). An MCMC noise benefit occurs on average if

E_{N,X}[ln(Q(y | x_t + N) / Q(y | x_t))] ≥ E_N[ln(π(x_t + N) / π(x_t))].    (1.2)

The theorem takes advantage of detailed balance between Q and π, which MCMC algorithms generally have built into them. The condition demonstrates that noise benefits can emerge by solving an inverse problem: how can one condition noise to increase the probability of a particular event. The N-MCMC theorem back-solves a non-causal noise benefit statement and then uses detailed balance to express the condition in terms of the current state. Figure 1.4 shows how noise can assist search by encouraging the algorithm to sample from novel high probability regions.
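A quick numerical sketch can estimate the two expectations that condition (1.2) compares. The setup below uses a standard normal target, Gaussian jump and noise densities, and plain Monte Carlo averages; all scales are arbitrary illustrative choices, not values from the thesis:

```python
import math
import random

def estimate_nmcmc_sides(jump_scale, noise_scale, trials=200_000, seed=0):
    """Monte Carlo estimates of the two expectations in condition (1.2) for a
    standard normal target pi and a Gaussian jump density Q (illustrative
    setup; the scales are arbitrary choices).

    Returns (jump_side, target_side):
      jump_side   ~ E[ln Q(y | x + N) - ln Q(y | x)]
      target_side ~ E[ln pi(x + N) - ln pi(x)]"""
    rng = random.Random(seed)

    def log_pi(x):    # log standard normal target, up to an additive constant
        return -0.5 * x * x

    def log_q(y, x):  # log Gaussian jump density Q(y|x), up to a constant
        return -0.5 * ((y - x) / jump_scale) ** 2

    jump_side = target_side = 0.0
    for _ in range(trials):
        x = rng.gauss(0.0, 1.0)             # X ~ pi
        n = rng.gauss(0.0, noise_scale)     # injected noise sample N
        y = x + rng.gauss(0.0, jump_scale)  # proposal y ~ Q(. | x)
        jump_side += log_q(y, x + n) - log_q(y, x)
        target_side += log_pi(x + n) - log_pi(x)
    return jump_side / trials, target_side / trials

lhs, rhs = estimate_nmcmc_sides(jump_scale=2.0, noise_scale=1.0)
# For this Gaussian setup the averages are analytically
# lhs = -noise_scale**2 / (2 * jump_scale**2) = -0.125 and
# rhs = -noise_scale**2 / 2 = -0.5, so lhs exceeds rhs here.
```

In this toy model the jump-side average exceeds the target-side average whenever the jump scale is at least as wide as the target's scale, so the noise scale here parameterizes a family of benefit-eligible perturbations.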
Figure 1.4: Noise improves simulated annealing (SA) by forcing the algorithm out of local minima. (a) Without noise: thermal noise is not enough to induce the noiseless algorithm to search the space beyond three local minima. (b) With noise: noisy simulated annealing uses noise-boosted thermal jumps to increase the search breadth. It visits the same three minima as (a) but it searches each local optimum for only a few hundred steps before jumping to the next minimum. Noisy simulated annealing visits more local minima and quickly moves away from the same minima that trapped non-noisy SA. The red circle (lower left) indicates the global minimum. The noisy algorithm settles one hop away from the global minimum.

Chapter 3 presents several corollaries to the N-MCMC theorem. The first corollary generalizes the result from additive noise x_t + N to arbitrary modes of noise injection
Simulated annealing maps the target cost function C (x) to a potential energy surface through the Boltzmann factor e p(x t )/ exp " C (x t ) kT # (1.3) and then performs the Metropolis-Hastings algorithm with e p(x t ) in place of the prob- ability density p(x). Figure 1.5 illustrates how SA uses a notion of slowly decreasing temperature to constrain samples to decreasing energy regions. For temperatures near absolute zero the algorithm only permits state changes that lower the estimates energy. The current state becomes locked within successively lower energy minimas as the algo- rithm proceeds. MCMC algorithms require closed and bounded sample spaces and so the decreasing energy estimates eventually attain the global optimum. Note that the algorithm can optimize for functions of the global minimum by considering the mini- mum of functions of C (x) such as find the maximum of C (x) by finding the minimum ofC (x). SA algorithms use a cooling schedule to update the temperature. The algorithm provably attains a global minimum in the t!1 limit but this requires an extremely slow T (t)/ log(t + 1) cooling. Accelerated cooling schedules such as geometric T (t)/ t or 8 Figure 1.5: Simulated annealing (SA) uses heat to drive the MCMC random walk [208]. (left) SA begins with a relatively high temperature. This imparts high thermal energy to virtual particle that represents the current state. This allows the algorithm to access high energy states. This corresponds to scaling large energy peaks on the potential energy surface. (middle) SA reduces the temperature as time increases. The callout shows that the white particle became trapped in a deep local minima. The other two particles found shallow local minimas. But there is still sucient thermal energy to continue the search. (right) SA reduces the temperature further. As the temperature approaches absolute zero particles cannot make any jumps except those that reduce the system energy. 
In this pane the white particle represents an estimate will likely never emerge from the deep suboptimal energy solution. Simulated Annealing guarantees that all estimates will find the global energy if the algorithm uses a cooling schedule that decreases temperature at most logarithmically. exponential T (t)/ exp d p t often yield satisfactory approximations in practice. This means that an increase in estimator power scales as a time power law. The Noisy Simulated Annealing Theorem stems from a simple intuition: Find a noise sample n that increases acceptance probability of the next choice of location. The NSA theorem shows that noise can encourage a SA algorithm to selectively sample high probability regions. This in turn lead to jumps that increase the search breadth and ultimately increases the optimization speed. The NSA theorem proves that noise can boost the acceptance probability of new samples subject to a positivity condition. 9 Figure 1.6: The traveling salesman problem is a classic benchmark test for optimization algorithms [337]. Each node corresponds to a city. The optimal solution to the problem finds the shortest route such that it visits every city exactly once. This example shows how an annealing algorithm searches the vast solution space in a 500 city problem. The immense solution space contains 499! (2:4403 10 1132 ) possible solutions since the problem solution is an ordered list of the 500 cities (500!) but without regard for the particular starting city (divide by 500). (left) SA algorithms tackle search problems by randomly guessing a starting estimate. (center) The algorithm then makes minor changes to the estimate and accepts any that improve the overall estimate. This corre- sponds to decreasing the total path length in this example. SA has a special procedure to occasionally accept less optimal solutions. This ensures it can escape from suboptimal local minimas. 
(right) The SA algorithm found the optimal solution from the vast search space.

Theorem 1.2 (Noisy Simulated Annealing Theorem). A simulated annealing noise benefit occurs on average if

$E_N\left[\ln\dfrac{\pi(x_t + N; T)}{\pi(x_t; T)}\right] \ge 0.$ (1.4)

This theorem provides a complementary condition to the N-MCMC inequality. The theorem demonstrates for the first time how a simulated annealing algorithm can use noise to boost its sampling efficiency. The noise acts to increase the acceptance probability of samples along the SA random walk. The intuition is that perturbing the walk away from local minima makes the algorithm more likely to jump to regions of the state space that it could not reach from the minima (see Figure 1.4). The effect is that the random walk explores more of the space. This in turn increases the speed of the algorithm in optimizing the surface.

Chapter 4 describes the noisy simulated annealing algorithm (Algorithm 4.1). The chapter presents simulations that show noisy SA leads to faster convergence and decreased estimator variance. A comparison of SA and N-SA on the 5-D Schwefel global benchmark problem (Figure 1.1) shows that noisy SA converges to the global minimum in 76% fewer steps than the noiseless SA algorithm. It also shows that noise decreased the estimate error by two orders of magnitude (from an SA average of 4.7 to under 0.07 with N-SA). The comparison shows further that even a small amount of tuned noise can reduce failures to find the optimum from 4.5% in noiseless SA to 0.2%.

Another series of simulations shows how noise impulsiveness can amplify the noise benefit. α-stable random variables generalize the finite-variance Gaussian bell curve. Thick-tailed α-stable pdfs find many applications in physics and engineering where thicker tails can model energetic or impulsive processes. An appendix to this thesis presents a bootstrap-driven BEAST (Bootstrap Estimate of Alpha STable) algorithm as a practical way to estimate α from small or large data sets.
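The expectation in condition (1.4) can be estimated by simple Monte Carlo. The sketch below uses a toy quadratic cost and an assumed Boltzmann target $\pi(x;T) \propto \exp(-C(x)/T)$; the noise distributions are illustrative, chosen only to contrast a "helpful" (downhill-biased) noise source with blind zero-mean noise.

```python
import math
import random

def boltzmann(x, T, cost=lambda x: x * x):
    """Unnormalized Boltzmann factor exp(-C(x)/T) for a toy quadratic cost."""
    return math.exp(-cost(x) / T)

def nsa_condition(x, T, noise_draw, trials=20000):
    """Monte Carlo estimate of E_N[ln pi(x+N;T) / pi(x;T)].
    A nonnegative value signals the NSA noise-benefit condition (1.4)."""
    total = 0.0
    for _ in range(trials):
        n = noise_draw()
        total += math.log(boltzmann(x + n, T) / boltzmann(x, T))
    return total / trials

random.seed(1)
# noise biased downhill from x = 2 satisfies the condition on average...
downhill = nsa_condition(2.0, T=1.0,
                         noise_draw=lambda: -abs(random.gauss(0.0, 0.2)))
# ...while blind symmetric zero-mean noise at the same point does not
blind = nsa_condition(2.0, T=1.0,
                      noise_draw=lambda: random.gauss(0.0, 0.2))
```

Here the log-ratio is simply $(C(x) - C(x+n))/T$, so noise that tends to lower the cost makes the expectation positive.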
1.3 Noisy Quantum Annealing

The Noisy Quantum Annealing Algorithm is the next major result in this thesis. Quantum annealing (QA) uses quantum forces to evolve the state according to a quantum Hamiltonian instead of the thermodynamic excitation of (classical) simulated annealing. Simulated quantum annealing uses an MCMC framework to simulate draws from the square of the wave function instead of solving the time-dependent Schrödinger equation:

$i\hbar \dfrac{\partial}{\partial t}\psi(\mathbf{r},t) = \left[-\dfrac{\hbar^2}{2m}\nabla^2 + V(\mathbf{r},t)\right]\psi(\mathbf{r},t).$ (1.5)

The classical simulated annealing acceptance probability is proportional to a ratio of functions of the energies of the old and new states. This can prevent beneficial hops if there are energy peaks between minima. Figure 1.7 shows how QA uses probabilistic quantum tunneling to allow occasional jumps through high energy peaks. QA introduces a transverse magnetic field in place of temperature in classical SA. The strength of the magnetic field governs the transition probabilities of the system.

Figure 1.7: Quantum annealing (QA) uses tunneling to pass through energy peaks (yellow). Compare this to classical simulated annealing (SA) that must generate a sequence of states to scale the peak (blue). This example shows a local minimum that has trapped the estimate (green). SA will require a sequence of unlikely jumps to scale the potential energy hill. This might be an unrealistic expectation at low SA temperatures. This would trap the estimate in the suboptimal valley forever. QA uses quantum tunneling to burrow through the mountain. This illustrates why QA often produces far superior estimates over SA when optimizing complex potential energy surfaces that contain many high energy states.

The adiabatic theorem ensures that the system remains near the ground state during slow changes of the magnetic field.
Adiabatically evolving the time-dependent Hamiltonian

$H(t) = \left(1 - \dfrac{t}{T}\right)H_0 + \dfrac{t}{T}H_P$ (1.6)

then recovers the minimum energy configuration of the potential energy surface as time t approaches a fixed large value T.

Quantum annealing outperforms classical simulated annealing in cases where the potential energy landscape contains many high but thin energy barriers between shallow local minima. It is especially suited to problems in discrete search spaces with a huge number of local minima such as finding the ground state of an Ising spin glass. Recent research shows that each of Karp's 21 NP-complete problems maps to an equivalent Ising formulation. NP-complete problems are a special class of decision problem with time complexity super-polynomial in the input size (NP-hard) but with only polynomial-time solution verification (NP). The NP-complete problems include many optimization benchmarks such as graph partitioning, finding an exact cover, integer-weight knapsack packing, graph coloring, and the traveling salesman problem. Advances by D-Wave Systems have brought quantum annealers to market and shown how adiabatic quantum computers are suitable for solving real-world applications. An open problem is how this abstraction ties back to true quantum annealing.

Figure 1.8: The noisy quantum annealing algorithm propagates noise along the Trotter ring. After each time step the algorithm inspects the local energy landscape. It injects noise in the form of conditionally flipping the spins of neighbors. This in turn diffuses the noise across the network because quantum correlations between the neighbors encourage convergence to the optimal solution.

Chapter 5 presents the Noisy Quantum Annealing Algorithm. Figure 1.8 shows how the noisy quantum annealing algorithm injects noise into the system via quantum-correlated neighbors.
It continues with simulations showing that noise that obeys a condition similar to the N-MCMC theorem improves the ground-state energy estimate in simulated quantum annealing. Figure 1.9 shows results from a noisy QA simulation that used noise-boosted path-integral Monte Carlo (PIMC) quantum annealing to improve the estimated ground state of a 1024-spin Ising spin glass system by 25.6%.

Figure 1.9: Simulated quantum annealing noise benefit in a 1024-spin Ising simulation: the pink line shows that noise improves the estimated ground-state energy of a 32×32 spin lattice by 25.6%. The plot shows the ground-state energy after 100 path-integral Monte Carlo steps as a function of noise power. The true ground-state energy (red) is E₀ = −1591.92. Each point is the average calculated ground state from 100 simulations at each noise power. The blue line shows that blind (i.i.d. sampling) noise does not benefit the simulation. This shows that the N-MCMC condition is central to the S-QA noise benefit.

1.4 Markov Chain Noise Benefits

The Markov Chain Noise Benefit Theorem (Theorem 2.1) is the next major result in this thesis. It shows that noise can speed convergence to equilibrium in a discrete finite-state Markov chain M with Markov transition matrix P. Markov chains formalize random processes that undergo transitions from one state to another. The Markov property imbues the chain with memorylessness so that the next state depends only on the current state regardless of every state that preceded it.
Markov chain models appear extensively in chemistry, statistical mechanics, genetics, speech recognition, and finance. Figure 1.10 and Figure 1.11 show how Markov chains relate naturally to models in diffusion theory and genetics. All MCMC algorithms rely on the theory underlying Markov chains to provide strong guarantees on convergence and to limit the state information algorithms must track.

Figure 1.10: The two-compartment Ehrenfest diffusion model is an example of a Markov chain. The box contains N = 20 molecules. The compartments A and B partition the box. (a) Symmetric diffusion: the model randomly selects a molecule at each time step (red circle) and moves the selected molecule to the other compartment (red arrow). This represents a Markov chain since the number of molecules in A and B in the next instant depends only upon how many molecules are in A and B now. (b) Asymmetric diffusion: the figure on the right shows how a model can introduce additional dynamics but retain the Markov property. In this figure the partition is asymmetric. It prefers to move molecules from B to A. The model is still a Markov chain because it depends only on the number of molecules in A and B at a given time.

The Markov Chain Noise Benefit Theorem shows how to construct a normalized state density at each time step for a Markov chain. The theorem was the first to suggest a noise benefit within the very general Markov chain structure. The idea is that an algorithm can use noise to perturb the state within a Markov chain and speed convergence to the steady state. Such noise leads to faster convergence because the noise reduces the error-norm components. The noise appears to give the Markov chain system access to a statistically richer set of otherwise improbable states.

Figure 1.11: The Wright-Fisher population genetics model shows how Markov chain methods can lead to predictive statistics. A Markov chain drives the evolution of the Wright-Fisher genetic model.
Each of the 6 circles at t = 0 represents an allele for a particular gene (blue = A₁ and red = A₂). The Wright-Fisher model generates the t = 1 offspring by randomly sampling the t = 0 population with replacement. The connections indicate the surviving genes and their offspring. A Markov chain result guarantees that this process converges to the same final probability of red and blue alleles independent of the initial conditions.

Theorem 1.3 (Markov Chain Noise Benefit Theorem). There exists a noise benefit for each non-stationary state density vector x:

$\left|\left(\tilde{x}P - x_\infty\right)_i\right| < \left|\left(xP - x_\infty\right)_i\right|$ (1.7)

for all states i with

$\delta_i = \left((x - x_\infty)P\right)_i \ne 0.$ (1.8)

The theorem includes the steady-state density x∞ as part of the condition but this is not in general known a priori. The thesis then presents two noisy Markov chain algorithms that arise from the Markov Chain Noise Benefit Theorem. Algorithm 2.1 shows the ideal case where the user knows or has an estimate of the steady-state density. Algorithm 2.2 covers the more typical blind case where the user does not know the steady-state density x∞. The blind noisy MC algorithm introduces a drift term to connect noise benefits from one time step to the next. A comparison using a diffusion model shows that both Markov chain noise benefit algorithms improved performance over noiseless models by 25%. The comparison also shows that the behavior of the optimal noise benefit algorithm is more stable than that of the blind algorithm. The chapter then concludes with two more simulations that demonstrate the application of the Markov Chain Noise Benefit Theorem to population genetics and complex chemical networks.

1.5 Estimating Bell Curve Tail Thickness in Symmetric α-Stable Samples

The last major result in this dissertation is the α-Stable Map Theorem. Most random models assume that the dispersion of a random variable equals its (finite) variance or its mean-squared deviation from the population mean.
Impulsive signals and noise violate this finite-variance assumption in general. α-stable random variables generalize the finite-variance Gaussian bell curve (Figure 1.12). Thick-tailed α-stable pdfs find many applications in physics and engineering where thicker tails can model energetic or impulsive processes. One main problem with using stable pdfs in physical models is: how does a user estimate alpha from sample data? The α-Stable Map Theorem (Theorem A.1) and an invertibility corollary (Corollary A.1) show how to construct a statistic from α-stable samples that maps to an estimate of the underlying tail-thickness parameter α.

Theorem 1.4 (α-Stable Map Theorem). Suppose $X_{\alpha_1,k}$ and $X_{\alpha_2,k}$ are two independent sequences of n i.i.d. SαS random variables with probability density functions $f_{\alpha_1}(x)$ and $f_{\alpha_2}(x)$. Suppose $\alpha_1, \alpha_2 \in (0, 2]$ with unit dispersion: $\gamma = 1$. Fix p > 1 and define $g_p(X)$ as

$g_p(\{X_k\}) = \dfrac{1}{n}\left\|\{X_k\}\right\|_p^p = \dfrac{1}{n}\sum_{k=1}^n |X_k|^p.$ (1.9)

Then there exists an $n_0$ and H such that

$E\left[g_p\left(X_{\alpha_1}\right) \mid \max_k\left\{X_{\alpha_1,k}\right\} = h\right] = E\left[g_p\left(X_{\alpha_2}\right) \mid \max_k\left\{X_{\alpha_2,k}\right\} = h\right]$ (1.10)

for $h \ge H$ and all sequences $X_{\alpha_1}$ and $X_{\alpha_2}$ with length $n \ge n_0$ only if $\alpha_1 = \alpha_2$.

Figure 1.12: (a) Symmetric α-stable probability density functions. The figure shows SαS probability density functions for α = 2 (Gaussian), 1.4 (super-Cauchy), 1 (Cauchy), and 0.4 (sub-Cauchy). The thickness of the bell-curve tails increases as α decreases. Thicker tails correspond to more impulsive samples with more energetic fluctuations. The Gaussian bell curve is the only SαS probability density function with finite moments of order k ≥ 2. (b) Impulsive samples from SαS random variables with unit dispersion. The figure shows SαS realizations for α = 2 (Gaussian), 1.4, 1.0 (Cauchy), and 0.4. The scale differs by two orders of magnitude between α = 2 and α = 1. The scale differs by four orders of magnitude between α = 1 and α = 0.4.
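The statistic $g_p$ of equation (1.9) is easy to compute directly. The sketch below contrasts it for Gaussian (α = 2) and Cauchy (α = 1) samples; the choice p = 1.5 is an arbitrary illustration of a value p > 1, and since p > 1 exceeds the Cauchy tail index, the Cauchy statistic is driven by the largest samples, which is exactly the impulsiveness the theorem exploits.

```python
import numpy as np

def g_p(samples, p=1.5):
    """Sample statistic g_p({X_k}) = (1/n) * sum_k |X_k|^p from eq. (1.9).
    p > 1 is a free parameter; p = 1.5 here is illustrative."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(np.abs(x) ** p))

rng = np.random.default_rng(7)
n = 100_000
gauss = rng.standard_normal(n)      # alpha = 2: finite variance, no impulses
cauchy = rng.standard_cauchy(n)     # alpha = 1: thick tails, E|X|^p infinite
# the thick Cauchy tails inflate g_p far above the Gaussian value
```

For the Gaussian case $E|X|^p$ is finite (about 0.86 at p = 1.5), so the statistic concentrates; for the Cauchy case the population moment is infinite and the sample statistic grows with n, which is why the theorem conditions on the sample maximum.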
Only the Gaussian samples have finite variance and no impulsiveness.

The theorem and a corollary act in concert to construct a bijection from a function of the samples to α. This dissertation describes the bootstrap-driven BEAST (Bootstrap Estimate of Alpha STable) algorithm as a practical way to estimate α from small or large data sets. Comparisons with four other standard α-estimators show that the algorithm matches or beats the performance of each with small samples (N = 10) but that the estimators become indistinguishable as the number of samples increases.

The thesis ends with discussions of ongoing work to establish noise benefits in several MCMC applications including Gibbs sampling and particle filters. It also posits that noisy MCMC methods may provide a framework for understanding complex biochemical and chemical problems in the form of intracellular kinesin ratchets along microtubules and the carbon-nanotube-enhanced cation exchange capacity of soils and clays.

Chapter 2

Noise Can Speed Convergence in Markov Chains

This chapter shows that noise can speed convergence to equilibrium in discrete-time finite-state Markov chains. The noise applies to the state density and helps the Markov chain explore improbable regions of the state space. The noise gives the Markov chain system access to a statistically richer set of otherwise improbable states.

The Markov Chain Noise Benefit Theorem shows that a stochastic-resonance noise benefit exists for states that obey a vector-norm inequality. Judiciously adding noise directly to the state density speeds up the convergence time for the Markov chain simulation depending on the direction of the inequality. Such noise leads to faster convergence because the noise reduces the error-norm components. A corollary shows that a noise benefit still occurs if the system states obey an alternate norm inequality. This leads to a noise-benefit algorithm that requires knowledge of the steady state.
An alternative blind algorithm uses only past state information to achieve a weaker noise benefit.

Simulations illustrate the predicted noise benefits in three well-known Markov models. The first model is a two-parameter Ehrenfest diffusion model that shows how noise benefits can occur in the class of birth-death processes. The second model is a Wright-Fisher model of genotype drift in population genetics. The third model is a chemical reaction network of zeolite crystallization. A fourth simulation shows a convergence rate increase of 64% for states that satisfy the theorem and an increase of 53% for states that satisfy the corollary. A final simulation shows that even suboptimal noise can speed convergence if the noise applies over successive time cycles. Noise benefits tend to be sharpest in Markov models that do not converge quickly and that do not have strong absorbing states.

2.1 Review of Markov Chains

Markov chain simulations employ a stochastic discrete-time model to estimate the probability density over a system's state space. Suppose M is a time-homogeneous Markov chain over a finite state space with N < ∞ states [294, 139, 242]. Let the N-vector x(t) represent the state of the Markov chain at time t. Each component $x_i(t)$ represents the probability that the chain is in the corresponding state i at time t. Then

$\sum_{i=1}^N x_i(t) = 1$ (2.1)

for all t because x(t) is a probability density over the N states. Figure 2.1 shows a schematic representation of a finite-state Markov chain with the size of each state indicating the relative state-occupancy probability.

Figure 2.1: A schematic representation of a finite-state Markov chain. The size of each state indicates the relative state-occupancy probability. The arrows between states indicate permissible state transitions. This Markov chain represents a set of webpages and the size indicates the relative importance of each page. The transition arrows indicate links between pages.
The pages with the most links from the most important siblings are the most important.

Let P represent the single-step state transition probability matrix where

$P_{i,j} = P(x_{t+1} = j \mid x_t = i)$ (2.2)

is the probability that the chain in state i at time t moves to state j at time t + 1. Then there exists a stationary vector x∞ such that

$x_\infty = x_\infty P$ (2.3)

[242]. So x∞ is always a left eigenvector of the transition probability matrix P that corresponds to the eigenvalue λ = 1. The n-step transition probability matrix $P^{(n)}$ has entries

$P^{(n)}_{i,j} = P(X_{t+n} = j \mid X_t = i)$ (2.4)
$\quad = \sum_{k=1}^N P(X_{t+n} = j \mid X_t = i, X_{t+1} = k)\, P(X_{t+1} = k \mid X_t = i)$ (2.5)
$\quad = \sum_{k=1}^N P(X_{t+n} = j \mid X_{t+1} = k)\, P(X_{t+1} = k \mid X_t = i)$ (2.6)
$\quad = \sum_{k=1}^N P^{(n-1)}_{k,j} P_{i,k}$ (2.7)

where $P^{(n)}_{i,j}$ is the probability that the chain transitions from state i to state j in exactly n time steps. State j is accessible from state i if there is some non-zero probability of transitioning from state i to state j in some number of steps:

$P^{(n)}_{i,j} > 0$ (2.8)

for some n > 0.

A Markov chain is irreducible if every state is accessible from every other state [294, 242]. Irreducibility implies that for all states i and j there exists m > 0 such that

$P(X_{n+m} = j \mid X_n = i) = P^{(m)}_{i,j} > 0.$ (2.9)

This is equivalent to P being a regular stochastic matrix when M is a finite Markov chain. Figure 2.2 shows an example of a reducible finite Markov chain. State 7 is not accessible from state 8.

Figure 2.2: An example of a reducible Markov chain. Only state 6 is accessible from state 8. Once the Markov chain transitions to state 8 it cannot transition to state 7 in any number of steps. Markov chain algorithms can deal with these issues by adding transitions with vanishingly small probabilities between inaccessible states.

The period $d_i$ of state i is

$d_i = \gcd\left\{n \ge 1 : P^{(n)}_{i,i} > 0\right\}$ (2.10)

or $d_i = \infty$ if $P^{(n)}_{i,i} = 0$ for all n ≥ 1 where gcd denotes the greatest common divisor. State i is aperiodic if $d_i = 1$.
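The fixed point of equation (2.3) and the rank-one limit described in the next section can both be checked numerically by iterating x ← xP. The 3-state matrix below is invented for this sketch; any irreducible aperiodic stochastic matrix would do.

```python
import numpy as np

# An illustrative irreducible, aperiodic 3-state chain (rows sum to 1);
# the entries are invented for this sketch, not taken from the dissertation.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

def stationary(P, tol=1e-12, max_iter=10_000):
    """Power-iterate the state density x <- xP until it stops changing.
    The fixed point satisfies x_inf = x_inf P, eq. (2.3)."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])   # uniform starting density
    for _ in range(max_iter):
        x_next = x @ P
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

x_inf = stationary(P)

# for an irreducible aperiodic chain, P^(k) tends to the rank-one matrix
# 1 * x_inf (eq. 2.11): every row of a high power approaches x_inf
Pk = np.linalg.matrix_power(P, 200)
```

The power-iteration fixed point is the same left eigenvector of P for eigenvalue 1 that the text describes; starting from any density gives the same limit.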
A Markov chain with transition matrix P is aperiodic if $d_i = 1$ for all states i. Figure 2.3 shows an example of a periodic Markov chain.

Figure 2.3: An example of a periodic Markov chain. The Markov chain state transition matrix has eigenvalue λ = −1 so it is periodic. One means of resolving periodicity is to add a self-transition on at least one state.

Suppose a Markov chain M is irreducible and aperiodic. Then the fixed point x∞ is unique and

$\lim_{k\to\infty} P^{(k)} = \mathbf{1}\, x_\infty$ (2.11)

where 1 is the column vector with all entries equal to 1 [139, 319]. The outer product generates a rank-one N × N matrix with each row equal to the stationary state density.

The next section (§2.2) reviews concepts behind stochastic resonance (SR) and the more general concept of noise benefits. It then continues with noise benefits in the context of Markov chain algorithms and illustrates the intuition behind the benefit.

2.2 Noise Benefits and Stochastic Resonance

Stochastic resonance is a characteristic type of noise benefit: small amounts of noise improve system performance while too much noise degrades it. Many nonlinear signal systems benefit from adding small amounts of noise [197, 145, 237, 365, 210, 268, 198, 271, 272, 6, 38, 141, 305, 270]. A noise benefit occurs when noise improves a signal system's performance. Too little noise produces little or no benefit while too much noise can swamp the system's performance. This so-called "stochastic resonance" effect can take the form of an increased signal-to-noise ratio [245, 307, 142], entropy-based bit count [236, 246, 274], input-output correlation [267], Fisher information [52, 51], probability of detection [55, 181], or cross-correlation [63, 65]. The noise benefit for a simulated Markov chain is a shorter time to converge to the equilibrium probability density in the sense that the noise reduces the vector norm of the error.
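The classic SR testbed that appears later in Figure 2.5, the noisy forced quartic bistable system $\dot{x} = x - x^3 + A\sin\omega_0 t + n(t)$, can be sketched with a simple Euler-Maruyama integration. The amplitude, frequency, and noise level below are invented for the illustration; A = 0.3 keeps the periodic forcing subthreshold so that only noise can carry the state over the barrier.

```python
import math
import random

def quartic_bistable(T=2000.0, dt=0.01, A=0.3, w0=0.01, sigma=0.0,
                     x0=-1.0, seed=3):
    """Euler-Maruyama sketch of dx/dt = x - x^3 + A sin(w0 t) + n(t)
    with binary output y = sgn(x). Constants are illustrative only."""
    rng = random.Random(seed)
    x, t, ys = x0, 0.0, []
    for _ in range(int(T / dt)):
        drift = x - x ** 3 + A * math.sin(w0 * t)
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        ys.append(1 if x >= 0 else -1)
    return ys

quiet = quartic_bistable(sigma=0.0)   # subthreshold: stays in the left well
noisy = quartic_bistable(sigma=0.5)   # noise assists hopping between wells
```

Without noise the output never switches, so it carries no trace of the forcing signal; moderate noise produces signal-correlated well hopping, and (as the figure shows) still larger noise would swamp the signal again.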
Benzi [21] first characterized a noise benefit by showing that some dynamical systems subject to periodic forcing and random perturbation may show stochastic resonance. Resonance in this context refers to a peak in the power spectrum that is absent when either input acts alone. Early investigations of stochastic resonance focused on natural systems in biology, chemistry, and physics. Research in SR has grown from the study of external periodic forcing signals to the study of more complex dynamical systems [44, 50, 56, 63, 64, 65, 62, 149, 114, 221, 275, 331].

Figure 2.4 [245, 201] shows a typical noise benefit. The figure shows noiseless versions of three standard test images on the left (mandrill, Lena, and Elaine) after passing through a suboptimal binary threshold filter. Images to the right use increasing power of Gaussian pixel noise. With weak noise (the second column) details in the faint images become clear. Further increasing the noise power (third and fourth columns) degrades the signal beyond recognition. The example shows how certain types of noise can improve nonlinear signal detection. Noise typically leads to this type of non-monotonic effect on system performance as a function of noise power.

The classical SR signature is a signal-to-noise ratio (SNR) that is not monotone in the noise power. Figure 2.5 shows the classical non-monotonic stochastic resonance effect. The figure shows the SR effect for a quartic-bistable dynamical system. The SNR rises to a maximum and then falls as the variance of the additive white noise increases. More complex systems may have multimodal SNRs and show stochastic multiresonance [113, 351].

2.3 Noise Benefits in Markov Chain Density Estimation

Markov chains form a basis for powerful Markov chain Monte Carlo (MCMC) statistical simulations [294]. MCMC methods generate samples from a given posterior probability density function by constructing a Markov chain whose stationary density equals the posterior of interest [248, 341].
The Metropolis-Hastings algorithm [240, 146] and Gibbs samplers [119, 117] are special and powerful MCMC frameworks that compute Bayesian statistics.

Figure 2.4: Stochastic resonance on faint images (mandrill and Lena test images) using white Gaussian pixel noise. The images are faint because the gray-scale pixels pass through a suboptimal binary threshold. The faint images (leftmost panels) become clearer as the power of the additive white Gaussian pixel noise increases. But increasing the noise power too much degrades the image beyond recognition (rightmost panels).

But MCMC methods suffer from problem-specific parameters that govern sample acceptance and convergence assessment [335, 127]. Strong dependence on initial conditions also biases the MCMC sampling unless the simulation allows a lengthy period of "burn-in" so that the driving Markov chain mixes adequately [69, 294].

The Markov Chain Noise Benefit Theorem in the next section shows how to construct a normalized state density at each time cycle for a finite time-homogeneous Markov chain with an irreducible and aperiodic state transition matrix. The theorem and corollary guarantee the existence of a component-wise noise benefit that decreases the time to convergence. They show that simulations can perturb the current state of a Markov chain to explore novel regions in the state space and speed convergence to the steady-state distribution. The form of the noise depends on the direction of a state-related inequality. The theorem may ensure only minimal benefits for systems that exhibit fast convergence or that possess strong absorbing states.

Section 2.5 presents two algorithms that use the Markov Chain Noise Benefit Theorem to obtain a noise benefit. The first algorithm shows how the simulation can obtain

Figure 2.5: The non-monotonic signature of stochastic resonance.
The graph shows the smoothed output signal-to-noise ratio of a quartic bistable system as a function of the standard deviation of additive white Gaussian noise. The vertical dashed lines show the absolute deviation between the smallest and largest outliers in each sample average of outcomes. The system has a nonzero noise optimum and thus shows the SR effect. The noisy signal-forced quartic bistable dynamical system has the form

$\dot{x} = f(x) + s(t) + n(t) = x - x^3 + \sin\omega_0 t + n(t)$

with binary output $y(t) = \operatorname{sgn}(x(t))$. The Gaussian noise n(t) adds to the external narrowband forcing signal $s(t) = \sin\omega_0 t$.

an optimal noise benefit. The second algorithm describes how to obtain a noise benefit that uses only the current and past states of the Markov chain. A key limitation in applying this result to MCMC is that the system does not usually have direct access to the current state vector during the MCMC simulation. Table 2.3 shows that systems can still benefit from noise even without direct access to the state vector. Suitable guesses for the sign of the inequality should help further overcome this limitation in practice.

2.4 Markov Chain Noise Benefit Theorem

The Markov Chain Noise Benefit Theorem shows that Markov chain simulations can benefit from noise through faster convergence. The theorem shows that there is a component-wise noise benefit for any component that has not yet converged to its stationary value. The theorem assumes that the sign of a state-related inequality is in one of two directions. The corollary assumes it is in the other direction.

Theorem 2.1. Suppose M is a finite time-homogeneous Markov chain with N states and transition matrix P. Suppose further that M is irreducible and aperiodic.
Then for all non-stationary state density vectors x there exists a noise benefit in the sense that there exists some A > 0 so that for all a ∈ (0, A):

$\left|\left(\tilde{x}P - x_\infty\right)_i\right| < \left|\left(xP - x_\infty\right)_i\right|$ (2.12)

for all states i with

$\delta_i = \left((x - x_\infty)P\right)_i > 0$ (2.13)

where

$\tilde{x} = \dfrac{1}{1 + a}(x + n)$ (2.14)

is the normalized state vector after adding a noise vector n with only one non-zero component:

$n_j = \begin{cases} a & j = k \\ 0 & j \ne k \end{cases}$ (2.15)

for any k that satisfies

$\delta_k = \left((x - x_\infty)P\right)_k > 0.$ (2.16)

Proof. Fix x as a state vector of the Markov chain M. Note first that $\tilde{x}$ is a probability density function over the states of M because of (a) and (b) below:

(a) $\tilde{x}$ is an N-vector with $\tilde{x}_i \ge 0$ since

$\tilde{x}_i = \left[\dfrac{1}{1+a}(x + n)\right]_i = \begin{cases} \dfrac{1}{1+a}\,x_i & i \ne k \\ \dfrac{1}{1+a}\,(x_i + a) & i = k \end{cases} \ \ge\ 0$

since a > 0 and $x_i \ge 0$.

(b) $\sum_i \tilde{x}_i = 1$ since

$\sum_{i=1}^N \tilde{x}_i = \dfrac{1}{1+a}\left(\sum_{i=1}^N x_i + \sum_{i=1}^N n_i\right) = \dfrac{1}{1+a}(1 + a) = 1.$

Note next that

$\left|\left(xP - x_\infty\right)_i\right| = \left|\left(xP - x_\infty P\right)_i\right| = \left|\left((x - x_\infty)P\right)_i\right| = |\delta_i|.$

The proof proceeds by showing that such an $A_i$ exists for each component i that satisfies $\delta_i = ((x - x_\infty)P)_i > 0$. This will complete the proof because $(0, A) = \cap_{i=1}^N (0, A_i) \ne \emptyset$ since N < ∞ and each interval $(0, A_i)$ is non-empty.

Let i with 1 ≤ i ≤ N be any state that satisfies the inequality $\delta_i = ((x - x_\infty)P)_i > 0$. Choose k with 1 ≤ k ≤ N and define

$\tilde{x} = \dfrac{1}{1 + a_i}(x + n)$ (2.28)

with

$n_j = \begin{cases} a_i & j = k \\ 0 & j \ne k \end{cases}$ (2.29)

and $a_i > 0$. Then

$\left(\tilde{x}P - x_\infty\right)_i = \left(\tilde{x}P\right)_i - \left(x_\infty\right)_i = \left(\tilde{x}P\right)_i - \left(x_\infty P\right)_i$ (2.30–2.31)

since $x_\infty = x_\infty P$.
Expand $\tilde{x}$:

$\left(\tilde{x}P - x_\infty\right)_i = \sum_{j=1}^N \tilde{x}_j P_{j,i} - \sum_{j=1}^N x_{\infty,j} P_{j,i}$ (2.32)
$\quad = \dfrac{1}{1+a_i}\sum_{j=1}^N (x_j + n_j) P_{j,i} - \sum_{j=1}^N x_{\infty,j} P_{j,i}.$ (2.33–2.35)

Then add $0 = \dfrac{a_i}{1+a_i}\sum_j x_j P_{j,i} - \dfrac{a_i}{1+a_i}\sum_j x_j P_{j,i}$ and group:

$\left(\tilde{x}P - x_\infty\right)_i = \sum_{j=1}^N x_j P_{j,i} - \sum_{j=1}^N x_{\infty,j} P_{j,i} - \dfrac{1}{1+a_i}\left(a_i \sum_{j=1}^N x_j P_{j,i} - \sum_{j=1}^N n_j P_{j,i}\right)$ (2.36–2.38)
$\quad = \left(xP\right)_i - \left(x_\infty P\right)_i - \dfrac{1}{1+a_i}\left(a_i \left(xP\right)_i - a_i P_{k,i}\right)$ (2.39–2.40)
$\quad = \delta_i - \dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right).$ (2.41)

So

$\left(\tilde{x}P - x_\infty\right)_i = \delta_i - \dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right).$ (2.42)

Now $\delta_i > 0$ by hypothesis. Thus

$\left|\delta_i - \dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right)\right| < |\delta_i|$ (2.43)

if and only if

$\dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right) > 0$ (2.44)

and

$\dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right) < 2\delta_i$ (2.45)

since $|\delta_i| > |\delta_i - b|$ if and only if $0 < b < 2\delta_i$. The positivity constraint (2.44) holds if and only if $(xP)_i > P_{k,i}$. The upper bound (2.45) holds if and only if

$a_i\left(\left(xP\right)_i - P_{k,i}\right) < 2\delta_i (1 + a_i).$ (2.46)

Therefore (2.45) holds if and only if

$a_i\left(\left(xP\right)_i - P_{k,i} - 2\delta_i\right) < 2\delta_i.$ (2.47)

If $2\delta_i < (xP)_i - P_{k,i}$ then

$a_i < \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}$ (2.48)

and if $2\delta_i > (xP)_i - P_{k,i}$ then

$a_i > \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}.$ (2.49)

But if $2\delta_i > (xP)_i - P_{k,i}$ then $\dfrac{2\delta_i}{(xP)_i - P_{k,i} - 2\delta_i} < 0$. So any $a_i > 0$ suffices. Thus either

$a_i > 0$ if $2\delta_i > (xP)_i - P_{k,i}$ (2.50)

or

$a_i < \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}$ if $2\delta_i < (xP)_i - P_{k,i}$. (2.51)

Therefore if $a_i \in (0, A_i)$ with $A_i = \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i} > 0$ then (2.50) and (2.51) hold.
So if $A = \min_i\{A_i\} > 0$ then the theorem holds for all states i that satisfy the inequality $\delta_i = ((x - x_\infty)P)_i > 0$. ∎

The following corollary provides a complementary result when the converse of inequality (2.13) holds ($\delta_i < 0$) in the Markov Chain Noise Benefit Theorem.

Corollary 2.1. Suppose the hypotheses of the Markov Chain Noise Benefit Theorem hold. Then there exists a noise benefit for each non-stationary state density vector x in the sense that there exists some A > 0 so that for all a ∈ (0, A):

$\left|\left(\tilde{x}P - x_\infty\right)_i\right| < \left|\left(xP - x_\infty\right)_i\right|$ (2.52)

for all states i with

$\delta_i = \left((x - x_\infty)P\right)_i < 0.$ (2.53)

Proof. The $\delta_i$ sign change does not affect the expansion in the proof of the theorem. So

$\left(\tilde{x}P - x_\infty\right)_i = \delta_i - \dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right)$ (2.54)

holds. Now $\delta_i < 0$ by hypothesis. Thus

$\left|\delta_i - \dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right)\right| < |\delta_i|$ (2.55)

if and only if

$\dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right) > 2\delta_i$ (2.56)

and

$\dfrac{a_i}{1+a_i}\left(\left(xP\right)_i - P_{k,i}\right) < 0$ (2.57)

since $|\delta_i| > |\delta_i - b|$ if and only if $2\delta_i < b < 0$. The negativity constraint (2.57) holds if and only if $(xP)_i < P_{k,i}$. The lower bound (2.56) holds if and only if

$a_i\left(\left(xP\right)_i - P_{k,i}\right) > 2\delta_i (1 + a_i).$ (2.58)

Therefore (2.56) holds if and only if

$a_i\left(\left(xP\right)_i - P_{k,i} - 2\delta_i\right) > 2\delta_i.$ (2.59)

If $2\delta_i < (xP)_i - P_{k,i}$ then

$a_i > \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}$ (2.60)

and if $2\delta_i > (xP)_i - P_{k,i}$ then

$a_i < \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}.$ (2.61)

But if $2\delta_i < (xP)_i - P_{k,i}$ then $\dfrac{2\delta_i}{(xP)_i - P_{k,i} - 2\delta_i} < 0$. So any $a_i < 0$ suffices. Thus either

$a_i < 0$ if $2\delta_i > (xP)_i - P_{k,i}$ (2.62)

or

$a_i > \dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}$ if $2\delta_i < (xP)_i - P_{k,i}$. (2.63)

Therefore if $a_i \in (-A_i, 0)$ with $A_i = \left|\dfrac{2\delta_i}{\left(xP\right)_i - P_{k,i} - 2\delta_i}\right| > 0$ then (2.62) and (2.63) hold. So if $A = \min_i\{A_i\} > 0$ then the corollary holds for all states i that satisfy the inequality $\delta_i = ((x - x_\infty)P)_i < 0$. ∎

2.5 Markov Chain Noise Benefit Algorithms

This section presents two versions of the Markov Chain Noise Benefit Algorithm. The first algorithm shows how a Markov chain simulation can apply the Markov Chain Noise Benefit Theorem directly to realize an optimal noise benefit.
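The component-wise inequality (2.12) can be verified numerically. The sketch below uses an invented 3-state chain chosen so that the theorem's hypotheses hold for i = 0 and k = 1: both $\delta_0$ and $\delta_1$ are positive and the positivity condition $(xP)_0 > P_{1,0}$ is satisfied, so a small noise sample a moves the noisy density closer to equilibrium in component 0.

```python
import numpy as np

# Illustrative 3-state chain (values are not from the dissertation).
P = np.array([[0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4],
              [0.2, 0.3, 0.5]])

# steady state: left eigenvector of P for eigenvalue 1, normalized to sum 1
w, v = np.linalg.eig(P.T)
x_inf = np.real(v[:, np.argmin(np.abs(w - 1.0))])
x_inf /= x_inf.sum()

x = np.array([0.6, 0.3, 0.1])       # a non-stationary state density
delta = (x - x_inf) @ P             # per-state error delta_i, eq. (2.13)

i, k = 0, 1                         # here delta_0 > 0, delta_1 > 0,
a = 0.01                            # and (xP)_0 > P_{1,0}: theorem applies
n = np.zeros(3)
n[k] = a                            # one-component noise, eq. (2.15)
x_tilde = (x + n) / (1.0 + a)       # normalized noisy state, eq. (2.14)

before = abs((x @ P - x_inf)[i])    # |(xP - x_inf)_i|
after = abs((x_tilde @ P - x_inf)[i])
# after < before: the noisy density is closer to equilibrium in component i
```

Per equation (2.42) the improvement here is exactly $\frac{a}{1+a}\left((xP)_0 - P_{1,0}\right)$, small but strictly positive for any a in the theorem's interval.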
The second algorithm shows a practical implementation that uses only the current and past states of the simulation.

Algorithm 2.1 shows a naive application of the Markov Chain Noise Benefit Theorem. The green lines on Figures 2.6, 2.9, and 2.12 show simulation results from this algorithm. This algorithm has the practical limitation that it requires prior knowledge of the steady-state distribution. The algorithm finds the component with the smallest state error at each step. It then adds signed noise to compensate for the error.

Algorithm 2.1 The Optimal Markov Chain Noise Benefit Algorithm
 1: procedure MarkovChain(x_0, P, x_∞)
 2:   x_t ← x_0
 3:   repeat
 4:     x_t ← x_t P
 5:     x_t ← NoisyStep(x_t, P, x_∞)
 6:   until isConverged(x_t)
 7:   return x_t
 8: procedure NoisyStep(x_t, P, x_∞)
 9:   n_t ← CalcNoise(x_t, P, x_∞)
10:   x̃_t ← (x_t + n_t) / (1 + Σ n_t)
11:   return x̃_t
12: procedure CalcNoise(x_t, P, x_∞)
13:   Δ ← (x_t − x_∞) P
14:   L ← Length(Δ)
15:   A ← Δ[0]
16:   k ← 0
17:   for j ← 1, L do
18:     if |Δ[j]| < A then
19:       A ← Δ[j]
20:       k ← j
21:   n ← ZeroVector(L)
22:   n[k] ← −A
23:   return n

Algorithm 2.2 overcomes the limitation of Algorithm 2.1 because it does not require knowledge of the steady-state values. It uses only the past state probabilities to determine the noise at each time step. Algorithm 2.2 picks the state that changes the most at each time step and then adds noise to drive that state further in its current direction. The pink lines on Figures 2.6, 2.9, and 2.12 show that on average Algorithm 2.2 speeds convergence in the three Markov chain simulations.
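For concreteness, both procedures can be sketched in NumPy. This is an illustration rather than the dissertation's code: the max-norm convergence tests, the sign convention n[k] = −Δ[k] in the optimal version, and the stopping rule in the blind version are assumptions read off the listings.

```python
import numpy as np

def optimal_noisy_chain(x0, P, x_inf, tol=1e-9, max_steps=10_000):
    """Sketch of Algorithm 2.1: requires the stationary density x_inf."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        x = x @ P                          # ordinary Markov step
        delta = (x - x_inf) @ P            # signed one-step state errors
        k = int(np.argmin(np.abs(delta)))  # component with smallest error
        n = np.zeros_like(x)
        n[k] = -delta[k]                   # signed noise that cancels it
        x = (x + n) / (1.0 + n.sum())      # renormalize the noisy density
        if np.abs(x - x_inf).max() < tol:  # isConverged
            break
    return x

def blind_noisy_chain(x0, P, tol=1e-12, max_steps=10_000, seed=0):
    """Sketch of Algorithm 2.2: uses only current and past densities."""
    rng = np.random.default_rng(seed)
    x_prev = np.asarray(x0, dtype=float)
    n_prev = np.zeros_like(x_prev)
    x = x_prev @ P
    for _ in range(max_steps):
        delta = x - n_prev - x_prev        # per-state change this step
        k = int(np.argmax(np.abs(delta)))  # fastest-moving component
        n = np.zeros_like(x)
        n[k] = delta[k] * rng.uniform()    # push it further, same direction
        x_noisy = (x + n) / (1.0 + n.sum())
        if np.abs(x_noisy - x).max() < tol:
            return x_noisy
        x_prev, n_prev = x, n
        x = x_noisy @ P
    return x

# Quick check on a random 5-state chain: both settle on the stationary density.
rng = np.random.default_rng(7)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)
w, V = np.linalg.eig(P.T)
x_inf = np.real(V[:, np.argmax(np.real(w))])
x_inf /= x_inf.sum()
x0 = np.full(5, 0.2)
x_opt = optimal_noisy_chain(x0, P, x_inf)
x_blind = blind_noisy_chain(x0, P)
```

Both routines return densities that match the chain's stationary vector to within the chosen tolerance on this example.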
Algorithm 2.2 The Blind Markov Chain Noise Benefit Algorithm
 1: procedure MarkovChain(x_0, P)
 2:   x_t ← x_0
 3:   repeat
 4:     x_t ← x_t P
 5:     x_t ← NoisyStep(x_t, P, x_{t−1})
 6:   until isConverged(x_t)
 7:   return x_t
 8: procedure NoisyStep(x_t, P, x_{t−1})
 9:   n_t ← CalcNoise(x_t, n_{t−1}, x_{t−1})
10:   x̃_t ← (x_t + n_t) / (1 + Σ n_t)
11:   return x̃_t
12: procedure CalcNoise(x_t, n_{t−1}, x_{t−1})
13:   Δ ← x_t − n_{t−1} − x_{t−1}
14:   L ← Length(Δ)
15:   A ← Δ[0]
16:   k ← 0
17:   for j ← 1, L do
18:     if |Δ[j]| > A then
19:       A ← Δ[j]
20:       k ← j
21:   n ← ZeroVector(L)
22:   n[k] ← Uniform(0, A)
23:   return n

2.6 Markov Chain Experimental Results

The simulations below show that the proposed noise benefit applies to a wide range of Markov chain models. The three simulations show the evolution of the state density by direct computation of x_{t+1} = x_t P. Figures 2.6, 2.9, and 2.12 show the probability of several states over time. The first simulation applies noise to the 2-parameter Ehrenfest diffusion model. The simulation reaches steady state about 24% faster than the simulation without noise and provides evidence that the Markov Chain Noise Benefit Theorem can apply to birth-death processes. The second simulation demonstrates that the Wright-Fisher population genetics model benefits from noise by decreasing the time to convergence. The third simulation shows that noise can speed simulations of a proposed chemical reaction whose state transition matrix derives from empirical measurement data.

2.6.1 Noise Benefits in the Ehrenfest Diffusion Model

The first simulation shows a noise benefit in the Ehrenfest diffusion model. Paul Ehrenfest proposed a diffusion model in the early 1900s as a statistical interpretation of the second law of thermodynamics [182, 89]. The model demonstrates the increase in entropy of a closed system over time [192]. The simulation shows that the noise benefit theorem applies to a class of Markov models called birth-death processes.
A birth-death process has the constraint $P_{ij} = 0$ if $|i - j| > 1$ [89, 261, 202, 23, 80]. The simulation also demonstrates a noise benefit in a model that converges only in distribution.

Figure 2.6 illustrates the noise benefit in an N = 12 molecule Ehrenfest diffusion simulation. Table 2.1 shows how each state i corresponds to a distribution of 12 molecules divided between two compartments A and B.

State i   No. molecules in A   No. molecules in B
   1             12                    0
   2             11                    1
   ⋮              ⋮                    ⋮
  12              1                   11
  13              0                   12

Table 2.1: Number of molecules (N = 12) per compartment in simulation state i

The simulation employed a 2-parameter generalized model with s = 0.10 and t = 0.90. The figure shows that the components approach their steady-state values 24.2% faster on average with added noise (error < 0.5% of steady state).

The simplest Ehrenfest diffusion model uses a rectangular container with a permeable membrane separating two equally sized compartments called compartment A and compartment B [315, 14, 281]. The container holds N gas molecules that the membrane allows to pass between compartments (Figure 2.7).

Figure 2.6: Noise benefits in the 2-parameter (Krafft-Schaefer) Ehrenfest diffusion model. [Panels (a)-(d) plot Prob X_i(t) against time t for states i = 1 to 4, each comparing the no-noise, optimal-noise, suboptimal-noise, and blind-noise curves.] Noise enhances the 2-parameter Ehrenfest diffusion model by decreasing the time to convergence. The figures summarize the results from a 12-molecule simulation with s = 0.10 and t = 0.90. These figures show the time evolution of the first 4 components of the 13-component state vector X(t) (corresponding to [X_∞]_i > 0.002). Each component of the state vector gives the probability of a particular distribution of the 12 molecules between compartments A and B.
Case i = 1 corresponds to all 12 molecules in box A and i = 2 corresponds to 11 molecules in compartment A and 1 molecule in compartment B. The blue (dotted) curve plots the standard (no noise) Ehrenfest diffusion model. The green (dashed), red (dash-dot), and pink (solid) curves show noisy versions of the model. The noise benefit appears in the distinct shift to the left of the noise-enhanced simulations over the standard model. This shows that the simulations reach steady state sooner. The green (dashed) curve shows a simulation using the optimal noise N_opt according to the Markov Chain Noise Benefit Theorem and the red (dash-dot) curve shows the result by choosing suboptimal noise uniformly in [0, N_opt]. The pink (solid) curve shows the results of Algorithm 2.2. Algorithm 2.2 does not require prior knowledge or an estimate of the steady-state distribution. The figures show that this system nears steady state within 60 time steps.

Figure 2.7: Two-compartment Ehrenfest diffusion model. The figure illustrates the diffusion experiment of Ehrenfest. The box contains N = 20 molecules. The compartments A and B partition the box. x(t) represents the number of molecules in box A at time t. This example assumes x(t) = 16. The simulation randomly selects a molecule at each time step (red circle) and moves the selected molecule to the other compartment (red arrow). Here x(t+1) = 16 − 1 = 15 since one molecule moves from A to B. The model exhibits a dynamic equilibrium because molecules continue to shuttle across the membrane for all t. In particular: x(t) ≠ x(t+1). So the occupancy x(t) converges in distribution.

The model randomly selects a molecule at each time step t and then moves that molecule to the other compartment. x(t) denotes the number of molecules in compartment A at each time step. So x(t) ∈ {0, 1, 2, …, N}. The simulation tends toward a steady-state distribution with maximal entropy as t → ∞ [225].
The Ehrenfest model is a birth-death process because x(t) either increments or decrements by one at each time step [266]. Suppose the container contains N molecules and has 0 < M < N molecules in compartment A at time t. Then x(t) = M and

$$P\left[x(t+1) = M - 1\right] = \frac{M}{N}$$
$$P\left[x(t+1) = M + 1\right] = 1 - \frac{M}{N}.$$

The Markov chain x(t) evolves according to the state transition matrix P where

$$P_{ij} = \begin{cases} \dfrac{N-i}{N} & j = i+1 \\[4pt] \dfrac{i}{N} & j = i-1 \\[4pt] 0 & \text{else} \end{cases}$$

for $0 \le i, j \le N$ [175]. This model converges in distribution since $x(t) \ne x(t+1)$ for all $t$.

The Krafft-Schaefer extension adds two new parameters to the Ehrenfest diffusion model to model asymmetry between transitions from A → B and B → A [205]. The two parameters s and t characterize the transition asymmetry and scale the respective conditional transition probabilities from A → B and B → A [76, 175]. This corresponds physically to the membrane "preferring" diffusion in one direction over the other (Figure 2.8).

Figure 2.8: Two-compartment Krafft-Schaefer asymmetric diffusion model. The figure illustrates the membrane "preference" in the asymmetric Krafft-Schaefer diffusion model. Here s ≪ t. So P[B → A] ≫ P[A → B] for a particular molecule (indicated by the relative size of the arrows). The asymmetry shifts the equilibrium to the left so that more molecules tend to accumulate in A at steady state.

The generalized diffusion model evolves as a birth-death process with state transition matrix P where

$$P_{ij} = \begin{cases} \dfrac{N-i}{N}\,s & j = i+1 \\[4pt] \dfrac{i}{N}\,t & j = i-1 \\[4pt] \dfrac{1}{N}\left[(1-s)N + i(s-t)\right] & j = i \\[4pt] 0 & \text{else} \end{cases}$$

for $s, t \in [0, 1]$ and integers $0 \le i, j \le N$ [76]. The Krafft-Schaefer model converges with probability one only for the trivial case where one of the compartments is a perfect sink
This represents starting the diusion simulation with uncertainty in the system’s config- uration. The simulation used s = 0:10 and t = 0:90 to slow convergence and highlight the noise benefit. The asymmetry due to s = 0:10 and t = 0:90 collapses the dominant eigengapj 1 jj 2 j where i is the i th largest magnitude eigenvalue. This increases the time for the simulation to reach steady state. A similar benefit exists for all s and t in (0;1). A wider eigengapj 1 jj 2 j ensures that the chain quickly converges toward steady state. This results in a smaller noise benefit. 40 2.6.2 Noise Benefits in a Population Genetics Model The second simulation shows a noise benefit in the Wright-Fisher population genetics model. The Wright-Fisher model uses a Markov chain to simulate stochastic genotypic drift during successive generations [108, 367, 105]. Figure 2.9 illustrates the noise benefit in a simulation with 2 alleles and N = 50 diploid individuals. The Wright-Fisher model applies generally to populations under the following assumptions [16]: 1. the population size N remains constant between generations 2. no selective dierence between alleles 3. non-overlapping generations. Consider a gene with 2 alleles (A 1 and A 2 ) in a population with N diploid individ- uals. The population contains 2N copies of the gene since each diploid individual has 2 copies of the gene. Let the state vector x(t) represent the allele distribution at time t [360]. Then at time t: x 0 (t) = P 0 copies A 1 ;2N copies A 2 x 1 (t) = P 1 copies A 1 ;2N 1 copies A 2 x 1 (t) = P 2 copies A 1 ;2N 2 copies A 2 x 2N (t) = P 2N copies A 1 ;0 copies A 2 : The Wright-Fisher model produces successive generations with a 2-step process (Figure 2.10). The model first creates N pairs of parents selected randomly and with replacement from the population. Then each pair produces a single ospring with its genotype inherited by selecting one gene from each parent. All parents die after mating. 
The allele distribution x(t) is a Markov chain that advances by random sampling with replacement from the pool of parent genes (Figure 2.11) [218, 359]. The density of alleles evolves according to a binomial probability density with

$$P\left[x(t+1) = j \mid x(t) = i\right] = \text{Bin}\left(j;\, 2N,\, \frac{i}{2N}\right) \tag{2.64}$$

Figure 2.9: Noise benefits in Wright-Fisher population genetics. [Panels (a) and (b) plot state probabilities against the generation t, comparing the no-noise, optimal-noise, suboptimal-noise, and blind-noise curves.] The figures show that the Wright-Fisher model benefits from noise in two ways: (1) noise moves the simulation toward the steady-state value faster than simulations without noise and (2) noise eliminates the asymptotic crawl toward the steady-state value. The figures represent allele distributions over 500 reproductive generations. The simulation models the distribution of a diallelic gene (A_1 and A_2) in a population with N = 50 diploid individuals (2N = 100 gene copies). The simulation initialized with X_0 ∼ N(50, 1) – a normal distribution centered at a population with equal numbers of A_1 and A_2. The symmetry of X_0 implies that the steady-state population will move toward either homozygous coalescent state (A_1A_1 or A_2A_2) with equal probability. The blue (dotted) curve plots the standard (no noise) Wright-Fisher model. The green (dashed), red (dash-dot), and pink (solid) curves show noisy versions of the model. The green (dashed) curve shows a simulation using the optimal noise N_opt prescribed by the noise benefit theorem and the red (dash-dot) curve shows the result by choosing suboptimal noise uniformly in [0, N_opt]. The pink (solid) curve shows that the noisy algorithm can benefit a Markov chain even if it cannot determine the steady-state values that the theorem assumes.
(a) The probability for a homozygous A_1 (A_1A_1) population. The initial distribution implies a near-zero probability for a pure A_1A_1 population but rapidly increases to its steady-state value of 1/2. The noise-enhanced simulations (red and green) approach the steady-state value faster than the standard model (blue) and also reach the asymptotic value before settling. (b) The probability for a population with 25 copies of allele A_1 and 75 copies of allele A_2. The noise-enhanced simulations (red and green) approach the steady-state value faster than the standard model (blue) and also reach the asymptotic value before settling.

Thus the Markov chain transition matrix has elements [360]:

$$P_{i,j} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}.$$

Figure 2.10: Wright-Fisher mating. The figure illustrates how the Wright-Fisher model produces successive generations. Each doublet in the first row represents the genotype of a diploid individual from a population N = 4. Each organism possesses a pair of alleles (blue = A_1 and red = A_2). The two middle rows show how the model randomly pairs individuals with replacement to form mates. Each doublet in the last row represents an offspring. The offspring inherit one allele (A_1 or A_2) randomly from each parent. Then the simulation "kills" the t = n population and the offspring become the new t = n + 1 population.

Figure 2.11 also demonstrates how the allele distribution x(t) converges to the steady state. x(t) converges with probability one to either of the homozygous populations – either (A_1, A_1) or (A_2, A_2) [26]. This convergence is much stronger than the convergence in distribution found in the Ehrenfest diffusion model.

The Wright-Fisher simulation used a population of N = 50 diploid individuals. The simulation tracked the allele distribution of a diallelic gene: A_1 and A_2.
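The binomial transition matrix (2.64) is cheap to build exactly. The sketch below is an illustration using NumPy and math.comb; the conservation check at the end is a standard property of the binomial law (the expected allele count is a martingale), not a claim from the text.

```python
import numpy as np
from math import comb

def wright_fisher_matrix(N):
    """(2N+1)-state transition matrix of (2.64): the next generation's
    A1 count is a Binomial(2N, i / 2N) draw given the current count i."""
    M = 2 * N                              # gene copies per generation
    P = np.zeros((M + 1, M + 1))
    for i in range(M + 1):
        p = i / M                          # current frequency of A1
        for j in range(M + 1):
            P[i, j] = comb(M, j) * p**j * (1 - p)**(M - j)
    return P

P = wright_fisher_matrix(50)
counts = np.arange(101)
assert np.allclose(P.sum(axis=1), 1.0)         # proper stochastic matrix
assert P[0, 0] == 1.0 and P[100, 100] == 1.0   # fixation states absorb
assert np.allclose(P @ counts, counts)         # expected count conserved
```

The two absorbing rows confirm that the chain converges with probability one to a homozygous population, as the text notes.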
It initialized the allele distribution x(0) according to a normal distribution with a mean of 50 copies of A_1 and 50 copies of A_2. This initial distribution represents imperfect information about the population's initial genotypic makeup. The simulation evolved four separate copies of the initial population following the Fisher-Wright procedure: (1) standard (no noise), (2) applying Algorithm 2.1 – adding optimal noise N_opt at each iteration as prescribed by the theorem, (3) adding suboptimal noise uniformly chosen from [0, N_opt], and (4) applying Algorithm 2.2. Each copy ran for 500 generations.

Figure 2.11: Markov dynamics of a Wright-Fisher genotype. Each of the 6 circles for t = 0 represents an allele for a particular gene (blue = A_1 and red = A_2). The Wright-Fisher model generates the t = 1 offspring by randomly sampling the t = 0 population with replacement. The connections indicate the surviving genes and their offspring. The A_1 allele becomes extinct by the fourth generation in this example. The steady state for this example is homozygous (A_2, A_2) because future generations can no longer inherit the extinct A_1 gene.

Figure 2.9 shows two modes of noise benefit in the Wright-Fisher simulation: (1) noise shifts the over-damped system (damping ratio ζ > 1) into a near critically-damped regime (ζ ≈ 1) and (2) noise speeds the asymptotic approach toward the steady-state distribution. Each plot in the figure represents the estimate of the probability for a single genotypic distribution: Figure 2.9.(a) shows P[100 copies A_1, 0 copies A_2] and Figure 2.9.(b) shows P[50 copies A_1, 50 copies A_2] during the 500-step simulation.

The population will be homozygous at steady state: either (1) (A_1, A_1) or (2) (A_2, A_2) based on the stochastic dynamics of the system. This particular simulation shows that P[steady state = (A_1, A_1)] = 0.5 = P[steady state = (A_2, A_2)].
This is the expected result because of the symmetric initial uncertainty for A_1 and A_2.

Figure 2.9.(a) shows that Algorithm 2.2 can introduce oscillations in the density estimate. The oscillations have a short-lived effect in this simulation. The ringing quickly dies down and the estimate settles to the theoretical limit of 0.5. The simulations in the other sections do not show this ringing artifact. We do not know if this ringing artifact arises from some relation between the state transition probabilities, the number of states, or some other condition unique to this model.

Figure 2.9 also shows that even non-optimal noise can benefit the simulation dynamics. The probability of the homozygous state in Figure 2.9.(a) is one of the two distributions with non-zero steady-state probability: P[steady state = (A_1, A_1)] ≠ 0. The suboptimal-noise simulation (red curve) shows benefits similar to the optimal-noise simulation (green curve) since the traces of the two simulations resemble each other. This also appears to be an artifact of some special condition in this model.

2.6.3 Noise Benefits in a Chemical Reaction Model

The third simulation shows a noise benefit in a zeolite crystallization model. Figure 2.12 shows a benefit in a 6-state chemical network simulation. The simulation extended an earlier study [147, 85] that investigated a proposed crystallization process for natural zeolite [70]. The figure shows that the components approach their steady-state values (within 0.5% of steady state) 18.1% faster on average with added noise. Thus the noise benefit extends generally to a large domain of problems that employ observed transition matrices. We rarely deal with a pure Markov process in practice. We are even less likely to have complete knowledge of the state transition matrix.
Researchers that model complex processes often estimate the transition matrix with approximate conditional transition probabilities calculated from a series of observations [72, 372, 224, 67, 325, 291, 282, 155].

Zeolites are a class of aluminosilicates that form naturally under geologic conditions [61, 375]. Geologists have identified 40 naturally occurring zeolite frameworks. Chemists have synthesized over 175 unique varieties [12, 112].

Figure 2.12: Noise benefits in an empirical chemical network Markov model. [Panels (a)-(c) plot the probabilities of the dimer, pentamer, and zeolite states against time t, comparing the no-noise, optimal-noise, suboptimal-noise, and blind-noise curves.] The figures show that noise enhances a crystallization model for zeolite resulting from a complex chemical process. The benefit appears in the shift to the left of the noise-enhanced simulations over the standard model. This indicates that the simulations reach steady state sooner. This simulation shows that the predicted noise benefit may be small but the benefit will exist for any simulation that has not converged. These curves represent concentrations of three species involved in a hypothetical zeolite synthesis over time. The blue (dotted) curve plots the standard (no noise) Markov chemical reaction model. The green (dashed), red (dash-dot), and pink (solid) curves show noisy versions of the model. The green (dashed) curve shows a simulation using the optimal noise N_opt in accord with the Markov Chain Noise Benefit Theorem. The red (dash-dot) curve shows the result when the simulation added suboptimal noise drawn uniformly from [0, N_opt]. The pink (solid) curve shows the results of Algorithm 2.2. Over time the zeolite concentrations dominate the other species.

Zeolites find uses in
These include water purification [68, 353, 10], detergents [1, 370], catalysis [361, 333, 286, 25], and nuclear reprocessing [93, 92, 109]. 46 The exact natural hydrothermal synthesis of many zeolites is not known [126, 220, 125, 124]. Researchers have employed Markov models to predict properties from the complex chemistry involved in their formation [154, 102, 368]. Geologists constructed an observed transition matrix in one such model based on 29 Si concentration profiles during a formation experiment [85]. They then determined rate constants, equilibrium constants, and free energies for elementary zeolite-forming reactions for a hypotheti- cal zeolite-formation process using Markov simulations with the estimated transition matrix and initial species concentrations. Our simulations show that noise benefits such a Markov model. Hawkins [147] empirically found the following state transition probability matrix for 6 silica oligomers from aggregate NMR data using weighted least squares: P = (2.65) 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 0:9274 0:0700 0:0025 8 10 5 10 5 10 5 0:0500 0:8395 0:1000 0:0100 0:0004 0:0001 0:0600 0:0600 0:8495 0:0300 0:0004 0:0001 0:0500 0:0100 0:0400 0:5400 0:0600 0:3000 0:0500 0:0200 0:0200 0:0500 0:8595 0:0005 0:0001 0:0001 0:0001 0:0001 0:0001 0:99953 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 corresponding to steady-state probability density x 1 : x 1 = (2.66) h 0:026 0:017 0:013 0:002 0:002 0:942 i : Figure 2.13 summarizes the principal reaction pathways. We used the experimental 29 Si NMR data reported earlier [147] to initialize the species concentrations to x 0 = h 0:430 0:260 0:220 0:060 0:030 0:000 i and advanced the Markov chain to simulate the crystallization of zeolite. Our simulations show that noise benefits the empirical estimation but the observed benefit was small. 
Figure 2.13: Zeolite reaction scheme of Hawkins [147]. [The original diagram draws the chemical structures of orthosilicic acid, disilicic acid, pyrosilicic acid, the cyclic tetramer, the cyclic pentamer, and zeolite.] Simulations show that noise speeds the convergence of this model to its steady-state concentrations. The model synthesizes zeolite from 5 silicate oligomers. The reaction arrows show the dominant model pathways. The state transition matrix (2.65) defines each pathway as a Markov transition probability from one species to another during one time step. The vector (2.66) lists the steady-state concentrations of the 6 reactants. The system saturates with zeolite because the model lacks strong pathways that consume zeolite.

The performance metric showed a strong benefit of 18.1% despite some states experiencing only minimal noise benefits (Figures 2.12.(a) and 2.12.(c)). This is because the noise quickly moved a few components to their steady-state value (Figure 2.12.(b)). The Markov Chain Noise Benefit Theorem could not provide additional benefit to the system after this initial boost because the theorem relies on the magnitude of the component closest to its steady-state value. Several components converged within a few time steps. So the theorem-based noise added only small corrections to the states for the rest of the simulation. This shows that the theorem confers a larger benefit on systems with states that converge at approximately the same rate. But other Markov systems still receive some noise benefit.

2.7 Markov Chain Noise Benefit Theorem Simulation

Two final simulations show the noise benefit that exists for Markov chain simulations. The simulations show how the Markov Chain Noise Benefit Theorem might speed convergence in modern algorithms such as the Google PageRank™ link analysis algorithm [265, 134, 133].
The PageRank algorithm constructs a probability density that represents the likelihood that a person randomly clicking on links will arrive at a particular page over all indexed pages on the Internet [34]. The algorithm operates on a dataset called the Google matrix. This matrix is equivalent to a Markov state transition matrix spanning tens of billions of dimensions [223, 173]. The Noise Benefit Theorem shows that the algorithm should benefit from noise.

2.7.1 One-step Markov Chain Simulation

The first simulation shows that a Markov chain can benefit from additive noise (Figure 2.14). The simulation shows the benefit after one time step as a decrease of the absolute error between the posterior state density and the stationary state density. Tables 2.2 and 2.3 show a large decrease in the absolute error in the noisy simulations compared to the no-noise simulations.

Table 2.2 summarizes the one-step experiment with and without noise. The simulation classified the states as satisfying either the condition of the Markov Chain Noise Benefit Theorem $\Delta_i = \left((x - x_\infty)P\right)_i > 0$ or of the Corollary $\Delta_i = \left((x - x_\infty)P\right)_i < 0$. It set the noise strength to $A = \min\{a_i\}$ for each class in accord with the theorem. This gave A = 0.0682 for the states with $\Delta_i > 0$ and A = −0.1594 for the states with $\Delta_i < 0$. The simulation calculated the total absolute error for each class using the respective values for the noise strength A.

                     States satisfying   States satisfying
                     the MC Theorem      the Corollary
no noise                 0.1547              0.1547
with noise               0.0547              0.0724
error decrease (%)       64.64%              53.20%

Table 2.2: Noise benefits in one-step Markov chain simulation.
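The one-step benefit is easy to reproduce in a few lines. The sketch below is illustrative (the seed, the choice of state i and noise component k, and the noise strength a are assumptions): it picks a state with Δ_i > 0, a component k with P_{k,i} below the current mass (xP)_i, and confirms that a small noise strength shrinks that state's absolute error, exactly as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((6, 6))
P /= P.sum(axis=1, keepdims=True)          # random 6-state chain

w, V = np.linalg.eig(P.T)                  # stationary density x_inf
x_inf = np.real(V[:, np.argmax(np.real(w))])
x_inf /= x_inf.sum()

x = np.full(6, 1 / 6)                      # uniform prior, away from x_inf
delta = (x - x_inf) @ P                    # signed one-step errors
i = int(np.argmax(delta))                  # a state with Delta_i > 0
k = int(np.argmin(P[:, i]))                # row with small P[k, i] so that
                                           # (xP)_i - P[k, i] > 0 holds
a = min(delta[i], 0.01)                    # keep a below the theorem bound

n = np.zeros(6)
n[k] = a
x_tilde = (x + n) / (1.0 + a)

plain_err = abs((x @ P - x_inf)[i])
noisy_err = abs((x_tilde @ P - x_inf)[i])
assert noisy_err < plain_err               # single-step noise benefit
```

Choosing a no larger than Δ_i keeps the correction inside the interval (0, 2Δ_i) of the proof, so the error must strictly decrease.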
Figure 2.14: Noise benefits in Markov chain density estimation. [Panels (a) and (b) plot the absolute error $\left|\left((x+n)P - x_\infty\right)_i\right|$ against the noise strength $a_i$ for each state, with the noise benefit region marked; panel (a) shows states 2, 3, and 5 and panel (b) shows states 1, 4, and 6.] These figures show the relation between the error magnitude of each Markov state and the noise strength $a_i$. The simulation used a 6-state Markov chain and the figure shows the single-step absolute errors by state. Each of the six states satisfied either (a) the Markov Chain Noise Benefit Theorem: $\left((x - x_\infty)P\right)_i > 0$ or (b) the Corollary: $\left((x - x_\infty)P\right)_i < 0$. (a) Three states satisfy the inequality $\left((x - x_\infty)P\right)_i > 0$ in this simulation. Each curve represents the absolute error $\left|\left(\tilde{x}P - x_\infty\right)_i\right|$ of the $i$-th state as $a_i$ increases. The standard zero-noise condition corresponds to $a_i = 0$. Each state has an optimal noise level $A_i$ indicated by the point where the curve meets the $a_i$-axis. The optimal noise $A_i$ will exactly drive the state to its stationary value. The Markov Chain Noise Benefit Theorem first shows that the benefit exists for all $a_i < A_i$. The theorem also guarantees the existence of a global $A = \min\{A_i\} > 0$ such that any noise $a < A$ benefits every state that satisfies the inequality. All curves decrease (strictly) monotonically until they reach $A_i$. Thus any point between the no-noise condition and $A_i$ shows some benefit and $A = \min\{A_i\}$ satisfies this constraint for each such state. (b) Three states satisfy the alternative inequality $\left((x - x_\infty)P\right)_i < 0$. These correspond to the states that satisfy the Corollary. The Corollary ensures a point A so that any noise strength less than A benefits every such state.

Table 2.3 summarizes a simulation with and without noise that does not have access to the signs of the inequality.
It shows that a noise benefit exists even if the simulation cannot classify individual states according to $\Delta_i > 0$ or $\Delta_i < 0$. The table summarizes the relative improvement over all N = 6 states when setting the noise strength to $A = \mathrm{sign}(a_i)\min(|a_i|)$. This gave A = min(0.0682, 0.1594) = 0.0682. The simulation calculated the total absolute error for the posterior state density using this value of A.

no noise               0.3093
with noise             0.2370
error decrease (%)     23.38%

Table 2.3: Noise benefits in one-step Markov chain simulation – unknown error sign

The Markov Chain Noise Benefit Theorem ensures that there exists a noise distribution that reduces the state error. Figure 2.14 illustrates this because it shows that the error decreases as the noise strength increases from zero. The theorem and Corollary also establish that past some noise strength (A > 0) the error will increase. Thus properly signed noise with magnitude less than A guarantees that the absolute error will be lower in the noisy simulation than in the no-noise simulation.

Figure 2.14 shows an example where three of the N = 6 states obey the inequality (2.13) in the main theorem and the remaining three states obey the inequality (2.53) in the Corollary. Not all transition matrices P have this even splitting. But any given matrix will have at least one state that satisfies each case since the sum of the signed errors must equal 0.

The simulation generated a Markov chain from a fixed random transition matrix where $\hat{P}_{i,j} = P\left[X_{k+1} = j \mid X_k = i\right] \sim U(0,1)$. The simulation used N = 6 states. The theorem and Corollary guarantee the benefit for transition matrices with any finite dimension. But uniformly chosen transition matrices tend to generate a uniform stationary density:

$$(x_\infty)_j = \frac{1}{N}. \tag{2.67}$$

We transformed each transition probability by $U(0,1)/(U(0,1) + \varepsilon)$ to construct a network of states with non-uniform importance. We chose $\varepsilon > 0.04$ to avoid numerical instability.
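The ratio-of-uniforms construction just described might be coded as follows. This is a sketch (the value eps = 0.05, the seed, and the power-iteration check are illustrative assumptions):

```python
import numpy as np

def random_transition_matrix(N, eps=0.05, seed=1):
    """Ratio-of-uniforms construction: entries U/(V + eps) give states of
    non-uniform importance; eps > 0.04 guards against huge ratios."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(N, N))
    V = rng.uniform(size=(N, N))
    P_hat = U / (V + eps)                            # unnormalized matrix
    return P_hat / P_hat.sum(axis=1, keepdims=True)  # normalize the rows

P = random_transition_matrix(6)
x = np.full(6, 1 / 6)            # uninformed uniform prior over the states
for _ in range(5_000):           # power-iterate to the stationary density
    x = x @ P
assert np.allclose(x @ P, x)     # x is now (numerically) stationary
```

Row normalization makes the result a proper stochastic matrix, while the ratio step spreads the stationary mass unevenly across the states.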
This gives a transition matrix

$$\hat{P}_{i,j} = \frac{U_{i,j}}{V_{i,j} + \varepsilon} \tag{2.68}$$

where $U_{i,j} \sim U(0,1)$ and $V_{i,j} \sim U(0,1)$. We normalized the rows of $\hat{P}$ to form a proper stochastic matrix:

$$P_{i,j} = \frac{\hat{P}_{i,j}}{\sum_{k} \hat{P}_{i,k}}. \tag{2.69}$$

We chose the initial state density x as the uninformed prior [139] (uniform distribution) over the 6 states so that

$$x_j = \frac{1}{N}. \tag{2.70}$$

We used Matlab R2009b to perform the simulations with transition matrix P:

$$P = \begin{bmatrix}
0.038 & 0.040 & 0.077 & 0.070 & 0.065 & 0.710 \\
0.017 & 0.109 & 0.140 & 0.128 & 0.234 & 0.372 \\
0.014 & 0.022 & 0.062 & 0.174 & 0.005 & 0.723 \\
0.027 & 0.053 & 0.068 & 0.184 & 0.058 & 0.611 \\
0.071 & 0.075 & 0.015 & 0.132 & 0.011 & 0.696 \\
0.181 & 0.177 & 0.484 & 0.017 & 0.068 & 0.073
\end{bmatrix}$$

corresponding to the steady-state probability density

$$x_\infty = \begin{bmatrix} 0.089 & 0.102 & 0.241 & 0.094 & 0.065 & 0.408 \end{bmatrix}.$$

2.7.2 Two-step Markov Chain Simulation

The second simulation shows that the noise benefits in the one-step simulation extend over successive time steps (Figure 2.15). We measure the benefit as a decrease in the absolute error between the posterior state density and the stationary state density. The simulation also shows that even suboptimal noise in one time step can still benefit successive steps. The proof guarantees that there exists a noise density that will reduce the error over multiple time steps.

Figure 2.15: Multi-cycle noise benefits in Markov chain density estimation. [The surface plot shows the absolute error $\left|\left((x+n)P - x_\infty\right)\right|$ against the first-step and second-step noise strengths $a_i$.] This figure shows that the noise benefits apply for successive Markov steps. It further shows that even suboptimal noise in one iteration can still benefit successive steps.
The simulation evaluated the deciding inequalities (2.13) and (2.53) for a single state at two successive time steps and used only the sign ($+$ or $-$) to determine the direction of beneficial noise for the state. The plot shows the relation between the state's absolute error and the noise magnitude during the "first step" and "second step" (with the appropriate sign). The origin of the $a_i^{(\text{step }1)}$ and $a_i^{(\text{step }2)}$ axes corresponds to a zero-noise 2-step Markov chain. The optimal noise corresponds to $a_i^{(\text{step }1)} \approx 0.165$ during the first step. Then there is a strictly positive value for $a_i^{(\text{step }2)}$ that yields a lower error than the zero-noise case even if the system applies suboptimal noise during the first step, such as $a_i = 0.10$.

We generated a transition matrix $P$ using the same procedure as in the one-step simulation (2.69):

    P = [ 0.147  0.013  0.051  0.667  0.062  0.061
          0.158  0.030  0.088  0.622  0.012  0.090
          0.078  0.061  0.095  0.582  0.077  0.108
          0.138  0.106  0.055  0.565  0.039  0.098
          0.171  0.085  0.213  0.085  0.170  0.276
          0.048  0.028  0.070  0.804  0.030  0.020 ]

This corresponds to the steady-state probability density

    x^{\infty} = [ 0.129  0.077  0.068  0.582  0.048  0.094 ]

2.8 Conclusion

We have shown that noise can benefit Markov chain estimation by speeding up the convergence time if the algorithm can calculate the sign of the state error. We have also shown how a simulation can use estimates of the error magnitude to update its current estimate of the underlying state density. Simulations confirm that noise can benefit a single-step or multi-step system even if the system has insufficient information to determine the optimal noise. Versions of the Markov Chain Noise Benefit Theorem may well hold for weaker assumptions and other Markov chain models.
An open question is whether the results hold for noise-perturbed Markov transition matrices instead of noise-perturbed state densities. This may apply to simulations with noisy estimates of the transition matrix or to simulations with transition matrix estimates based on only a few observations. This might also apply to specific MCMC algorithms under suitable assumptions. Adaptive algorithms may be able to find optimal noise amounts in many of these cases.

Chapter 3
Noisy Markov Chain Monte Carlo (N-MCMC)

This chapter shows how carefully injected noise can speed the convergence of Markov chain Monte Carlo (MCMC) estimation. The major result is the noise-injected version of the Metropolis-Hastings algorithm: the Noisy Metropolis-Hastings (N-MH) algorithm. The N-MH algorithm uses noise to bring each Markov step closer on average to the equilibrium pdf. This in turn reduces the burn-in time of the algorithm. The N-MCMC theorem proves that the noisy Markov chain is closer on average to the equilibrium density than in the noiseless case subject to a condition on the injected noise. A corollary describes an alternate form of the condition that may be easier to verify in some settings. Two corollaries present special cases of the theorem for Gaussian and Cauchy target densities. A final N-MCMC corollary shows that MCMC can also realize a noise benefit under multiplicative noise.

3.1 Markov Chain Monte Carlo

MCMC methods pose a statistical solution to the following problem: how does one efficiently search a complex high-dimensional space? MCMC algorithms generalize the concept of search by assuming that not all points in the sample space are equal. The space contains low-probability valleys punctuated by high-probability peaks. MCMC navigates the search space in a probabilistic way. MCMC comes equipped with the promise that "all roads lead to Rome".
In high dimensions exact solutions to these problems become intractable because it is often not possible to state the shape of the probability terrain explicitly. Monte Carlo methods tackle problems by sampling from a space to construct iterative improvements to a probabilistic estimate. Monte Carlo usually employs one of two sampling methods: i.i.d. sampling (i.i.d. MC) or dependent sampling (MCMC). I.i.d. sampling uses a naive approach of uniform draws from the entire space. MCMC sampling methods use carefully constructed random walks to explore the probability surface. MCMC algorithms use random jumps to ensure that they fuse local knowledge with novel information from new regions of the space. MCMC often works well for problems of high dimension. It is a natural algorithm to apply to many big-data problems since big data can both create and exaggerate "big dimension." The next section describes how Bayesian inference appeals to MCMC to reduce the computational burden associated with these methods.

3.1.1 Introduction

A recent review ranks the Metropolis algorithm among the ten most influential algorithms of the 20th century [19]. The Metropolis algorithm is a special case from a large body of Markov chain Monte Carlo (MCMC) sampling algorithms. These methods find significant application in the fields of statistical physics, chemistry, pharmacology, optimization, decision theory, Bayesian inference, particle systems, and finance. For many high-dimensional problems MCMC is the only efficient method to provide a solution within reasonable time [94, 166]. A few examples of major MCMC application areas include [4]:

1. Bayesian inference and learning encompass three major subclasses of problems. The Bayesian model assumes one has data $y \in Y$ and unknowns $x \in X$.

(a) Normalize the posterior. Bayes theorem requires the full normalizing factor for the posterior

p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\int_X p(y \mid x)\, p(x)\, dx} (3.1)

given the prior $p(x)$ and likelihood $p(y \mid x)$.
(b) Marginalize the joint pdf. Integrate to obtain the marginal posteriors

p(x \mid y) = \int_Z p(x, z \mid y)\, dz (3.2)

given the joint posterior $p(x, z \mid y)$.

(c) Compute summary statistics. These usually take the form of expectations

E_{p(x|y)}[f] = \int_X f(x)\, p(x \mid y)\, dx (3.3)

for some function $f : X \to \mathbb{R}^{n_f}$ integrable with respect to the posterior $p(x \mid y)$.

2. Statistical mechanics focuses on computing the partition function $Z$ of a system with states $s$:

Z = \sum_{s} \exp\!\left[ -\frac{E(s)}{kT} \right] (3.4)

where $E(s)$ is the energy Hamiltonian, $k$ is Boltzmann's constant, and $T$ is the system temperature. The high dimensionality of the state space $S$ generally makes this impossible outside of sampling estimates.

3. Optimization. The goal is to identify the state that minimizes a cost function among the entirety of the space.

Monte Carlo (MC) methods emerged from Los Alamos, New Mexico following World War II [95, 153, 295]. Stan Ulam hypothesized that a Monte Carlo sampling algorithm could find the probability of winning a 52-card solitaire game. Ulam described the idea to John von Neumann. The collaboration coincided with the parallel development of the first general-purpose computer, ENIAC, at the University of Pennsylvania in 1946. John Mauchly and J. Presper Eckert designed and developed ENIAC to compute artillery firing tables for the United States Army's Ballistic Research Laboratory. John von Neumann saw the utility of ENIAC in carrying out the massive computations in the MC sampling algorithms. He set the machine to work on computations for thermonuclear fusion as early as 1947.

The Metropolis Algorithm and Hastings Generalization

Nick Metropolis directed the development of MANIAC (Mathematical Analyzer, Integrator, and Computer). He designed the MANIAC according to von Neumann principles [39]. The Metropolis algorithm [239] is the first MCMC algorithm.
The primary focus of the original Metropolis paper is to compute integrals on $\mathbb{R}^{2N}$ of the form

\bar{F} = \frac{\int F(\theta)\, \exp[-E(\theta)/kT]\, d\theta}{\int \exp[-E(\theta)/kT]\, d\theta} , (3.5)

where $\theta$ denotes a set of $N$ particles in the plane $\mathbb{R}^2$ and the energy $E$ is defined by

E(\theta) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} V(d_{ij}) (3.6)

where $V$ is a potential function between two particles $i$ and $j$ in $\theta$ with Euclidean distance $d_{ij}$. The temperature $T$ parameterizes the Boltzmann distribution $\exp[-E(\theta)/kT]$ with normalization factor

Z(T) = \int \exp[-E(\theta)/kT]\, d\theta (3.7)

where $k$ is the Boltzmann constant. Computing these integrals is generally not possible due to the high-dimensional ($2N$) space. Standard Monte Carlo methods also fail to compute these integrals in most instances because $\exp[-E(\theta)/kT]$ is vanishingly small for most realizations. This results in round-off errors that corrupt the entire estimate $\bar{F}$.

The 1953 paper proposes a random walk method to sample the configuration space of the $N$ particles more efficiently. A step in the walk proceeds by perturbing a single particle in space according to

x'_i = x_i + \xi_{x,i} (3.8)
y'_i = y_i + \xi_{y,i} (3.9)

where $\xi_{x,i}$ and $\xi_{y,i}$ are symmetric uniform random samples drawn from $U(-1,1)$. The method then computes the energy difference $\Delta E$ between the previous step and the new step. The walk accepts the new configuration with probability

\alpha = \min\left(1, \exp[-\Delta E/kT]\right) . (3.10)

The walk remains in the previous state if it rejects the candidate state. The paper proceeds to show that the walk is irreducible and that it is ergodic. The analysis works under the simplification of a discretized space to show that each step is reversible and finally that $\exp(-E/kT)$ is invariant on the density. The following summarizes the original Metropolis algorithm.

1. Choose an initial $x_0$ with $f(x_0) > 0$.

2. Generate a candidate $x^*_{t+1}$ by sampling from the jump distribution $Q(y \mid x_t)$. The jump pdf must be symmetric: $Q(y \mid x_t) = Q(x_t \mid y)$.

3. Calculate the density ratio for $x^*_{t+1}$:

\alpha = \frac{p(x^*_{t+1})}{p(x_t)} = \frac{f(x^*_{t+1})}{f(x_t)} .
Note that the normalizing constant $K$ cancels.

4. Accept the candidate point ($x_{t+1} = x^*_{t+1}$) if the jump increases the probability ($\alpha > 1$). Also accept the candidate point with probability $\alpha$ if the jump decreases the probability. Else reject the jump ($x_{t+1} = x_t$) and return to step 2.

Hastings generalized the Metropolis algorithm as a tool to overcome the curse of dimensionality in standard Monte Carlo methods [146]. His extension introduces a modified acceptance probability

\alpha_{ij} = \frac{s_{ij}}{1 + \dfrac{\pi_i\, q_{ij}}{\pi_j\, q_{ji}}} (3.11)

where $\pi_i$ is the target density and $q_{ij}$ the proposal density. Hastings' generalization thus removes the symmetry constraint on the proposal density $q$ by using the reverse transition probability to induce reversibility.

3.1.2 Monte Carlo

MCMC is a specialization of standard Monte Carlo. Monte Carlo (MC) simulation draws i.i.d. samples from a target density $p(x)$ to estimate complex calculations. The idea is to use the set of i.i.d. samples $x_1, x_2, \ldots, x_N$ to approximate the target density with an empirical probability mass function (pmf). Suppose we wish to calculate

\int_a^b h(x)\, dx . (3.12)

The goal of MC is to deconstruct the integral into an expectation calculation. Decomposing $h(x)$ into the product of a function $f(x)$ and a probability density function $p(x)$ is a common method:

\int_a^b h(x)\, dx = \int_a^b f(x)\, p(x)\, dx = E[f(x)] (3.13)

Samples $(x_1, \ldots, x_n)$ from $p(x)$ then approximate the integral by:

\int_a^b h(x)\, dx = E[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i) (3.14)

MC employs conditional pdfs when calculating posterior distributions in Bayesian analysis. To compute the integral $I(y) = \int f(y \mid x)\, p(x)\, dx$ simulations can instead sample from $p(x)$ and then evaluate

\hat{I}(y) = \frac{1}{n} \sum_{i=1}^{n} f(y \mid x_i) (3.15)

The MC Ergodic Averaging theorem shows that the strong law of large numbers applies to functions of MCMC samples under weak assumptions.

Theorem 3.1 (MC Ergodic Averaging Theorem).
Suppose $X_t$ is a Markov chain with a unique stationary distribution $\pi$ and suppose $f$ is a real-valued function such that $\int |f|\, d\pi < \infty$. Then for any initial value $X_0$ with $P[X_0] > 0$

\frac{1}{N} \sum_{i=1}^{N} f(X_i) \to \int f\, d\pi (3.16)

almost surely.

The following theorem shows that the central limit theorem applies to the standardized sample mean of $f(X_i)$.

Theorem 3.2 (MC Central Limit Theorem). Suppose $X_t$ is a Markov chain with a unique stationary distribution $\pi$ and suppose $f$ is a real-valued function such that $\int |f|\, d\pi < \infty$. Then there exists a real number $\sigma(f)$ such that

\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^{N} f(X_i) - \int f\, d\pi \right) \xrightarrow{d} N\!\left(0, \sigma(f)^2\right) (3.17)

for any starting value $X_0$ with $P[X_0] > 0$.

The following section presents two widely applied Monte Carlo sampling methods: accept-reject sampling and importance sampling.

Monte Carlo Accept-Reject Sampling

Accept-reject sampling generates samples from a target probability density $f(x)$ [296, 28]. Sampling from $f(x)$ may be difficult without a closed-form inverse cdf $F^{-1}(x)$. The accept-reject algorithm introduces a proxy distribution $g(x)$ that is more amenable to sampling. The algorithm requires that the instrumental distribution $g(x)$ dominate $f(x)$: $c\,g(x) \geq f(x)$ for all $x$ and some constant $c > 1$. Accept-reject sampling generates candidate samples $Y_i$ from $g(x)$. It accepts or rejects each $Y_i$ depending on a threshold set by the ratio of pdfs $f(Y_i)/(c\,g(Y_i))$.

The accept-reject algorithm falls into the broad class of Monte Carlo statistical techniques. The algorithm is a limiting version of Markov chain Monte Carlo under uniform i.i.d. candidate sampling. Accept-reject applies under very weak conditions. But the algorithm can suffer from a low acceptance rate in naive applications.

Suppose we want to generate samples from a random variable $X$ with pdf $f(x)$. Accept-reject sampling generates $X_i$ by:

1. Generate $Y \sim G$.

2. Generate $U \sim \text{Uniform}(0,1)$.

3. Set $X_i = Y$ if

U \leq \frac{f(Y)}{c\,g(Y)} (3.18)

else return to step 1.
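The three steps above translate directly into code. The sketch below is our own toy illustration: the target is $f(x) = 3x^2$ on $[0,1]$, the proposal is uniform, and $c = 3$ gives the required domination $f(x) \le c\,g(x)$.

```python
import random

def accept_reject(f, g_sample, g_pdf, c, n, rng):
    """Generate n samples from pdf f using proposal g with f <= c*g."""
    samples, draws = [], 0
    while len(samples) < n:
        y = g_sample(rng)          # step 1: candidate from the proposal G
        draws += 1
        u = rng.random()           # step 2: uniform threshold variable
        if u <= f(y) / (c * g_pdf(y)):
            samples.append(y)      # step 3: accept, else loop back
    return samples, draws

# Toy target f(x) = 3x^2 on [0,1]; uniform proposal g(x) = 1, so c = 3 dominates.
rng = random.Random(0)
f = lambda x: 3.0 * x * x
samples, draws = accept_reject(f, lambda r: r.random(), lambda x: 1.0, 3.0, 20000, rng)
mean = sum(samples) / len(samples)   # E[X] = 3/4 for this target
rate = len(samples) / draws          # expected acceptance rate is 1/c = 1/3
```

The acceptance rate hovers near $1/c = 1/3$ here, which previews the $p = 1/c$ calculation that follows.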
$f(Y)$ and $g(Y)$ are both random variables and so the ratio $f(Y)/(c\,g(Y))$ is also a random variable. Moreover

0 \leq \frac{f(Y)}{c\,g(Y)} \leq 1 (3.19)

since $f(y) \leq c\,g(y)$ for all $y$. The ratio is also independent of $U$. Therefore the number of times $N$ that the algorithm repeats before it accepts $X$ is itself a random variable with a geometric distribution having success probability $p = P[U \leq f(Y)/(c\,g(Y))]$. Thus $P[N = n] = (1-p)^{n-1} p$ for $n \geq 1$ and $E[N] = 1/p$. Direct calculation by conditioning on $Y = y$ shows that $p = 1/c$:

p = \int_{-\infty}^{\infty} P\!\left[ U \leq \frac{f(y)}{c\,g(y)} \,\Big|\, Y = y \right] g(y)\, dy (3.20)
 = \int_{-\infty}^{\infty} \frac{f(y)}{c\,g(y)}\, g(y)\, dy (3.21)
 = \frac{1}{c} \int_{-\infty}^{\infty} f(y)\, dy (3.22)
 = \frac{1}{c} . (3.23)

Equation (3.23) shows that $P[\text{Accept } Y] = p = 1/c$. This bears noting since it says that the expected number of iterations before the algorithm accepts a sample is exactly the bounding constant $c \geq \sup_x \{f(x)/g(x)\}$. Thus the algorithm discards $c-1$ of every $c$ samples on average.

A short proof of the accept-reject algorithm follows. It expands a conditional pdf and checks that $X \sim F$.

Theorem 3.3 (Accept-Reject Algorithm). Suppose $Y$ is a random variable with pdf $g(y)$ and cdf $G(y)$. Let $U \sim \text{Uniform}(0,1)$ and suppose $U$ and $Y$ are independent. Let $f(x)$ denote the pdf and $F(x)$ the cdf of a random variable $X$. Suppose $f(x) \leq c\,g(x)$ for all $x \in \mathbb{R}$ and some constant $c > 1$. Then

P\!\left[ Y \leq y \,\Big|\, U \leq \frac{f(Y)}{c\,g(Y)} \right] = F(y) (3.24)

Proof. Define $B = \{Y \leq y\}$ and $A = \{U \leq f(Y)/(c\,g(Y))\}$. Then $P[A] = p = 1/c$ by Eqs. (3.20)-(3.23). Also

P(B \mid A) = \frac{P(A \mid B)\, P[B]}{P[A]} (3.25)

so

P\!\left[ Y \leq y \,\Big|\, U \leq \frac{f(Y)}{c\,g(Y)} \right] = P\!\left[ U \leq \frac{f(Y)}{c\,g(Y)} \,\Big|\, Y \leq y \right] \frac{G(y)}{1/c} (3.26)
 = \frac{F(y)}{c\,G(y)} \cdot \frac{G(y)}{1/c} (3.27)
 = F(y) (3.28)

where Eq. (3.27) follows because

P\!\left[ U \leq \frac{f(Y)}{c\,g(Y)} \,\Big|\, Y \leq y \right]
 = \frac{P\!\left[ \{U \leq f(Y)/(c\,g(Y))\} \cap \{Y \leq y\} \right]}{G(y)} (3.29)
 = \frac{1}{G(y)} \int_{-\infty}^{y} P\!\left[ U \leq \frac{f(t)}{c\,g(t)} \,\Big|\, Y = t \right] g(t)\, dt (3.30)
 = \frac{1}{G(y)} \int_{-\infty}^{y} \frac{f(t)}{c\,g(t)}\, g(t)\, dt (3.31)
 = \frac{1}{c\,G(y)} \int_{-\infty}^{y} f(t)\, dt (3.32)
 = \frac{F(y)}{c\,G(y)} (3.33)

Monte Carlo Importance Sampling

Importance sampling is another pure Monte Carlo sampling technique that generates samples from a target probability density $f(x)$ [321, 293, 322, 330]. The idea behind importance sampling is to change the probability measure so that estimating the expectation is easier. It introduces a new arbitrary density called the importance proposal density $q(x)$ so that

I(f) = \int f(x)\, p(x)\, dx = \int f(x)\, w(x)\, q(x)\, dx (3.34)

where $w(x) \equiv p(x)/q(x)$ is the importance weight. Importance sampling simulates i.i.d. random draws $x_1, x_2, \ldots, x_N$ from $q(x)$ and evaluates $w(x_i)$ for each draw to compute the Monte Carlo estimate

\hat{I}_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, w(x_i) . (3.35)

An important criterion for choosing an optimal proposal density is to minimize the variance of the estimator $\hat{I}_N(f)$. Expanding the variance of $f(x)\,w(x)$ with respect to $q(x)$

\mathrm{Var}(f(x)\,w(x)) = E_{q(x)}\!\left[ f^2(x)\, w^2(x) \right] - I^2(f) (3.36)

shows that we only need to minimize the first term on the right-hand side since the second term does not depend on $q(x)$. Using Jensen's inequality to bound the term gives

E_{q(x)}\!\left[ f^2(x)\, w^2(x) \right] = E_{q(x)}\!\left[ (f(x)\, w(x))^2 \right] (3.37)
 \geq \left( E_{q(x)}\!\left[ |f(x)|\, w(x) \right] \right)^2 (3.38)
 = \left( \int |f(x)|\, p(x)\, dx \right)^2 . (3.39)

Choosing

q^*(x) = \frac{|f(x)|\, p(x)}{\int |f(x)|\, p(x)\, dx} (3.40)

achieves the lower bound and $q^*$ is called the optimal importance distribution. But this is an impractical auxiliary density because it is in general no easier to sample from $q^*$. This leads instead to the intuition that high sampling efficiency correlates with sampling from $p(x)$ in the "important" regions (where $|f(x)|\, p(x)$ is relatively large). This also implies the existence of super-efficient sampling estimates.
Super-efficient estimates can compute estimates of a given function $f(x)$ with lower variance than a perfect Monte Carlo method with $q(x) = p(x)$. Rare-event methods can exploit this property to boost the efficiency of estimating functions that focus on tail events [37].

Finding a suitable $q(x)$ becomes more difficult as the dimension of $x$ increases. Adaptive importance sampling introduces a learning mechanism to sampling. A common method uses a parameterized proposal density $q(x|\theta)$ and adapts $\theta$ to tune the evolution of the samples. A popular implementation computes the derivative of the variable term in (3.36)

D(\theta) = E_{q(x|\theta)}\!\left[ f^2(x)\, w(x|\theta)\, \frac{\partial w(x|\theta)}{\partial \theta} \right] (3.41)

and then updates $\theta$ as

\theta_{t+1} = \theta_t - \alpha\, \frac{1}{N} \sum_{i=1}^{N} f^2(x_i)\, w(x_i|\theta_t)\, \frac{\partial w(x_i|\theta_t)}{\partial \theta_t} (3.42)

where $\alpha$ is the learning rate and the samples $x_i \sim q(x|\theta_t)$. Adaptive importance sampling approaches can also use higher-order Hessian estimates to improve estimation.

Importance sampling also admits an asymptotic extension that allows application without prior knowledge of the normalizing constant of $p(x)$. The extension begins by rewriting $I(f)$ as

I(f) = \frac{\int f(x)\, w(x)\, q(x)\, dx}{\int w(x)\, q(x)\, dx} (3.43)

with $w(x) \propto p(x)/q(x)$ known only up to a normalizing constant. Then the Monte Carlo estimate becomes

I(f) = \frac{\int f(x)\, w(x)\, q(x)\, dx}{\int w(x)\, q(x)\, dx} (3.44)
 \approx \frac{\sum_{i=1}^{N} f(x_i)\, w(x_i)}{\sum_{i=1}^{N} w(x_i)} (3.45)
 = \sum_{i=1}^{N} f(x_i)\, \tilde{w}(x_i) (3.46)
 = \tilde{I}(f) (3.47)

where $\tilde{w}(x_i)$ is the normalized importance weight. $\tilde{I}(f)$ is biased for finite $N$. Under weak conditions though the strong law of large numbers gives asymptotic convergence

\tilde{I}(f) \xrightarrow{a.s.} I(f) . (3.48)

Slight constraints on $f$ and $p$ also lead to central limit theorem convergence in distribution.

It is often impossible to construct an auxiliary sampling density that drives efficient pure Monte Carlo estimation. Monte Carlo methods tackle problems by sampling from a space to construct iterative improvements to a probabilistic estimate.
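Both the plain estimator (3.35) and the self-normalized estimator (3.46) reduce to a few lines. The sketch below is our own toy example: target $p = N(0,1)$, proposal $q = N(0,2)$, and $f(x) = x^2$ so that the true value is $E_p[X^2] = 1$. The self-normalized form only replaces the divisor $N$ with the weight sum.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Gaussian density, used for both target p and proposal q."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimates(f, p_pdf, q_pdf, q_sample, n, rng):
    """Plain IS (Eqs. 3.34-3.35) and self-normalized IS (Eqs. 3.43-3.47)."""
    num = wsum = 0.0
    for _ in range(n):
        x = q_sample(rng)
        w = p_pdf(x) / q_pdf(x)     # importance weight w = p/q
        num += f(x) * w
        wsum += w
    return num / n, num / wsum      # (plain, self-normalized)

# Estimate E_p[X^2] = 1 for p = N(0,1) with a wider proposal q = N(0,2).
rng = random.Random(42)
plain, snis = importance_estimates(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sample=lambda r: r.gauss(0.0, 2.0),
    n=50000,
    rng=rng,
)
```

The wider proposal covers the tails of the target, which keeps the weights bounded; a proposal narrower than the target would instead produce occasional huge weights and a high-variance estimate.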
MCMC builds on Monte Carlo to use only local knowledge of the target density to construct estimates. MCMC sampling methods use carefully constructed random walks to explore the probability surface. MCMC algorithms use random jumps to ensure that they fuse local knowledge with novel information from new regions of the space.

3.1.3 Markov Chains Govern Transitions Between States

A Markov chain is a memoryless random process with transitions from one state to another that obey the Markov property

P(X_{t+1} = x \mid X_1 = x_1, \ldots, X_t = x_t) = P(X_{t+1} = x \mid X_t = x_t) . (3.49)

$P$ represents the single-step transition probability matrix where

P_{i,j} = P(X_{t+1} = j \mid X_t = i) (3.50)

is the probability that the chain in state $i$ at time $t$ moves to state $j$ at time $t+1$. State $j$ is accessible from state $i$ if there is some non-zero probability of transitioning from state $i$ to state $j$ ($i \to j$) in some finite number of steps:

P^{(n)}_{i,j} > 0 (3.51)

for some $n > 0$. A Markov chain is irreducible if every state is accessible from every other state [296, 242]. Irreducibility implies that for all states $i$ and $j$ there exists $m > 0$ such that $P(X_{n+m} = j \mid X_n = i) = P^{(m)}_{i,j} > 0$. This holds if and only if $P$ is a regular stochastic matrix.

The period $d_i$ of state $i$ is $d_i = \gcd\{n \geq 1 : P^{(n)}_{i,i} > 0\}$, or $d_i = \infty$ if $P^{(n)}_{i,i} = 0$ for all $n \geq 1$, where gcd denotes the greatest common divisor. State $i$ is aperiodic if $d_i = 1$. A Markov chain with transition matrix $P$ is aperiodic if $d_i = 1$ for all states $i$.

A sufficient condition for a Markov chain to have a unique stationary distribution is that the state transitions satisfy detailed balance: $\pi(j)\, P_{j \to k} = \pi(k)\, P_{k \to j}$ for all states $j$ and $k$. We can also write this as $Q(k \mid j)\,\pi(j) = Q(j \mid k)\,\pi(k)$. This is called the reversibility condition. A Markov chain is reversible if it satisfies the reversibility condition.
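The reversibility condition is easy to verify numerically for a discrete chain. The sketch below, our own toy example, builds a 6-state Metropolis-style chain with a symmetric uniform proposal and checks both detailed balance and stationarity of the target density.

```python
# Verify detailed balance pi(j) P[j->k] = pi(k) P[k->j] for a discrete
# Metropolis chain. The 6-state target below is our own toy example.
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Z = sum(weights)
pi = [w / Z for w in weights]
N = len(pi)

# Symmetric proposal: pick any state uniformly; Metropolis acceptance
# min(1, pi(k)/pi(j)). Rejected mass stays on the diagonal.
P = [[0.0] * N for _ in range(N)]
for j in range(N):
    for k in range(N):
        if k != j:
            P[j][k] = (1.0 / N) * min(1.0, pi[k] / pi[j])
    P[j][j] = 1.0 - sum(P[j])

db_ok = all(abs(pi[j] * P[j][k] - pi[k] * P[k][j]) < 1e-12
            for j in range(N) for k in range(N))

# Detailed balance implies stationarity: pi P = pi.
piP = [sum(pi[j] * P[j][k] for j in range(N)) for k in range(N)]
stationary_ok = all(abs(piP[k] - pi[k]) < 1e-12 for k in range(N))
```

Detailed balance holds because $\pi(j)\,P_{j \to k} = \min(\pi(j), \pi(k))/N$ is symmetric in $j$ and $k$, and stationarity then follows by summing over $j$.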
Markov chain Monte Carlo algorithms exploit the Markov convergence guarantee to construct Markov chains with samples drawn from complex probability densities. But MCMC methods suffer from problem-specific parameters that govern sample acceptance and convergence assessment [335, 127]. Strong dependence on initial conditions also biases the MCMC sampling unless the simulation allows a lengthy period of "burn-in" so that the driving Markov chain can mix adequately.

3.1.4 MCMC in Bayesian Inference

We can invoke Bayes theorem

P(H \mid E) = \frac{P(H)\, P(E \mid H)}{P(E)} (3.52)

to recast intractable decision and optimization problems into a probabilistic estimation framework. The engine of Bayesian estimation also leads to the proportional form of Bayes theorem that drives big-data applications:

f(\theta \mid x) \propto g(x \mid \theta)\, h(\theta) (3.53)

Bayesian inference is a computational challenge. It often appeals to Monte Carlo approximations to estimate complex integrals. Monte Carlo methods exploit the fact that averaging a function of uniform samples converges to the integral of the function on $[0,1]$:

\frac{1}{N} \sum_{k=1}^{N} f(U_k) \to \int_0^1 f(x)\, dx . (3.54)

Bayesian inference often must evaluate expectations of complex high-dimensional pdfs:

E_X[g(x)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx . (3.55)

Monte Carlo samples from the region of integration to approximate these integrals. Monte Carlo usually employs one of two sampling methods: i.i.d. sampling (i.i.d. MC) or dependent sampling (MCMC). Naive i.i.d. Monte Carlo implementations produce estimates but accuracy improvements come at the costly price of much more data. Naive sampling methods display $1/\sqrt{N}$ convergence. That means that a 10-fold estimate improvement (one more decimal point) requires 100 times more samples. MCMC sampling methods use carefully constructed random walks to explore the probability surface. Dependence between samples ensures that the algorithm prefers sampling from high-probability regions. Thus MCMC algorithms generate more efficient estimates by using fewer samples.
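The $1/\sqrt{N}$ scaling is simple to see empirically. The sketch below, our own toy example, estimates $\int_0^1 x^2\,dx = 1/3$ with naive i.i.d. Monte Carlo and measures the RMS error over repeated trials: multiplying the sample size by 100 shrinks the RMS error by roughly a factor of 10.

```python
import random

def mc_integral(n, rng):
    """Naive MC estimate of the integral of x^2 on [0,1], true value 1/3 (Eq. 3.54)."""
    return sum(rng.random() ** 2 for _ in range(n)) / n

def rms_error(n, trials, rng):
    """RMS error of the estimator over repeated independent trials."""
    errs = [(mc_integral(n, rng) - 1.0 / 3.0) ** 2 for _ in range(trials)]
    return (sum(errs) / trials) ** 0.5

rng = random.Random(3)
rms_100 = rms_error(100, 200, rng)
rms_10000 = rms_error(10000, 200, rng)
ratio = rms_100 / rms_10000   # about 10: 100x the samples buys one decimal digit
```

This is the baseline that MCMC tries to beat in practice: the constant in front of $1/\sqrt{N}$, not the exponent, is what dependent sampling from high-probability regions improves.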
3.2 MCMC Convergence

The most difficult question to answer within an MCMC application is: when to stop sampling? MCMC guarantees with very few conditions that it will converge to the target estimate given enough time. In this section I develop results that formalize the asymptotic convergence of MCMC. I also describe the proof of the key fact that establishes the geometric convergence rate of MCMC algorithms.

3.2.1 Aperiodic $\phi$-irreducible MCMC Converges to a Unique Stationary Density

Convergence analysis usually hinges upon measuring the "distance" between the time-evolved Markov chain and the target density. The total variation distance is the most common distance measure between these two densities in the MCMC literature.

Definition 3.1. The total variation distance between two probability measures $\mu_1(\cdot)$ and $\mu_2(\cdot)$ is:

\|\mu_1(\cdot) - \mu_2(\cdot)\| = \sup_A |\mu_1(A) - \mu_2(A)| (3.56)

Then the question of convergence resolves to: is $\lim_{n\to\infty} \|P^n(x,\cdot) - \pi(\cdot)\| = 0$? And the question of convergence rate reduces to: given $\epsilon > 0$ how large must $n$ be so that $\|P^n(x,\cdot) - \pi(\cdot)\| < \epsilon$?

Proposition 3.1 lists several nice properties of the total variation distance metric (proof in [299]).

Proposition 3.1.

(a) \|\mu_1(\cdot) - \mu_2(\cdot)\| = \sup_{f : \mathcal{X} \to [0,1]} \left| \int f\, d\mu_1 - \int f\, d\mu_2 \right| . (3.57)

(b) \|\mu_1(\cdot) - \mu_2(\cdot)\| = \frac{1}{b-a} \sup_{f : \mathcal{X} \to [a,b]} \left| \int f\, d\mu_1 - \int f\, d\mu_2 \right| (3.58)

for any $a < b$, and in particular

\|\mu_1(\cdot) - \mu_2(\cdot)\| = \frac{1}{2} \sup_{f : \mathcal{X} \to [-1,1]} \left| \int f\, d\mu_1 - \int f\, d\mu_2 \right| . (3.59)

(c) If $\pi(\cdot)$ is stationary for a Markov chain kernel $P$, then $\|P^n(x,\cdot) - \pi(\cdot)\|$ is non-increasing in $n$, that is

\|P^n(x,\cdot) - \pi(\cdot)\| \leq \|P^{n-1}(x,\cdot) - \pi(\cdot)\| (3.60)

for all $n \in \mathbb{N}$.

(d) More generally, letting $(\mu_i P)(A) = \int \mu_i(dx)\, P(x, A)$ gives

\|(\mu_1 P)(\cdot) - (\mu_2 P)(\cdot)\| \leq \|\mu_1(\cdot) - \mu_2(\cdot)\| . (3.61)

(e) Let

t(n) = 2 \sup_{x \in \mathcal{X}} \|P^n(x,\cdot) - \pi(\cdot)\| , (3.62)

where $\pi(\cdot)$ is stationary. Then $t$ is sub-multiplicative:

t(m+n) \leq t(m)\, t(n) (3.63)

for all $m, n \in \mathbb{N}$.
(f) If $\mu(\cdot)$ and $\nu(\cdot)$ have densities $g$ and $h$ with respect to some $\sigma$-finite measure $\rho(\cdot)$, and $M = \max(g, h)$ and $m = \min(g, h)$, then

\|\mu(\cdot) - \nu(\cdot)\| = \frac{1}{2} \int_{\mathcal{X}} (M - m)\, d\rho (3.64)
 = 1 - \int_{\mathcal{X}} m\, d\rho . (3.65)

(g) Given probability measures $\mu(\cdot)$ and $\nu(\cdot)$, there are jointly defined random variables $X$ and $Y$ such that $X \sim \mu(\cdot)$, $Y \sim \nu(\cdot)$, and

P[X = Y] = 1 - \|\mu(\cdot) - \nu(\cdot)\| . (3.66)

The existence of a stationary distribution $\pi(\cdot)$ is not sufficient for Markov chain convergence. I next introduce the concept of $\phi$-irreducibility, which generalizes the concept of irreducibility from finite state-space Markov chain theory.

Definition 3.2. A chain is $\phi$-irreducible if there exists a non-zero $\sigma$-finite measure $\phi$ on $\mathcal{X}$ such that for all $A \subseteq \mathcal{X}$ with $\phi(A) > 0$ and for all $x \in \mathcal{X}$ there exists a positive integer $n = n(x, A)$ such that $P^n(x, A) > 0$.

This condition requires that the chain can reach every set of non-zero probability within a finite number of transitions. It implies that a chain with a single state reachable from everywhere is $\phi$-irreducible. But $\phi$-irreducibility is also not sufficient for convergence since periodicity can corrupt the chain. I introduce the concept of aperiodicity of a $\phi$-irreducible Markov chain as a generalization of the concept from finite state-space Markov chain theory.

Definition 3.3. A Markov chain with stationary distribution $\pi(\cdot)$ is aperiodic if there do not exist $d \geq 2$ and disjoint subsets $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_d \subseteq \mathcal{X}$ with $P(x, \mathcal{X}_{i+1}) = 1$ for all $x \in \mathcal{X}_i$ ($1 \leq i \leq d-1$) and $P(x, \mathcal{X}_1) = 1$ for all $x \in \mathcal{X}_d$ such that $\pi(\mathcal{X}_1) > 0$ (and hence $\pi(\mathcal{X}_i) > 0$ for all $i$). Otherwise the chain is periodic, with period $d$, and has periodic decomposition $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_d$.

The main MCMC asymptotic convergence result follows. The theorem imposes the assumption that the state space's $\sigma$-algebra is countably generated. This is a weak assumption in general because it holds for any countable state space or any subset of $\mathbb{R}^d$ with the usual Borel $\sigma$-algebra (i.e. the $\sigma$-algebra generated by the balls with rational centers and rational radii). The proof of this main theorem relies on a number of results and makes use of a coupling construction. I present the proof components after the statement of the theorem and a corollary.
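For discrete distributions the total variation distance of Definition 3.1 reduces to half the $L^1$ distance, as in Proposition 3.1(f), and the supremum over sets is attained where $\mu_1$ exceeds $\mu_2$. The sketch below, our own toy example, checks this and the contraction property of Proposition 3.1(d) under a stochastic matrix.

```python
def total_variation(mu1, mu2):
    """TV distance for discrete distributions:
    sup_A |mu1(A) - mu2(A)| = (1/2) sum_x |mu1(x) - mu2(x)|  (Prop. 3.1(f))."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu1, mu2))

mu1 = [0.5, 0.3, 0.2]
mu2 = [0.25, 0.25, 0.5]
tv = total_variation(mu1, mu2)

# The supremum over sets A is attained at A = {x : mu1(x) > mu2(x)}.
A_mass = sum(a - b for a, b in zip(mu1, mu2) if a > b)

# Proposition 3.1(d): pushing both measures through a stochastic matrix
# can only shrink (never grow) the distance.
P = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
push = lambda mu: [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
tv_after = total_variation(push(mu1), push(mu2))
```

This contraction is exactly why the total variation distance in (3.60) is non-increasing along the chain.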
The proof of this main theorem relies on a number of results and makes use of a coupling construction. I present the proof components after the statement of the proof and a corollary. Theorem 3.4. If a Markov chain on a state space with countably generated-algebra is-irreducible and aperiodic, and has a stationary distribution() then lim n!1 P n (x;)() = 0 (3.67) for-a.e. x2X . And lim n!1 P n (x; A) =(A) (3.68) for all measurable AX in particular. A corollary shows that functions of the probability space also obey a nice form of convergence. Corollary 3.1. Under the conditions of theorem 3.4. If h :X!R with(jhj)<1 then a strong law of large numbers holds lim n!1 1 n n X i=1 h(X i ) =(h) with probability 1: (3.69) 71 MCMC algorithms adhere precisely to the -irreducibility and aperiodicity con- straints in theorem 3.4 to construct chains that both have a stationary density () and converge to that density. 3.2.2 Coupling Constructions for Convergence Proofs In this section I present a method of proving the main convergence theorem using the concept of chain coupling. Coupling arguments are well suited to analyzing MCMC algorithms. The basic coupling idea is to consider the distance between two jointly defined random variables on a spaceX . Suppose we have two random variables X and Y jointly defined on a spaceX . Denote their probability densities byL (X) andL (Y) and consider their total-variation distance kL (X)L (Y)k = sup A jP[X2 A] P[Y2 A]j (3.70) = sup A jP[X2 A; X = Y]+ P[X2 X; X, Y] P[Y2 A;Y = X] P[Y2 A;Y, X]j (3.71) = sup A jP[X2 A; X, Y] P[Y2 A;Y, X]j (3.72) P[X, Y] (3.73) So the total variation distance between the two random variables is bounded by the probability that they are unequal kL (X)L (Y)k P[X, Y] (3.74) Coupling arguments make extensive use of small sets to bound iterated probabilities. Definition 3.4. 
A subset $C \subseteq \mathcal{X}$ is small if there exist a positive integer $n_0$, an $\epsilon > 0$, and a probability measure $\nu(\cdot)$ on $\mathcal{X}$ such that the following minorization condition holds:

P^{n_0}(x,\cdot) \geq \epsilon\,\nu(\cdot) (3.75)

for all $x \in C$. That is

P^{n_0}(x, A) \geq \epsilon\,\nu(A) (3.76)

for any measurable $A \subseteq \mathcal{X}$.

Now suppose that $C$ is a small set. The following splitting technique [262, 9, 243] runs two copies $\{X_n\}$ and $\{X'_n\}$ of the Markov chain. Each copy follows the usual update rule $P(x,\cdot)$ but the joint construction gives the pair a high probability of becoming equal.

1. Initialize $X_0 = x$ and $X'_0 \sim \pi(\cdot)$. Set $n = 0$ and repeat the following conditional forever:

2. If $X_n = X'_n$ then choose $X_{n+1} = X'_{n+1} \sim P(X_n,\cdot)$ and set $n = n+1$. Else if $(X_n, X'_n) \in C \times C$ then:

(a) with probability $\epsilon$ choose $X_{n+n_0} = X'_{n+n_0} \sim \nu(\cdot)$;

(b) else with probability $1-\epsilon$ choose conditionally independently

X_{n+n_0} \sim \frac{1}{1-\epsilon} \left[ P^{n_0}(X_n,\cdot) - \epsilon\,\nu(\cdot) \right] (3.77)
X'_{n+n_0} \sim \frac{1}{1-\epsilon} \left[ P^{n_0}(X'_n,\cdot) - \epsilon\,\nu(\cdot) \right] . (3.78)

For $n_0 > 1$ iteratively construct $X_{n+1}, \ldots, X_{n+n_0-1}$ from the correct conditional distributions given $X_n$ and $X_{n+n_0}$, and $X'_{n+1}, \ldots, X'_{n+n_0-1}$ from the correct conditional distributions given $X'_n$ and $X'_{n+n_0}$. Set $n = n + n_0$.

Else conditionally independently choose $X_{n+1} \sim P(X_n,\cdot)$ and $X'_{n+1} \sim P(X'_n,\cdot)$. Set $n = n+1$.

The construction ensures that $X_n$ and $X'_n$ each evolve according to the correct transition kernel $P$. Thus $P[X_n \in A] = P^n(x, A)$ and $P[X'_n \in A] = \pi(A)$ for all $n$. Note that the chains run independently until they both enter $C$. Then the minorization splitting occurs to give a chance of successful coupling. The coupling inequality gives

\|P^n(x,\cdot) - \pi(\cdot)\| \leq P[X_n \neq X'_n] . (3.79)

But the main convergence theorem 3.4 does not assume that any small set $C$ exists. Jain and Jamison [164] provide a useful result showing the existence of small sets. The main idea behind the theorem is to decompose $P^{n_0}(x,\cdot)$ into a portion absolutely continuous with respect to the measure $\phi$ and then find a set $C$ with $\phi(C) > 0$ such that the density part is at least $\epsilon > 0$ over $C$.

Theorem 3.5.
Every $\phi$-irreducible Markov chain on a state space with countably generated $\sigma$-algebra contains a small set $C \subseteq \mathcal{X}$ with $\phi(C) > 0$. The minorization measure $\nu(\cdot)$ specified in defining the small set $C$ satisfies $\nu(C) > 0$.

To achieve the convergence result it suffices to show that the coupled pair $(X_n, X'_n)$ hits $C \times C$ infinitely often. The pair will then have infinitely many opportunities to couple, each with probability $\epsilon > 0$. Hence they will eventually couple with probability 1, and this proves the theorem. I present the conclusion of the proof as a series of lemmas outlined in [302].

Lemma 3.1. Consider a Markov chain on a state space $\mathcal{X}$ having stationary distribution $\pi(\cdot)$. Suppose that $P_x[\tau_A < \infty] > 0$ for all $x \in \mathcal{X}$ and some $A \subseteq \mathcal{X}$, where $\tau_A$ denotes the first hitting time of $A$. Then $P_x[\tau_A < \infty] = 1$ for $\pi$-almost-every $x \in \mathcal{X}$.

Proof. Suppose the contrary of the conclusion. Then

\pi\left( x \in \mathcal{X} : P_x[\tau_A = \infty] > 0 \right) > 0 . (3.80)

The following two claims (their proofs follow) provide lower bounds on the probabilities.

Claim 3.1. Condition (3.80) implies there are constants $\ell, \ell' \in \mathbb{N}$, $\delta > 0$, and $B \subseteq \mathcal{X}$ with $\pi(B) > 0$ such that

P_x\left[ \tau_A = \infty,\ \sup\{k \geq 1 : X_{k\ell'} \in B\} < \ell \right] \geq \delta (3.81)

for $x \in B$.

Claim 3.2. Define $B$, $\ell$, $\ell'$ as in Claim 3.1. Let $L = \ell\ell'$ and let $S = \sup\{k \geq 1 : X_{kL} \in B\}$ with the convention that $S = 0$ if the set $\{k \geq 1 : X_{kL} \in B\}$ is empty. Then

\int_{x \in \mathcal{X}} \pi(dx)\, P_x\left[ S = r,\ jL < \tau_A \right] \geq \delta\,\pi(B) (3.82)

for all integers $1 \leq r \leq j$.

Assuming the claims we have by stationarity that for any $j \in \mathbb{N}$

\pi(A^C) = \int_{x \in \mathcal{X}} \pi(dx)\, P^{jL}(x, A^C) (3.83)
 \geq \int_{x \in \mathcal{X}} \pi(dx)\, P_x\left[ jL < \tau_A \right] (3.84)
 \geq \sum_{r=1}^{j} \int_{x \in \mathcal{X}} \pi(dx)\, P_x\left[ S = r,\ jL < \tau_A \right] (3.85)
 \geq \sum_{r=1}^{j} \delta\,\pi(B) (3.86)
 = j\,\delta\,\pi(B) . (3.87)

But $j > 1/(\delta\,\pi(B))$ then gives $\pi(A^C) > 1$, which is impossible. The contradiction completes the proof of the lemma.

Claim 3.1 (restated). Condition (3.80) implies there are constants $\ell, \ell' \in \mathbb{N}$, $\delta > 0$, and $B \subseteq \mathcal{X}$ with $\pi(B) > 0$ such that

P_x\left[ \tau_A = \infty,\ \sup\{k \geq 1 : X_{k\ell'} \in B\} < \ell \right] \geq \delta (3.88)

for $x \in B$.

Proof. By (3.80) there are $\delta_1 > 0$ and a subset $B_1 \subseteq \mathcal{X}$ with $\pi(B_1) > 0$ such that $P_x[\tau_A < \infty] \leq 1 - \delta_1$ for all $x \in B_1$. But $P_x[\tau_A < \infty] > 0$ for all $x \in \mathcal{X}$ so there are $\ell' \in \mathbb{N}$, $\delta_2 > 0$, and $B_2 \subseteq B_1$ with $\pi(B_2) > 0$ and with $P^{\ell'}(x, A) \geq \delta_2$ for all $x \in B_2$.
Set $\eta = \#\{k \ge 1 : X_{k\ell'} \in B_2\}$. Then

$P_x\left[\tau_A = \infty,\ \eta \ge r\right] \le (1 - \delta_2)^r$   (3.89)

for all $r \in \mathbb{N}$ and $x \in \mathcal{X}$. In particular $P_x[\tau_A = \infty,\ \eta = \infty] = 0$. Hence

$P_x\left[\tau_A = \infty,\ \eta < \infty\right] = 1 - P_x\left[\tau_A = \infty,\ \eta = \infty\right] - P_x\left[\tau_A < \infty\right]$   (3.90)

$\ge 1 - 0 - (1 - \delta_1) = \delta_1$   (3.91)

for $x \in B_2$. Therefore there are $\ell \in \mathbb{N}$, $\delta > 0$, and $B \subseteq B_2$ with $\pi(B) > 0$ such that

$P_x\left[\tau_A = \infty,\ \sup\{k \ge 1 : X_{k\ell'} \in B_2\} < \ell\right] \ge \delta$   (3.92)

for $x \in B$. But $B \subseteq B_2$ so

$\sup\{k \ge 1 : X_{k\ell'} \in B\} \le \sup\{k \ge 1 : X_{k\ell'} \in B_2\}$   (3.93)

and the claim follows.

Claim 3.2. Define $B$, $\ell$, $\ell'$, and $\delta$ as in Claim 3.1. Let $L = \ell\ell'$ and let $S = \sup\{k \ge 1 : X_{kL} \in B\}$ with the convention that $S = 0$ if the set $\{k \ge 1 : X_{kL} \in B\}$ is empty. Then

$\int_{x \in \mathcal{X}} \pi(dx)\, P_x\left[S = r,\ X_{jL} \notin A\right] \ge \delta\,\pi(B)$   (3.94)

for all integers $1 \le r \le j$.

Proof. Using stationarity and Claim 3.1:

$\int_{x\in\mathcal{X}} \pi(dx)\, P_x\left[S = r,\ X_{jL} \notin A\right]$

$= \int_{x\in\mathcal{X}} \pi(dx) \int_{y\in B} P^{rL}(x, dy)\, P_y\left[S = 0,\ X_{(j-r)L} \notin A\right]$   (3.95)

$= \int_{x\in\mathcal{X}} \int_{y\in B} \pi(dx)\, P^{rL}(x, dy)\, P_y\left[S = 0,\ X_{(j-r)L} \notin A\right]$   (3.96)

$= \int_{y\in B} \pi(dy)\, P_y\left[S = 0,\ X_{(j-r)L} \notin A\right]$   (3.97)

$\ge \int_{y\in B} \pi(dy)\, \delta$   (3.98)

$= \delta\,\pi(B).$   (3.99)

The proof of Theorem 3.4 proceeds by choosing a small set from Theorem 3.5. Then consider the coupling construction $\{(X_n, X'_n)\}$. Let $G \subseteq \mathcal{X}\times\mathcal{X}$ be the set of $(x, y)$ such that $P_{x,y}(\exists\, n \ge 1 : X_n = X'_n) = 1$. Thus if $(X_0, X'_0) \in G$ then $\lim_{n\to\infty} P[X_n = X'_n] = 1$ by
Thus the joint chain repeatedly returns to C C with probability 1 until X n = X 0 n . The coupling construction leads to the chain coupling with probability p > 0 each time it is in CC. Thus the chain will eventually couple and X n = X 0 n . Thereby proving (G) = 1. But if G < 1 then G C = Z x2X ( dx) G C x (3.103) = Z G C ( dx)[1(G x )] (3.104) > 0: (3.105) This contradicts the fact that(G) = 1. 77 The final lemma in the proof uses the notion of a petite set to describe the aperiod- icity of a Markov chain. Definition 3.5. A subset CX is petite relative to a small set C if there exists a positive integer n 0 , an> 0, and a probability measure() onX such that n 0 X i=1 P i (x;)() (3.106) for all x2 C. A petite set is intuitively like a small set except that it allows dierent states in C to cover the minorization measure () at dierent times i. Every small set is petite. But a petite set is not in general small because it does not rule out periodic chains from returning to a particular set at fixed recurring intervals. Lemma 3.3. Consider an aperiodic Markov chain on a state spaceX with stationary distribution(). Let() be a probability measure onX . Assume that()() and that there is n = n(x)2N and =(x)> 0 such that P n (x;)() for all x2X . Note that this condition holds trivially if() is a minorization measure for a small or petite set that is reachable from all states. Let T = ( n 1;9 n such that Z ( dx) P n (x;) n () ) : (3.107) Then there is n 2N with Tfn ;n + 1;n + 2;:::g. Proof. T is non-empty since P (n(x)) (x;)(x)() for all x2X . Now suppose n;m2 T. Then T is additive (i.e. if n;m2 T then n+ M2 T) because Z x2X ( dx) P n+m (x;) = Z x2X Z y2X ( dx) P n (x; dy) P m (y;) (3.108) Z y2X n ( dy) P m (y;) (3.109) n m (): (3.110) 78 Now we proceed to show gcd(T) = 1. If T is non-empty and additive, and if gcd(T) = 1 then there is an n 2N such that Tfn ;n + 1;n + 2;:::g [27, 304] Suppose instead that gcd(T) = d> 1. 
For $1 \le i \le d$ define

$\mathcal{X}_i = \left\{x \in \mathcal{X} : \exists\,\ell \in \mathbb{N},\ \delta > 0 \text{ such that } P^{\ell d + i}(x, \cdot) \ge \delta\,\nu(\cdot)\right\}.$   (3.111)

Note that

$\bigcup_{i=1}^{d} \mathcal{X}_i = \mathcal{X}$   (3.112)

by construction. Let

$S_\cup = \bigcup_{i \ne j} \mathcal{X}_i \cap \mathcal{X}_j$   (3.113)

be the union of the pairwise intersections between the $\mathcal{X}_i$. Then let

$S = S_\cup \cup \left\{x \in \mathcal{X} : \exists\,m \in \mathbb{N} \text{ such that } P^m(x, S_\cup) > 0\right\}$   (3.114)

and

$\mathcal{X}'_i = \mathcal{X}_i \setminus S.$   (3.115)

$\mathcal{X}'_1, \mathcal{X}'_2, \ldots, \mathcal{X}'_d$ are disjoint by construction. And if $x \in \mathcal{X}'_i$ then $P(x, S) = 0$ so that $P\left(x, \bigcup_{j=1}^d \mathcal{X}'_j\right) = 1$. It must also be that $P\left(x, \mathcal{X}'_{i+1}\right) = 1$ (with the index taken mod $d$) because otherwise some state would lie in $\mathcal{X}'_{j_1} \cap \mathcal{X}'_{j_2}$ for $j_1 \ne j_2$, which contradicts the disjointness requirement.

It also follows for all $m \ge 0$ that $\int \nu(dx)\, P^m(x, \mathcal{X}_i \cap \mathcal{X}_j) = 0$ if $i \ne j$. For if $\int \nu(dx)\, P^m(x, \mathcal{X}_i \cap \mathcal{X}_j) > 0$ for some $i \ne j$ then we could find an $S_0 \subseteq \mathcal{X}_i \cap \mathcal{X}_j$, $\ell_1, \ell_2 \in \mathbb{N}$, and $\delta > 0$ such that

$P^{\ell_1 d + i}(x, \cdot) \ge \delta\,\nu(\cdot)$   (3.116)

$P^{\ell_2 d + j}(x, \cdot) \ge \delta\,\nu(\cdot)$   (3.117)

for all $x \in S_0$. This would imply that $\ell_1 d + i + m \in T$ and $\ell_2 d + j + m \in T$. These two elements differ by $(\ell_1 - \ell_2)d + i - j \not\equiv 0 \pmod{d}$ and this contradicts $\gcd(T) = d$.

Therefore $\pi(S) = 0$ by the sub-additivity of measures. So

$\pi\left(\bigcup_{i=1}^d \mathcal{X}'_i\right) = \pi\left(\bigcup_{i=1}^d \mathcal{X}_i\right)$   (3.119)

$= \pi(\mathcal{X})$   (3.120)

$= 1.$   (3.121)

Therefore

$\pi\left(\mathcal{X}'_i\right) > 0$   (3.122)

for at least one $i$, and the cyclic transitions $P\left(x, \mathcal{X}'_{i+1}\right) = 1$ then force each of $\mathcal{X}'_1, \ldots, \mathcal{X}'_d$ to have positive $\pi$-measure. But this implies that the Markov chain is periodic with period $d$ and this contradicts the assumption of aperiodicity.

3.2.3 MCMC Exhibits Uniform Ergodicity

The previous section resolved the question of whether a particular MCMC chain converges to a unique stationary distribution: any $\phi$-irreducible and aperiodic Markov chain converges to the target density. The question remains how fast it gets there. In this section I present results that bound the convergence rate. This section states and proves theorems that first show that the convergence is uniformly ergodic. I then show that the convergence is in fact geometric. I also present several recent theorems [78] that attempt to better characterize the convergence of Metropolis-Hastings algorithms specifically and discuss how these may lead to potential noisy Metropolis-Hastings convergence bounds.
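The rate question is concrete for a finite chain where everything is computable in closed form. The following sketch uses a hypothetical 3-state kernel (the numbers are illustrative only) and computes the worst-case total-variation distance $\max_x \|P^n(x,\cdot) - \pi(\cdot)\|$ exactly, showing its geometric decay:

```python
import numpy as np

# Toy 3-state transition kernel (rows sum to 1) chosen only for illustration.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# The stationary pdf pi solves pi P = pi: take the left eigenvector of P
# with eigenvalue 1 and normalize it to sum to 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# Total-variation distance max_x ||P^n(x, .) - pi|| decays geometrically
# at a rate set by the second-largest eigenvalue modulus (here 0.3).
Pn = np.eye(3)
for n in range(1, 11):
    Pn = Pn @ P
    tv = 0.5 * np.abs(Pn - pi).sum(axis=1).max()
    print(n, tv)
```

The printed distances shrink by roughly the factor 0.3 per step, which is the modulus of the kernel's second-largest eigenvalue. The theorems below give bounds of exactly this geometric form without requiring the spectrum.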
Theorem 3.4 shows asymptotic convergence to stationarity.

Definition 3.6. A Markov chain having a stationary distribution $\pi(\cdot)$ is *uniformly ergodic* if

$\|P^n(x, \cdot) - \pi(\cdot)\| \le M\rho^n$   (3.123)

for $n = 1, 2, 3, \ldots$ for some $\rho < 1$ and $M < \infty$.

An equivalent condition for uniform ergodicity follows.

Proposition 3.2. A Markov chain with stationary distribution $\pi(\cdot)$ is uniformly ergodic if and only if

$\sup_{x\in\mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\| < \frac{1}{2}$   (3.124)

for some $n \in \mathbb{N}$.

Proof. First the forward direction. Suppose the chain is uniformly ergodic. Then

$\lim_{n\to\infty} \sup_{x\in\mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\| \le \lim_{n\to\infty} M\rho^n$   (3.125)

$= 0.$   (3.126)

Therefore

$\sup_{x\in\mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\| < \frac{1}{2}$   (3.127)

for large enough $n$. Conversely suppose

$\sup_{x\in\mathcal{X}} \|P^n(x, \cdot) - \pi(\cdot)\| < \frac{1}{2}$   (3.128)

for some $n \in \mathbb{N}$. Then $\beta = d(n) < 1$ by Proposition 3.1(e). So

$d(jn) \le (d(n))^j$   (3.129)

$= \beta^j$   (3.130)

for all $j \in \mathbb{N}$. Thus Proposition 3.1(c) gives

$\|P^m(x, \cdot) - \pi(\cdot)\| \le \left\|P^{\lfloor m/n \rfloor n}(x, \cdot) - \pi(\cdot)\right\|$   (3.131)

$\le \frac{1}{2}\, d\!\left(\lfloor m/n \rfloor\, n\right)$   (3.132)

$\le \beta^{\lfloor m/n \rfloor}$   (3.133)

$\le \frac{1}{\beta}\left(\beta^{1/n}\right)^m.$   (3.134)

Therefore the chain is uniformly ergodic with $M = \frac{1}{\beta} < \infty$ and $\rho = \beta^{1/n} < 1$.

Doeblin [84] and Doob [86] showed that MCMC guarantees uniform ergodicity under weak assumptions on the Markov chain. The following theorem uses the coupling construction to provide a meaningful bound on $\|P^n(x,\cdot) - \pi(\cdot)\|$. It shows that

$\|P^n(x, \cdot) - \pi(\cdot)\| \le (1 - \epsilon)^{\lfloor n/n_0 \rfloor}.$   (3.135)

This gives direct insight into MCMC convergence in some contexts since one can compute an $n$ that bounds $\|P^n(x,\cdot) - \pi(\cdot)\|$ once $n_0$ and $\epsilon$ are known.

Theorem 3.6. Consider a Markov chain with invariant probability distribution $\pi(\cdot)$. Suppose that the small-set minorization condition is satisfied for some $n_0 \in \mathbb{N}$, $\epsilon > 0$, and probability measure $\nu(\cdot)$ in the special case $C = \mathcal{X}$. Then the chain is uniformly ergodic and

$\|P^n(x, \cdot) - \pi(\cdot)\| \le (1 - \epsilon)^{\lfloor n/n_0 \rfloor}$   (3.136)

for all $x \in \mathcal{X}$, where $\lfloor r \rfloor$ is the greatest integer not exceeding $r$.

Proof. $C = \mathcal{X}$ so every $n_0$ iterations have probability at least $\epsilon$ of making $X_n$ and $X'_n$ equal. Thus if $n = n_0 m$ then

$P\left[X_n \ne X'_n\right] \le (1 - \epsilon)^m.$   (3.137)

Therefore

$\|P^n(x, \cdot) - \pi(\cdot)\| \le P\left[X_n \ne X'_n\right]$   (3.138)

$\le (1 - \epsilon)^{\lfloor n/n_0 \rfloor}$   (3.139)

by the coupling inequality.
Thus it follows from Proposition 3.1(c) that

$\|P^n(x, \cdot) - \pi(\cdot)\| \le (1 - \epsilon)^{\lfloor n/n_0 \rfloor}$   (3.140)

for all $n$.

3.2.4 MCMC Sometimes Exhibits Geometric Ergodicity

Geometric convergence is a weaker condition than uniform ergodicity.

Definition 3.7. A Markov chain with stationary distribution $\pi(\cdot)$ is *geometrically ergodic* if

$\|P^n(x, \cdot) - \pi(\cdot)\| \le M(x)\,\rho^n$   (3.141)

for $n = 1, 2, 3, \ldots$ for some $\rho < 1$, where $M(x) < \infty$ for $\pi$-a.e. $x \in \mathcal{X}$.

Geometric ergodicity lets the scaling factor $M$ depend on the initial state $x$. All irreducible and aperiodic Markov chains on a finite state space $\mathcal{X}$ are both geometrically and uniformly ergodic. But this does not always hold for an infinite state space $\mathcal{X}$. For instance the symmetric random-walk Metropolis algorithm is geometrically ergodic if and only if $\pi(\cdot)$ has finite exponential moments. For a general Markov chain geometric ergodicity depends on a drift condition.

Definition 3.8. A Markov chain satisfies a *drift condition* if there are constants $0 < \lambda < 1$ and $b < \infty$, and a function $V : \mathcal{X} \to [1, \infty]$, such that

$PV(x) \le \lambda V(x) + b\,\mathbf{1}_C(x)$   (3.142)

for all $x \in \mathcal{X}$, where $PV(x) = \int_{\mathcal{X}} V(y)\,P(x, dy)$.

The following proof uses a coupling construction to show conditions sufficient to guarantee geometric ergodicity.

Theorem 3.7. Consider a $\phi$-irreducible and aperiodic Markov chain with stationary density $\pi(\cdot)$. Suppose that some set $C \subseteq \mathcal{X}$, $\epsilon > 0$, and probability measure $\nu(\cdot)$ satisfy the minorization condition (3.75). Suppose further that the chain satisfies the drift condition (3.142) for some constants $0 < \lambda < 1$ and $b < \infty$, and a function $V : \mathcal{X} \to [1, \infty]$ with $V(x) < \infty$ for at least one $x \in \mathcal{X}$. Then the chain is geometrically ergodic.

The proof makes use of several auxiliary results which follow. The first theorem requires an augmented bivariate drift condition of the form

$\bar{P}h(x, y) \le \frac{h(x, y)}{\alpha}$   (3.143)

for $(x, y) \notin C \times C$, for some function $h : \mathcal{X}\times\mathcal{X} \to [1, \infty)$ and some $\alpha > 1$, where

$\bar{P}h(x, y) \equiv \int_{\mathcal{X}}\int_{\mathcal{X}} h(z, w)\,P(x, dz)\,P(y, dw).$   (3.144)

It relies on the following proposition to relate the univariate and bivariate drift conditions.

Proposition 3.3. Suppose that $V : \mathcal{X} \to$
$[1, \infty]$ satisfies the univariate drift condition (3.142) for some $C \subseteq \mathcal{X}$, $\lambda < 1$, and $b < \infty$. Let $d = \inf_{x \notin C} V(x)$. Then the bivariate drift condition (3.143) is satisfied for the same $C$ with $h(x, y) = \frac{1}{2}\left[V(x) + V(y)\right]$ and

$\frac{1}{\alpha} = \lambda + \frac{b}{d + 1} < 1 \quad \text{if} \quad d > \frac{b}{1 - \lambda} - 1.$   (3.145)

Proof. Suppose $(x, y) \notin C\times C$. Then either $x \notin C$ or $y \notin C$, so $h(x, y) \ge \frac{1+d}{2}$ and $PV(x) + PV(y) \le \lambda\left[V(x) + V(y)\right] + b$. Then

$\bar{P}h(x, y) = \frac{1}{2}\left[PV(x) + PV(y)\right]$   (3.146)

$\le \frac{\lambda}{2}\left[V(x) + V(y)\right] + \frac{b}{2}$   (3.147)

$= \lambda h(x, y) + \frac{b}{2}$   (3.148)

$\le \lambda h(x, y) + \frac{b}{2}\left[\frac{h(x, y)}{(1+d)/2}\right]$   (3.149)

$= \left[\lambda + \frac{b}{1 + d}\right] h(x, y).$   (3.150)

Thus $\lambda + \frac{b}{1+d} < 1$ since $d > \frac{b}{1-\lambda} - 1$.

Theorem 3.8. Consider a Markov chain on a state space $\mathcal{X}$ with transition kernel $P$. Suppose there are a $C \subseteq \mathcal{X}$, an $h : \mathcal{X}\times\mathcal{X} \to [1, \infty)$, a probability measure $\nu(\cdot)$ on $\mathcal{X}$, and $\alpha > 1$, $n_0 \in \mathbb{N}$, and $\epsilon > 0$ such that both the minorization condition (3.75) and the bivariate drift condition (3.143) hold. Define

$B_{n_0} = \max\left[1,\ \alpha^{n_0}(1 - \epsilon) \sup_{C\times C} \bar{R}h\right]$   (3.151)

where $\bar{R}$ is the residual joint update kernel from the coupling construction. Then for any joint initial distribution $\mathcal{L}(X_0, X'_0)$ and integers $1 \le j \le k$,

$\left\|\mathcal{L}(X_k) - \mathcal{L}(X'_k)\right\|_{TV} \le (1-\epsilon)^j + \alpha^{-k} B_{n_0}^{\,j-1}\, E\left[h\left(X_0, X'_0\right)\right]$   (3.152)

if $\{X_n\}$ and $\{X'_n\}$ are two copies of the Markov chain started in the joint initial distribution $\mathcal{L}(X_0, X'_0)$. Choosing $j = \lfloor rk \rfloor$ for sufficiently small $r > 0$ in particular gives an explicit quantitative convergence bound that goes to 0 exponentially as $k \to \infty$.

Proof. The proof follows the general outline of [303]. First assume that $n_0 = 1$ in the minorization condition for the small set $C$ and denote $B_{n_0}$ by $B$. Let

$N_k = \#\left\{m : 0 \le m \le k,\ \left(X_m, X'_m\right) \in C\times C\right\}.$   (3.153)

Let $\tau_1, \tau_2, \ldots$ be the times of successive visits of $(X_n, X'_n)$ to $C\times C$. Then

$P\left[X_k \ne X'_k\right] = P\left[X_k \ne X'_k,\ N_{k-1} \ge j\right] + P\left[X_k \ne X'_k,\ N_{k-1} < j\right]$   (3.154)

for any integer $j$ with $1 \le j \le k$. Then

$P\left[X_k \ne X'_k,\ N_{k-1} \ge j\right] \le (1 - \epsilon)^j$   (3.155)

since the event that $j$ coin flips all come up tails contains the event $\{X_k \ne X'_k,\ N_{k-1} \ge j\}$. To bound the second term in the inequality let

$M_k = \alpha^k B^{-N_{k-1}}\, h\left(X_k, X'_k\right) \mathbf{1}\left(X_k \ne X'_k\right)$   (3.156)

where $k = 0, 1, 2, \ldots$ and $N_{-1} = 0$ by convention.
The following lemma provides the necessary bound on $M_k$.

Lemma 3.4. $\{M_k\}$ is a supermartingale:

$E\left[M_{k+1} \mid X_0, \ldots, X_k,\ X'_0, \ldots, X'_k\right] \le M_k.$   (3.157)

Proof. Suppose $(X_k, X'_k) \notin C\times C$. Then $N_k = N_{k-1}$ so

$E\left[M_{k+1} \mid X_0, \ldots, X_k, X'_0, \ldots, X'_k\right] = \alpha^{k+1} B^{-N_{k-1}}\, E\left[h\left(X_{k+1}, X'_{k+1}\right)\mathbf{1}\left(X_{k+1} \ne X'_{k+1}\right) \mid X_k, X'_k\right]$   (3.158)

$\le \alpha^{k+1} B^{-N_{k-1}}\, E\left[h\left(X_{k+1}, X'_{k+1}\right) \mid X_k, X'_k\right] \mathbf{1}\left(X_k \ne X'_k\right)$   (3.159)

$= M_k\,\alpha\,\frac{E\left[h\left(X_{k+1}, X'_{k+1}\right) \mid X_k, X'_k\right]}{h\left(X_k, X'_k\right)}$   (3.160)

$\le M_k$   (3.161)

where the last step uses the bivariate drift condition (3.143).

Now suppose $(X_k, X'_k) \in C\times C$. Then $N_k = N_{k-1} + 1$. If $X_k = X'_k$ the result follows immediately. If $X_k \ne X'_k$ then

$E\left[M_{k+1} \mid X_0, \ldots, X_k, X'_0, \ldots, X'_k\right] = \alpha^{k+1} B^{-N_{k-1}-1}\, E\left[h\left(X_{k+1}, X'_{k+1}\right)\mathbf{1}\left(X_{k+1} \ne X'_{k+1}\right) \mid X_k, X'_k\right]$   (3.162)

$= \alpha^{k+1} B^{-N_{k-1}-1} (1-\epsilon)\, \bar{R}h\left(X_k, X'_k\right)$   (3.163)

$= M_k\,\alpha\, B^{-1} (1-\epsilon)\, \frac{\bar{R}h\left(X_k, X'_k\right)}{h\left(X_k, X'_k\right)}$   (3.164)

$\le M_k$   (3.165)

by the definition (3.151) of $B$. Therefore $\{M_k\}$ is a supermartingale.

Since $B \ge 1$ the proof of the theorem proceeds:

$P\left[X_k \ne X'_k,\ N_{k-1} < j\right] = P\left[X_k \ne X'_k,\ N_{k-1} \le j - 1\right]$   (3.166)

$\le P\left[X_k \ne X'_k,\ B^{-N_{k-1}} \ge B^{-(j-1)}\right]$   (3.167)

$= P\left[\mathbf{1}\left(X_k \ne X'_k\right) B^{-N_{k-1}} \ge B^{-(j-1)}\right]$   (3.168)

$\le B^{\,j-1}\, E\left[\mathbf{1}\left(X_k \ne X'_k\right) B^{-N_{k-1}}\right]$   (3.169)

$\le B^{\,j-1}\, E\left[\mathbf{1}\left(X_k \ne X'_k\right) B^{-N_{k-1}}\, h\left(X_k, X'_k\right)\right]$   (3.170)

$= \alpha^{-k} B^{\,j-1}\, E\left[M_k\right]$   (3.171)

$\le \alpha^{-k} B^{\,j-1}\, E\left[M_0\right]$   (3.172)

$\le \alpha^{-k} B^{\,j-1}\, E\left[h\left(X_0, X'_0\right)\right].$   (3.173)

Combining the two bounds proves the theorem for $n_0 = 1$. For $n_0 > 1$ the main change is to skip counting visits to $C\times C$ where the joint chain could not attempt coupling. These visits correspond to fill-in times when the construction builds $X_{n+1}, \ldots, X_{n+n_0-1}$ in step 2 of the coupling construction. Thus instead define $N_k$ as the number of visits to $C\times C$ excluding the fill-in times and let $\{\tau_i\}$ represent the actual visit times. Then replace $N_{k-1}$ by $N_{k-n_0}$ in (3.154) and in the definition of $M_k$. The supermartingale lemma now holds for $\{M_{t(k)}\}$ where $t(k)$ is the latest visit time $t \le k$ that does not correspond to a fill-in time. With these changes the theorem holds in the more general case $n_0 \ge 1$.
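Doeblin-style bounds of the kind in Theorem 3.6 and Theorem 3.8 are easy to test numerically. For a finite kernel the whole-space minorization constant is $\epsilon = \sum_y \min_x P(x, y)$ with $\nu$ the normalized vector of column minima. The sketch below uses a hypothetical 3-state kernel (illustrative values only) and checks the computed total-variation distance against the bound $(1-\epsilon)^{\lfloor n/n_0\rfloor}$ with $n_0 = 1$:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Whole-space minorization with n0 = 1: P(x, .) >= eps * nu(.) where
# eps = sum_y min_x P(x, y) and nu is the normalized vector of column minima.
col_min = P.min(axis=0)
eps = col_min.sum()                 # here eps = 0.2 + 0.3 + 0.2 = 0.7

# Stationary pdf from the left eigenvector with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

Pn = np.eye(3)
for n in range(1, 8):
    Pn = Pn @ P
    tv = 0.5 * np.abs(Pn - pi).sum(axis=1).max()
    bound = (1 - eps) ** n          # Theorem 3.6 bound with n0 = 1
    print(n, tv, bound)
```

Every printed distance sits below the corresponding bound, as the coupling argument guarantees. The bound is loose for well-mixing kernels but it needs only $\epsilon$ and $n_0$, not the spectrum.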
The proof of Theorem 3.7 begins by defining the function $h(x, y) = \frac{1}{2}\left[V(x) + V(y)\right]$. The following technical lemma allows us to assume a finite bound on $V$ over $C$ in the proof of the geometric-ergodicity theorem.

Theorem 3.9. Without loss of generality

$\sup_{x\in C} V(x) < \infty.$   (3.174)

Specifically, given a small set $C$ and a drift function $V$ satisfying (3.75) and (3.142), we can find a small set $C_0 \subseteq C$ where (3.75) and (3.142) still hold with the same $n_0$, $\epsilon$, and $b$ but with $\lambda$ replaced by some $\lambda_0 < 1$, and such that (3.174) also holds.

Proof. Define $\lambda$ and $b$ as in (3.142). Choose $\delta$ with $0 < \delta < 1 - \lambda$. Let $\lambda_0 = \lambda + \delta$ and define $K = \frac{b}{\delta}$. Set

$C_0 = C \cap \{x \in \mathcal{X} : V(x) \le K\}.$   (3.175)

Then $C_0$ satisfies (3.75) since $C_0 \subseteq C$. We now verify that $C_0$ also satisfies (3.142). For $x \notin C \setminus C_0$ condition (3.142) holds at once with $\lambda_0$ in place of $\lambda$ and $C_0$ in place of $C$ since $\lambda \le \lambda_0$. Now $V(x) > K$ for $x \in C \setminus C_0$. Starting with the original condition (3.142) gives

$(PV)(x) \le \lambda V(x) + b\,\mathbf{1}_C(x)$   (3.176)

$= \lambda_0 V(x) - \delta V(x) + b$   (3.177)

$\le \lambda_0 V(x) - \delta K + b$   (3.178)

$= \lambda_0 V(x)$   (3.179)

since $\delta K = b$.   (3.180)

Thus (3.142) still holds with small set $C_0$.

Thus the proof assumes that (3.174) holds. It uses this with (3.142) to ensure that

$\sup_{(x,y)\in C\times C} \bar{R}h(x, y) < \infty.$   (3.181)

This implies that $B_{n_0}$ in (3.151) is finite. To continue the proof let $d = \inf_{x\notin C} V(x)$. Proposition 3.3 shows that the bivariate drift condition (3.143) holds if $d > \frac{b}{1-\lambda} - 1$. The proof follows immediately in this case by combining Proposition 3.3 with Theorem 3.8. But if $d \le \frac{b}{1-\lambda} - 1$ then the bivariate drift condition may fail so the proof requires an alternate route. Instead the proof enlarges $C$ so that the new value of $d$ satisfies the bound $d > \frac{b}{1-\lambda} - 1$. Aperiodicity ensures that the enlarged set remains small. The theorem will then follow from Proposition 3.3 and Theorem 3.8 as above. Choose $d_0 > \frac{b}{1-\lambda} - 1$. Let $S = \{x \in \mathcal{X} : V(x) \le d_0\}$ and set $C' = C \cup S$. Notice that

$\inf_{x\notin C'} V(x) \ge d_0$   (3.182)

$> \frac{b}{1-\lambda} - 1.$   (3.183)

Also (3.143) continues to hold with $C'$ in place of $C$, and $\sup_{C'\times C'} \bar{R}h < \infty$ since $V$ is bounded on $S$ by construction. Therefore $B_{n_0} < \infty$ still holds. Thus it suffices to show that $C'$ remains a small set.
The following lemma completes the proof through the use of petite sets (Definition 3.5).

Lemma 3.5. $C'$ is a small set.

Proof. The proof follows from two sub-lemmas.

Lemma 3.6. All petite sets are small sets for an aperiodic $\phi$-irreducible Markov chain.

Lemma 3.7. Let $C' = C \cup S$ where $S = \{x \in \mathcal{X} : V(x) \le d\}$ for some $d < \infty$. Then $C'$ is petite.

Proof. Choose $N$ large enough that $r \equiv 1 - \lambda^N d > 0$. Define the first return time to $C$ as

$\tau_C = \inf\{n \ge 1 : X_n \in C\}.$   (3.184)

Define $Z_n = \lambda^{-n} V(X_n)$ and $W_n = Z_{\min(n, \tau_C)}$. The drift condition (3.142) implies that $W_n$ is a supermartingale. This follows since $\tau_C \le n$ gives

$E\left[W_{n+1} \mid X_0, \ldots, X_n\right] = E\left[Z_{\tau_C} \mid X_0, \ldots, X_n\right]$   (3.185)

$= Z_{\tau_C}$   (3.186)

$= W_n$   (3.187)

and $\tau_C > n$ implies $X_n \notin C$ so

$E\left[W_{n+1} \mid X_0, \ldots, X_n\right] = \lambda^{-(n+1)} (PV)(X_n)$   (3.188)

$\le \lambda^{-(n+1)}\,\lambda V(X_n)$   (3.189)

$= \lambda^{-n} V(X_n)$   (3.190)

$= W_n.$   (3.191)

Using the Markov inequality for $x \in S$ with $V \ge 1$ gives

$P\left(\tau_C \ge N \mid X_0 = x\right) = P\left(\lambda^{-\tau_C} \ge \lambda^{-N} \mid X_0 = x\right)$   (3.192)

$\le \lambda^N\, E\left[\lambda^{-\tau_C} \mid X_0 = x\right]$   (3.193)

$\le \lambda^N\, E\left[Z_{\tau_C} \mid X_0 = x\right]$   (3.194)

$\le \lambda^N\, E\left[Z_0 \mid X_0 = x\right]$   (3.195)

$= \lambda^N V(x)$   (3.196)

$\le \lambda^N d.$   (3.197)

Therefore $P\left(\tau_C < N \mid X_0 = x\right) \ge r$. But $C$ is $(n_0, \epsilon, \nu(\cdot))$-small so $P^{n_0}(x, \cdot) \ge \epsilon\,\nu(\cdot)$ for $x \in C$. Thus

$\sum_{i=1+n_0}^{N+n_0} P^i(x, \cdot) \ge r\epsilon\,\nu(\cdot)$   (3.198)

for $x \in S$. Thus

$\sum_{i=n_0}^{N+n_0} P^i(x, \cdot) \ge r\epsilon\,\nu(\cdot)$   (3.199)

for $x \in S \cup C$. Therefore $S \cup C$ is petite.

Lemma 3.6 and Lemma 3.7 show that $C'$ is small. This proves Theorem 3.7.

3.2.5 Central Limit Theorems for Aperiodic $\phi$-irreducible MCMC

Central limit theorems (CLTs) specify conditions for statistical processes to converge in distribution to some fixed univariate density. MCMC central limit theorems follow that show weak convergence to a normal density.

Definition 3.9. Suppose $\{X_n\}$ is a $\phi$-irreducible and aperiodic Markov chain on $\mathcal{X}$ with stationary density $\pi(\cdot)$. A function $h : \mathcal{X} \to \mathbb{R}$ with finite stationary mean $\pi(h) \equiv \int_{x\in\mathcal{X}} h(x)\,\pi(dx)$ satisfies a Central Limit Theorem (or $\sqrt{n}$-CLT) if

$\frac{\sum_{i=1}^n h(X_i) - n\,\pi(h)}{\sqrt{n}}$   (3.200)

converges weakly to the normal density $N(0, \sigma^2)$ for some $\sigma^2 < \infty$.
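Definition 3.9 can be illustrated by simulation on a two-state chain with flip probabilities $a = P(0 \to 1)$ and $b = P(1 \to 0)$ (the values below are hypothetical). For this chain the asymptotic variance of $h(x) = x$ has the known closed form $\mathrm{Var}_\pi(h)\,\frac{1+\rho}{1-\rho}$ with lag-1 autocorrelation $\rho = 1 - a - b$, so the empirical variance of the normalized sums can be checked against it:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.3, 0.2              # hypothetical jump probabilities P(0->1), P(1->0)
rho = 1 - a - b              # lag-1 autocorrelation of this two-state chain
pi1 = a / (a + b)            # stationary probability of state 1
var_h = pi1 * (1 - pi1)      # Var_pi(h) for h(x) = x
sigma2 = var_h * (1 + rho) / (1 - rho)   # closed-form CLT variance

reps, n = 4000, 400
x = (rng.random(reps) < pi1).astype(int)   # start the replicas in stationarity
sums = np.zeros(reps)
for _ in range(n):
    u = rng.random(reps)
    # From state 1 stay with probability 1-b; from state 0 move with probability a.
    x = np.where(x == 1, (u >= b).astype(int), (u < a).astype(int))
    sums += x

# Normalized sums of Definition 3.9: their variance should approach sigma2.
z = (sums - n * pi1) / np.sqrt(n)
print(z.var(), sigma2)
```

The two printed numbers agree to within Monte Carlo error, which shows how serial correlation inflates the CLT variance above the i.i.d. value $\mathrm{Var}_\pi(h)$.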
Chan and Geyer [123, 341] show that the normalized CLT sum has variance

$\sigma^2 = \lim_{n\to\infty} \frac{1}{n}\, E\left[\left(\sum_{i=1}^n \left[h(X_i) - \pi(h)\right]\right)^2\right] = \tau\,\mathrm{Var}_\pi(h)$   (3.201)

where $\tau = \sum_{k\in\mathbb{Z}} \mathrm{Corr}\left(h(X_0), h(X_k)\right)$ is the integrated autocorrelation time.

Two theorems give CLT results based on the ergodicity of the chain. The following theorem (proved by [60] in Corollary 4.2(ii)) shows that uniformly ergodic chains with finite second moments have CLTs.

Theorem 3.10. Suppose $\{X_n\}$ is a uniformly ergodic Markov chain with stationary density $\pi(\cdot)$. A $\sqrt{n}$-CLT holds for $h$ whenever $\pi\left(h^2\right) < \infty$.

A slightly stronger assumption of finite moments of order $2 + \delta$ yields CLTs for chains with only the weaker geometric convergence. The following theorem formalizes this requirement without proof. I then present a full proof of the subsequent theorem which assumes reversibility to give a similar result.

Theorem 3.11. Suppose $\{X_n\}$ is a geometrically ergodic Markov chain with stationary density $\pi(\cdot)$. A $\sqrt{n}$-CLT holds for $h$ whenever $\pi\left(|h|^{2+\delta}\right) < \infty$ for some $\delta > 0$.

Assuming reversibility allows a slight weakening of the moment assumptions. The proof makes use of a corollary to the following theorem which relies on the existence of a solution to the Poisson equation

$h - \pi(h) = g - Pg$   (3.202)

and the following version of the martingale central limit theorem.

Theorem 3.12. Let $\{Z_n\}$ be a stationary ergodic sequence with $E\left[Z_n \mid Z_1, \ldots, Z_{n-1}\right] = 0$ and $E\left[Z_n^2\right] < \infty$. Then $\frac{\sum_{i=1}^n Z_i}{\sqrt{n}}$ converges weakly to a normal density $N(0, \sigma^2)$ for some $\sigma^2 < \infty$.

Theorem 3.13. Let $P$ be a transition kernel for an aperiodic $\phi$-irreducible Markov chain on state space $\mathcal{X}$ with stationary density $\pi(\cdot)$. Assume the Markov chain starts at the stationary density ($X_0 \sim \pi(\cdot)$). Let $h : \mathcal{X} \to \mathbb{R}$ with $\pi\left(h^2\right) < \infty$. Then $h$ satisfies a $\sqrt{n}$-CLT if there exists $g : \mathcal{X} \to \mathbb{R}$ with $\pi\left(g^2\right) < \infty$ that solves the Poisson equation

$h - \pi(h) = g - Pg.$   (3.203)

Proof. Let $Z_n = g(X_n) - Pg(X_{n-1})$. $\{Z_n\}$ is stationary since $X_0 \sim \pi(\cdot)$.
$\{Z_n\}$ is also ergodic since the Markov chain converges asymptotically. Note $E\left[Z_n^2\right] \le 4\,\pi\left(g^2\right) < \infty$. Also

$E\left[g(X_n) - Pg(X_{n-1}) \mid X_0, \ldots, X_{n-1}\right] = E\left[g(X_n) \mid X_{n-1}\right] - Pg(X_{n-1})$   (3.204)

$= Pg(X_{n-1}) - Pg(X_{n-1})$   (3.205)

$= 0.$   (3.206)

Thus $E\left[Z_n \mid Z_1, \ldots, Z_{n-1}\right] = 0$ since $Z_1, \ldots, Z_{n-1} \in \sigma(X_0, \ldots, X_{n-1})$. Therefore

$\frac{\sum_{i=1}^n Z_i}{\sqrt{n}}$   (3.207)

converges weakly to $N(0, \sigma^2)$ by the martingale central limit theorem. The result follows by expanding

$\frac{\sum_{i=1}^n \left[h(X_i) - \pi(h)\right]}{\sqrt{n}} = \frac{\sum_{i=1}^n \left[g(X_i) - Pg(X_i)\right]}{\sqrt{n}}$   (3.208)

$= \frac{\sum_{i=1}^n \left[g(X_i) - Pg(X_{i-1})\right] + Pg(X_0) - Pg(X_n)}{\sqrt{n}}$   (3.209)

$= \frac{\sum_{i=1}^n Z_i + Pg(X_0) - Pg(X_n)}{\sqrt{n}}$   (3.210)

since $\frac{Pg(X_0)}{\sqrt{n}}$ and $\frac{Pg(X_n)}{\sqrt{n}}$ both converge to 0 in probability as $n \to \infty$.   (3.211)

Corollary 3.2. $h$ satisfies a $\sqrt{n}$-CLT if

$\sum_{k=0}^{\infty} \sqrt{\pi\left(\left[P^k\left(h - \pi(h)\right)\right]^2\right)} < \infty.$   (3.212)

Proof. Let

$g_k(x) = P^k h(x) - \pi(h)$   (3.213)

$= P^k\left[h - \pi(h)\right](x)$   (3.214)

with $P^0 h(x) = h(x)$ by convention. Define $g(x) = \sum_{k=0}^\infty g_k(x)$. Then

$g(x) - Pg(x) = \sum_{k=0}^\infty g_k(x) - \sum_{k=0}^\infty Pg_k(x)$   (3.215)

$= \sum_{k=0}^\infty g_k(x) - \sum_{k=1}^\infty g_k(x)$   (3.216)

$= g_0(x)$   (3.217)

$= P^0 h(x) - \pi(h)$   (3.218)

$= h(x) - \pi(h).$   (3.219)

The $L^2(\pi)$ norm satisfies the triangle inequality so

$\sqrt{\pi\left(g^2\right)} \le \sum_{k=0}^\infty \sqrt{\pi\left(g_k^2\right)}.$   (3.220)

Therefore $\pi\left(g^2\right) < \infty$ since $\sum_{k=0}^\infty \sqrt{\pi\left(g_k^2\right)} < \infty$.

The following theorem summarizes the $\sqrt{n}$-CLT convergence for geometrically ergodic time-reversible Markov chains.

Theorem 3.14. Suppose $\{X_n\}$ is a geometrically ergodic and reversible Markov chain with stationary density $\pi(\cdot)$. A $\sqrt{n}$-CLT holds for $h$ whenever $\pi\left(h^2\right) < \infty$.

Proof. Let $\|P\|_{L^2(\pi)}$ be the usual $L^2(\pi)$ operator norm for $P$ restricted to the functionals $f$ with $\pi(f) = 0$ and $\pi\left(f^2\right) < \infty$:

$\|P\|_{L^2(\pi)} = \sup_{\pi(f)=0,\ \pi(f^2)=1} \sqrt{\pi\left(\left[Pf\right]^2\right)}$   (3.221)

$= \sup_{\pi(f)=0,\ \pi(f^2)=1} \left[\int_{x\in\mathcal{X}} \left(\int_{y\in\mathcal{X}} f(y)\,P(x, dy)\right)^2 \pi(dx)\right]^{1/2}.$   (3.222)

Theorem 2 of [298] shows that reversible Markov chains are geometrically ergodic if and only if they satisfy $\|P\|_{L^2(\pi)} < 1$. This means that there exists a $\beta < 1$ with $\pi\left(\left[Pf\right]^2\right) \le \beta^2\,\pi\left(f^2\right)$ whenever $\pi(f) = 0$ and $\pi\left(f^2\right) < \infty$. Also

$\left\|P^k\right\|_{L^2(\pi)} = \|P\|_{L^2(\pi)}^k$   (3.223)

since $P$ is self-adjoint in $L^2(\pi)$ because it is reversible.
Thus

$\pi\left(\left[P^k f\right]^2\right) \le \beta^{2k}\,\pi\left(f^2\right).$   (3.224)

Now let $g_k = P^k h - \pi(h)$ as in Corollary 3.2. Therefore

$\pi\left(g_k^2\right) \le \beta^{2k}\,\pi\left(\left[h - \pi(h)\right]^2\right)$   (3.225)

so

$\sum_{k=0}^\infty \sqrt{\pi\left(g_k^2\right)} \le \sqrt{\pi\left(\left[h - \pi(h)\right]^2\right)}\, \sum_{k=0}^\infty \beta^k$   (3.226)

$= \frac{\sqrt{\pi\left(\left[h - \pi(h)\right]^2\right)}}{1 - \beta}$   (3.227)

$< \infty.$   (3.228)

And the proof follows from Corollary 3.2.

The following example [297] shows that MCMC does not guarantee a CLT for all chains.

Example 3.1. Consider a reversible Markov chain with holding probability $r(x) = P\left(X_{n+1} = X_n \mid X_n = x\right)$ and stationary density $\pi(\cdot)$. Then a $\sqrt{n}$-CLT does not hold for $h$ if

$\lim_{n\to\infty} n\,\pi\left(\left[h - \pi(h)\right]^2 r^n\right) = \infty.$   (3.229)

Proof. Using the CLT definition

$\sigma^2 = \lim_{n\to\infty} \frac{1}{n}\, E\left[\left(\sum_{i=1}^n \left[h(X_i) - \pi(h)\right]\right)^2\right]$   (3.230)

$\ge \lim_{n\to\infty} \frac{1}{n}\, E\left[\left(\sum_{i=1}^n \left[h(X_i) - \pi(h)\right]\right)^2 \mathbf{1}\left(X_0 = X_1 = \cdots = X_n\right)\right]$   (3.231)

$= \lim_{n\to\infty} \frac{1}{n}\, E\left[\left(n\left[h(X_0) - \pi(h)\right]\right)^2 r(X_0)^n\right]$   (3.232)

$= \lim_{n\to\infty} n\,\pi\left(\left[h - \pi(h)\right]^2 r^n\right)$   (3.233)

$= \infty.$   (3.234)

Diagnostics

Most MCMC implementations address convergence by using diagnostic tools that analyze the output of the chain to determine whether it is safe to stop sampling [69]. The next section gives a technical overview and theoretical basis for the most common MCMC convergence diagnostics. The Gelman and Rubin diagnostics and the Raftery and Lewis diagnostics are the most popular because of their ease of application and the availability of tools. Despite the range of tests it is not possible to say with certainty that a particular sample from a finite MCMC run is representative of the true target density.

Gelman and Rubin

Gelman and Rubin's [118] method is a two-step process based on normal approximations to the exact Bayesian posterior.

1. Sample M points from an overdispersed estimate of the target density $\pi(x)$ before starting the chain. The sampling should focus on areas of high probability and M will likely increase with the complexity of $\pi$.

2. Transform the scalars of interest to approximate normality.
It then runs a Gibbs sampler chain using each sample from step 1 as an initial condition for a number of steps (say $2n$) for each scalar of interest. The diagnostic uses the last $n$ steps to compute a parallel estimate of the density of the scalar quantity as a t-distribution whose scale parameter is a function of the between-chain variance and the within-chain variance. The procedure monitors convergence through an estimate of the factor by which continued sampling would shrink the scale:

$\sqrt{\hat{R}} = \sqrt{\left(\frac{n-1}{n} + \frac{m+1}{mn}\,\frac{B}{W}\right)\frac{df}{df - 2}}$   (3.235)

where $B$ is the variance between the means from the $m$ parallel chains, $W$ is the average of the $m$ within-chain variances, and $df$ is the degrees of freedom of the approximating t-density. The idea is that the between-means variance $B$ will initially be much larger than the within-chain variance $W$ because of the overdispersed initial conditions. Gelman and Rubin propose running additional iterations for the parallel chains and repeating the process to improve the diagnostic. The diagnostic applies directly to the Gibbs sampler but the method generalizes to any MCMC sampler. The major objective of the method is to reduce estimate bias. Gelman and Rubin diagnose convergence through the closeness of the shrink factor to 1, which implies that the within-chain variance dominates the between-chain variance. They interpret this to mean that all chains have escaped the influence of their starting points and have fully sampled the target density. Their chief claim is that one can determine this property only with multiple chains started from different initial values. The method suffers from several weaknesses. First it relies on the user finding and sampling a suitable starting distribution. The normal approximation is also a dubious option since the Gibbs sampler most often finds use precisely when the normal approximation to the posterior distribution is inadequate.
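The shrink-factor logic is easy to sketch in code. The function below is a simplified $\sqrt{\hat{R}}$ that drops the t-density degrees-of-freedom factor in (3.235), and a Gaussian AR(1) chain stands in for a Gibbs sampler; all numbers are hypothetical choices for illustration:

```python
import numpy as np

def psrf(chains):
    """Simplified sqrt(R-hat) for an (m, n) array of m parallel chains.
    Omits the t-density degrees-of-freedom factor in (3.235)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
m, n = 8, 2000
chains = np.empty((m, n))
x = rng.normal(0.0, 10.0, size=m)    # overdispersed starting points
for t in range(n):
    # AR(1) update whose stationary density is N(0, 1); a stand-in sampler.
    x = 0.9 * x + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=m)
    chains[:, t] = x

early = psrf(chains[:, :50])         # starts still dominate: noticeably above 1
late = psrf(chains[:, n // 2:])      # after burn-in: close to 1
print(early, late)
```

The early window shows an inflated factor because the between-chain spread still reflects the overdispersed starts; the second half of the runs gives a factor near 1, the Gelman-Rubin signal to stop.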
Critics of multiple chains also cite the fact that the algorithm discards a substantial number of early iterations from each chain as a key limiting factor. Proponents of single long chains instead claim that the analysis should benchmark convergence against a chain run for $M \cdot 2n$ time steps rather than against a single $2n$-step chain.

Raftery and Lewis

Raftery and Lewis propose an alternate single-chain method [288] that intends to detect convergence and provide a means to bound the estimate variance. The method runs a single-chain Gibbs sampler for $N_{\min}$ steps where $N_{\min}$ is the number of iterations needed to obtain the desired precision under the assumption of independent samples. Raftery and Lewis then apply a statistical process to the Gibbs chain that requires the user to input the desired estimate quantile $q$, the desired accuracy $r$, a probability bound $s$ of attaining the specified accuracy, and a convergence tolerance $\epsilon$. The process reports the number of iterations to run $n_{\mathrm{prec}}$, the number of initial "burn-in" steps to discard $n_{\mathrm{burn}}$, and a sub-sampling factor $k$ that specifies the rate at which to include samples in parameter estimates.

The Raftery and Lewis process applies a two-state Markov chain and uses the standard sample-size formulas for binomial variance. The binary-state Markov chain $\{Z\}$ indicates at each iteration whether the computed value of interest lies below a particular cutoff or not. It extracts the output parameters as follows: $k$ is the smallest skip-interval so that the $k$-subsampled chain $\{Z^{(k)}\}$ approximates a Markov chain; $n_{\mathrm{burn}}$ is the number of iterations for $\{Z^{(k)}\}$ to approach within the tolerance $\epsilon$ of the target density. The relative values of the parameters can suggest slow convergence (large $n_{\mathrm{burn}}$) or strong autocorrelations ($n_{\mathrm{prec}} \gg N_{\min}$ or $k > 1$) within the chain. Raftery and Lewis point out that any convergence diagnostic should take into account the estimation of quantiles because quantiles are at the heart of density estimation.
Quantiles also allow robust estimates of the center and spread of a distribution. Critics note that the binary chain depends heavily on the stochastic input and that the "adequate" number of samples depends heavily on the initial conditions.

Geweke

Geweke applied spectral analysis to assess Gibbs sampler convergence [122]. The method applies specifically to estimates of the mean of a function $g$ of the estimated parameters $\theta$. It regards the estimates $g(\theta_j)$ after each Gibbs iteration as a time series. Geweke argues that the nature of the MCMC process and reasonable limitations on $g$ lead to the existence of a spectral density $S_g(\omega)$ for this time series that is continuous at $\omega = 0$. Under this assumption the asymptotic variance of the Gibbs estimate

$\bar{g}_n = \frac{\sum_{i=1}^n g(\theta_i)}{n}$   (3.236)
The method requires a symmetrized Gibbs sampler algorithm where each iteration performs Gibbs updates in a predetermined order with a following iteration that repeats the update in the reverse order. The diagnostic defines a function space and constructs an inner production where the transition operator induced by the kernel of the Gibbs sampler is self-adjoint. The method rests on the theoretical result that k f n fk n!1 ! 0 (3.238) wherekk is the norm induced by the inner product, f n is the pdf of values generated at the n Gibbs step and f is the target density. 98 The diagnostic J n = 1 m(m 1) X l,p k ( 1 2 ) l ; (2n1) p f (2n1) p (3.239) requires m parallel reversible Gibbs sampler chains with the same initial condition 0 where l and p are replications of the sampler ( 1 2 ) l is the value for after the “forward” half of the first iterator of the lth chain, (n) p is the value for after the nth complete iteration of the pth chain, and k is the kernel of the “backward” half of the reversible sampler. Roberts proposed evaluating monotonic convergence graphically. In the usual Gibbs sampler the algorithm does not generally have access to the normalizing constant for the density f so the value the diagnostic converges to is also unknown. So the diagnostic indicates watching only for stabilization of values since it does not estimate k f n fk+ 1. Roberts improved the diagnostic in a later paper by forming lp n = k (0) l ; (n) p f (n) p (3.240) with l n = ll n . He then defined the within-chain dependence term D n = 1 m P m l=1 l n and the between-chain interaction term I n = 1 m(m1) P l,p lp n . He showed that E (I n ) E (D n ) and Var(I n ) Var(D n ) and most importantly E (I n ) = E (D n ) at convergence. Thus he modified the method to choose m = 10 or 20 dierent starting points and monitoring I n and D n until both series are stationary with similar locations. 
The method still required the use of a reversible sampler but did note several reduced assumptions that would preserve the required self-adjoin Gibbs transition kernel. The next section reviews aspects Markov chains that underlie the MCMC algorithm [296]. It then introduces two important MCMC special cases: the Metropolis-Hastings algorithm and the Gibb’s Sampler. The section presents Hastings’ [146] generalization of the MCMC Metropolis algo- rithm now called Metropolis-Hastings. This starts with the classical Metropolis algo- rithm [239]. 99 3.2.6 Noise Benefits in MCMC Algorithms This chapter presents results that show noise benefit within the classic Metropolis- Hastings MCMC algorithm. MCMC algorithms can estimate solutions to numerical problems with unmanageably many degrees of freedom and also to combinatorial prob- lems of factorial size. But A key weakness of MCMC algorithms are their slow con- vergence speeds in some applications. The tremendous size of the problems leads to a sample space that take may require impractical time constraints to fully explore. Depen- dence on initial conditions and serial correlation between samples can exacerbate this weakness. Judicious noise injection can increase the convergence speed by ensuring that noisy samples are on average are more like the target density than noiseless sam- ples. Theorem 3.15 states a sucient condition for an MCMC noise benefit. Like the recently proposed expectation-maximization (EM) noise benefit the MCMC noise benefit does not involve a signal threshold. This is unlike almost all previously described SR noise benefits. I derive an restatement of the theorem in corollary 3.4. The corollary gathers the respective noise terms to one side of the inequality and the condition reduces to a simple inequality between the noisy and noiseless sample jump functions. Further corollaries show that the noise benefit extends beyond additive noise to more general functions of the current state and noise. 
I present two specific applications of the theorem under two common jump functions: Gaussian and Cauchy. I show that the N-MCMC inequality reduces to a simple algebraic condition between the noise and jump density parameters.

3.3 Noisy MCMC Theorems

The Noisy MCMC theorems stem from a simple intuition: find a noise sample $n$ that makes the next choice of location more probable. Define the usual jump function $Q(y|x)$ as the probability that the system moves or jumps to state $y$ if it is in state $x$. The Metropolis algorithm requires a symmetric jump function:
$$Q(y|x) = Q(x|y). \tag{3.241}$$
Neither the Metropolis-Hastings algorithm nor the Noisy MCMC theorems require symmetry. But all MCMC algorithms require that the Markov chain is reversible (the chain satisfies detailed balance):
$$Q(y|x)\,\pi(x) = Q(x|y)\,\pi(y) \tag{3.242}$$
for all $x$ and $y$. Now consider a noise sample $n$ that makes the jump more probable:
$$Q(y|x+n) \ge Q(y|x). \tag{3.243}$$
This is equivalent to
$$\ln \frac{Q(y|x+n)}{Q(y|x)} \ge 0. \tag{3.244}$$
Replace the denominator jump term with its symmetric dual $Q(x|y)$. Then eliminate this term with detailed balance and rearrange to get the key inequality for the noise boost:
$$\ln \frac{Q(y|x+n)}{Q(y|x)} \ge \ln \frac{\pi(x)}{\pi(y)}. \tag{3.245}$$
Taking expectations over the noise random variable $N$ and over $X$ gives a simple symmetric version of the sufficient condition in the Noisy MCMC Theorem for a noise benefit:
$$E_{N,X}\!\left[\ln \frac{Q(y|x+N)}{Q(y|x)}\right] \ge E_X\!\left[\ln \frac{\pi(x)}{\pi(y)}\right]. \tag{3.246}$$
This average noise benefit appears by comparing two relative-entropy pseudo-distances. The relative entropy (or Kullback-Leibler divergence)
$$D\big(f_0(x)\,\|\,f_1(x)\big) = \int_X \ln\!\left(\frac{f_0(x)}{f_1(x)}\right) f_0(x)\,dx \tag{3.247}$$
$$= E_{f_0}\big[\ln f_0(x) - \ln f_1(x)\big] \tag{3.248}$$
is a non-symmetric measure (pseudo-distance) of the difference between two pdfs $f_0(x)$ and $f_1(x)$ on $X$. It measures the information lost when using the density $f_1(x)$ to describe samples from $f_0(x)$.
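The relative entropy in (3.247) is easy to approximate numerically. The following sketch (an illustration, not part of the thesis's experiments) integrates $f_0 \ln(f_0/f_1)$ on a grid for pairs of Gaussians and checks the closed form $D(N(\mu_0,\sigma^2)\,\|\,N(\mu_1,\sigma^2)) = (\mu_0-\mu_1)^2/(2\sigma^2)$; the helper names and grid limits are assumptions of the sketch:

```python
import math

def gaussian_pdf(mu, sigma):
    """Return a function x -> N(mu, sigma^2) density."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return lambda x: c * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def kl_divergence(f0, f1, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule approximation of D(f0 || f1) = integral of f0 ln(f0/f1) dx."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p, q = f0(x), f1(x)
        total += p * math.log(p / q) * dx
    return total

# Mean shift with equal unit variances: closed form gives (0 - 1)^2 / 2 = 0.5
d_shift = kl_divergence(gaussian_pdf(0.0, 1.0), gaussian_pdf(1.0, 1.0))

# Unequal variances: D is a pseudo-distance, so the two argument orders differ
d_ab = kl_divergence(gaussian_pdf(0.0, 1.0), gaussian_pdf(0.0, 2.0))
d_ba = kl_divergence(gaussian_pdf(0.0, 2.0), gaussian_pdf(0.0, 1.0))
```

The asymmetry of `d_ab` and `d_ba` is exactly the non-symmetric character the text describes.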
The relative-entropy pseudo-distance between the target pdf and the sample pdf in the noiseless context is
$$d_t = D\big(\pi(x)\,\|\,Q(y|x_t)\big). \tag{3.249}$$
And the relative-entropy pseudo-distance between the target pdf and the noisy sample pdf in the N-MCMC context is
$$d_t(N) = D\big(\pi(x)\,\|\,Q(y|x_t + N)\big). \tag{3.250}$$
The average noise benefit occurs when the target density $\pi(x)$ is closer in relative entropy to the noisy sample pdf $Q(y|x_t + N)$ than to the noiseless sample pdf $Q(y|x_t)$. So the noise benefit occurs when
$$d_t(N) \le d_t \tag{3.251}$$
on average.

3.4 Noisy Markov Chain Monte Carlo

The Noisy MCMC Theorem follows. It provides a sufficient condition for injected noise to induce a benefit. It uses the Metropolis-Hastings setup of drawing samples from a target density $\pi(x)$. It constructs a Markov random walk using the current state $x_t$ as the basis for the next sample draw via a jump function $Q(y|x_t)$. The theorem injects noise by perturbing $x_t$ prior to drawing the sample $x_{t+1}$.

Theorem 3.15 (Noisy Markov Chain Monte Carlo Theorem). Suppose that $Q(y|x_t)$ is a Metropolis-Hastings jump pdf for time $t$ and that it satisfies detailed balance $\pi(x_{t+1})\,Q(y|x_t) = \pi(x_t)\,Q(y|x_{t+1})$ for the target equilibrium pdf $\pi(x)$. Then an MCMC noise benefit $d_t(N) \le d_t$ occurs on average if
$$E_{N,X}\!\left[\ln \frac{Q(y|x_t + N)}{Q(y|x_t)}\right] \ge E_N\!\left[\ln \frac{\pi(x_t + N)}{\pi(x_t)}\right] \tag{3.252}$$

(a) no noise (b) noise $\sim U[0,1]$

Figure 3.1: The Noisy MCMC Theorem shows that noise can reduce the relative-entropy pseudo-distance between the sample pdf and the target pdf. The figure illustrates the key intuition behind the Noisy Markov Chain Monte Carlo Theorem (N-MCMC): noise boosts sampling by forcing the sample pdf to more closely resemble the target pdf. The plots show the Kullback-Leibler (KL) divergence between the MCMC sampling distribution and the target density in standard (no-noise) MCMC and in noisy MCMC. Both simulations drew MCMC samples from a standard normal density. They used a standard normal jump pdf and assumed $x_t = 1$.
(a) The KL divergence is 0.535 in the standard (no-noise) MCMC algorithm. (b) Noise decreases the KL divergence to 0.121. This shows that the noisy MCMC samples are better estimates of samples from the target pdf.

where $d_t = D\big(\pi(x)\,\|\,Q(y|x_t)\big)$, $d_t(N) = D\big(\pi(x)\,\|\,Q(y|x_t + N)\big)$, $N \sim f_{N|x_t}(n|x_t)$ is noise that may depend on $x_t$, and $D$ is the relative-entropy pseudo-distance $D(P\,\|\,Q) = \int_X p(x)\ln\frac{p(x)}{q(x)}\,dx$.

Proof.
$$d_t = \int_X \ln\!\left(\frac{\pi(x)}{Q(y|x_t)}\right)\pi(x)\,dx \tag{3.253}$$
$$= E_X\!\left[\ln \frac{\pi(x)}{Q(y|x_t)}\right] \tag{3.254}$$
$$d_t(N) = \int_X \ln\!\left(\frac{\pi(x)}{Q(y|x_t+N)}\right)\pi(x)\,dx \tag{3.255}$$
$$= E_X\!\left[\ln \frac{\pi(x)}{Q(y|x_t+N)}\right] \tag{3.256}$$
Take expectations over $N$: $E_N[d_t] = d_t$ since $d_t$ does not depend on $N$. Then $d_t(N) \le d_t$ on average guarantees that a noise benefit occurs: $E_N[d_t(N)] \le d_t$. Suppose
$$E_N\!\left[\ln \frac{\pi(x_t+N)}{\pi(x_t)}\right] \le E_{N,X}\!\left[\ln \frac{Q(y|x_t+N)}{Q(y|x_t)}\right]. \tag{3.257}$$
Expand the expectations:
$$\int_N \ln\!\left(\frac{\pi(x_t+n)}{\pi(x_t)}\right) f_{N|x_t}(n|x_t)\,dn \le \iint_{N,X} \ln\!\left(\frac{Q(y|x_t+n)}{Q(y|x_t)}\right)\pi(x)\, f_{N|x_t}(n|x_t)\,dx\,dn \tag{3.258}$$
and then split the log ratios (writing $f$ for $f_{N|x_t}(n|x_t)$):
$$\int_N \ln \pi(x_t+n)\, f\,dn - \int_N \ln \pi(x_t)\, f\,dn \le \iint_{N,X} \ln Q(y|x_t+n)\,\pi(x) f\,dx\,dn - \iint_{N,X} \ln Q(y|x_t)\,\pi(x) f\,dx\,dn. \tag{3.259}$$
Reorder the terms:
$$\int_N \ln \pi(x_t+n)\, f\,dn - \iint_{N,X} \ln Q(y|x_t+n)\,\pi(x) f\,dx\,dn \le \int_N \ln \pi(x_t)\, f\,dn - \iint_{N,X} \ln Q(y|x_t)\,\pi(x) f\,dx\,dn. \tag{3.260}$$
Multiply the single integrals by $\int_X \pi(x)\,dx = 1$ so that every term is a double integral:
$$\iint_{N,X} \ln \pi(x_t+n)\,\pi(x) f\,dx\,dn - \iint_{N,X} \ln Q(y|x_t+n)\,\pi(x) f\,dx\,dn \le \iint_{N,X} \ln \pi(x_t)\,\pi(x) f\,dx\,dn - \iint_{N,X} \ln Q(y|x_t)\,\pi(x) f\,dx\,dn. \tag{3.261}$$
Combine the logs and use $\int_N f\,dn = 1$ on the right-hand side:
$$\iint_{N,X} \ln\!\left(\frac{\pi(x_t+n)}{Q(y|x_t+n)}\right)\pi(x)\, f\,dx\,dn \le \int_X \ln\!\left(\frac{\pi(x_t)}{Q(y|x_t)}\right)\pi(x)\,dx. \tag{3.262}$$
Apply the MCMC detailed balance condition $\pi(x_{t+1})\,Q(y|x_t) = \pi(x_t)\,Q(y|x_{t+1})$ to rewrite each denominator:
$$\iint_{N,X} \ln\!\left(\frac{\pi(x_t+n)}{\frac{\pi(x_t+n)\,Q(y|x_t+n)}{\pi(x)}}\right)\pi(x)\, f\,dx\,dn \le \int_X \ln\!\left(\frac{\pi(x_t)}{\frac{\pi(x_t)\,Q(y|x_t)}{\pi(x)}}\right)\pi(x)\,dx. \tag{3.263}$$
Simplifying gives
$$\iint_{N,X} \ln\!\left(\frac{\pi(x)}{Q(y|x_t+n)}\right)\pi(x)\, f\,dx\,dn \le \int_X \ln\!\left(\frac{\pi(x)}{Q(y|x_t)}\right)\pi(x)\,dx. \tag{3.264}$$
Then
$$\int_N \left[\int_X \ln\!\left(\frac{\pi(x)}{Q(y|x_t+n)}\right)\pi(x)\,dx\right] f_{N|x_t}(n|x_t)\,dn \le \int_X \ln\!\left(\frac{\pi(x)}{Q(y|x_t)}\right)\pi(x)\,dx \tag{3.265}$$
if and only if
$$\int_N D\big(\pi(x)\,\|\,Q(y|x_t+n)\big)\, f_{N|x_t}(n|x_t)\,dn \le D\big(\pi(x)\,\|\,Q(y|x_t)\big). \tag{3.266}$$
Therefore
$$\int_N d_t(N)\, f_{N|x_t}(n|x_t)\,dn \le d_t \tag{3.267}$$
and thus
$$E_N[d_t(N)] \le d_t. \tag{3.268}$$

The theorem shows how noise can speed the Markov chain convergence by forcing the sampling distribution of $x_{t+1}$ to better match the target density $\pi(x)$. Over successive time steps this yields a sample sequence in which each sample is better on average. This reduces the correlation between samples and shortens burn-in. For a fixed number of time steps the theorem provides a secondary noise benefit: while the algorithm converges faster on average, it also uses fewer samples to obtain a density estimate with the same predictive power. This implies that noisy applications that satisfy the theorem will provide better estimates of the target density than the corresponding noiseless applications.

3.4.1 Generalized Noise Benefits Extend the Additive N-MCMC Result

The relative-entropy noise benefit allows a more general inclusion of noise into the model by replacing the additive noise $x_t + N$ with an arbitrary noise injection function $g(x_t, N)$. This extends the result to multiplicative noise. The following corollary summarizes the generalized noise-benefit condition.

Corollary 3.3 (Generalized Noisy MCMC). Suppose that $Q(y|x_t)$ is a Metropolis-Hastings jump pdf for time $t$ and that it satisfies detailed balance with $\pi(x)$. Then an MCMC noise benefit occurs on average if
$$E_{N,X}\!\left[\ln \frac{Q(y|g(x_t,N))}{Q(y|x_t)}\right] \ge E_N\!\left[\ln \frac{\pi(g(x_t,N))}{\pi(x_t)}\right]. \tag{3.269}$$
Proof. Replace $x_t + N$ with $g(x_t, N)$ in Equation (3.255). The result follows exactly as in the proof of Theorem 3.15.
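A quick numerical check of Theorem 3.15's conclusion: with a standard normal target $\pi = N(0,1)$, a unit-variance Gaussian jump pdf, and current state $x_t = -1$, uniform noise $N \sim U[0,1]$ pushes the jump center toward the target mean, so the average noisy pseudo-distance drops below the noiseless one. The closed form $D(N(0,1)\,\|\,N(m,1)) = m^2/2$ makes the comparison exact. The setup (sign of $x_t$, noise range) is an illustrative assumption of this sketch, not the configuration of the thesis's Figure 3.1:

```python
import random

def kl_gauss_means(mu0, mu1):
    # D( N(mu0,1) || N(mu1,1) ) = (mu0 - mu1)^2 / 2 for equal unit variances
    return 0.5 * (mu0 - mu1) ** 2

random.seed(0)
x_t = -1.0
d_noiseless = kl_gauss_means(0.0, x_t)          # d_t = 0.5

# E_N[ d_t(N) ] with N ~ U[0,1]: Monte Carlo average of D(pi || Q(.| x_t + N)).
# Closed form here is E[(x_t + N)^2] / 2 = 1/6, which is below d_t = 1/2.
trials = 200000
d_noisy = sum(kl_gauss_means(0.0, x_t + random.random())
              for _ in range(trials)) / trials
```

The inequality `d_noisy < d_noiseless` is exactly the average benefit $E_N[d_t(N)] \le d_t$ for this toy setup.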
3.4.2 Noisy Metropolis-Hastings with Gaussian or Cauchy Jump Densities

The sequential nature of MCMC methods obscures much of the theoretical analysis of the algorithm. Analysis depends on a number of parameters that include the character of the jump distribution, measures of burn-in, and convergence diagnostics. Selecting a jump distribution is perhaps the most crucial step in implementing an MCMC algorithm. The careful link through detailed balance between $Q(y|x)$ and $\pi(x)$ means that exotic jump densities are generally out of the question. While such jump distributions may provide some practical benefit, their added opaqueness and potential weakening of MCMC convergence guarantees limit their actual use. Thus applications often employ one of several standard MCMC implementations.

This section considers two specialized jump densities: $Q(y|x_t) \sim N(x_t, \sigma^2)$ and $Q(y|x_t) \sim \mathrm{Cauchy}(x_t, d)$. The first corollary derives an alternate representation of the MCMC noise-benefit condition. Two corollaries follow that reduce the N-MCMC sufficient condition for a Gaussian jump pdf. The first uses additive noise $\tilde{x}_t = x_t + n$ as in Theorem 3.15. The second shows how to apply the Generalized Noisy MCMC corollary (Corollary 3.3) to the case of multiplicative Gaussian noise $\tilde{x}_t = n x_t$. The final result in this section derives an algebraic condition that implies a noise benefit in the case of a Cauchy jump pdf corresponding to fast MCMC.

The auxiliary form of the first corollary frames the noise-benefit condition as an inequality between the noisy jump function $Q(y|x_t + n)$ and the noiseless jump function $Q(y|x_t)$. It exploits the fact that the right-hand side of (3.271) is constant given the current state $x_t$:
$$E_N\!\left[\ln \frac{\pi(x_t + N)}{\pi(x_t)}\right] = A. \tag{3.270}$$

Corollary 3.4 (Dominated N-MCMC Density Condition).
The N-MCMC noise-benefit condition
$$E_{N,X}\!\left[\ln \frac{Q(y|x_t+N)}{Q(y|x_t)}\right] \ge E_N\!\left[\ln \frac{\pi(x_t+N)}{\pi(x_t)}\right] \tag{3.271}$$
holds if
$$Q(y|x_t+n) \ge e^{A}\, Q(y|x_t) \tag{3.272}$$
for almost all $x$ and $n$, where
$$A = E_N\!\left[\ln \frac{\pi(x_t+N)}{\pi(x_t)}\right]. \tag{3.273}$$
Proof. The following inequalities need hold only for almost all $x$ and $n$:
$$Q(y|x_t+n) \ge e^{A}\, Q(y|x_t) \tag{3.274}$$
if and only if
$$\ln Q(y|x_t+n) \ge A + \ln Q(y|x_t) \tag{3.275}$$
iff
$$\ln Q(y|x_t+n) - \ln Q(y|x_t) \ge A \tag{3.276}$$
iff
$$\ln \frac{Q(y|x_t+n)}{Q(y|x_t)} \ge A. \tag{3.277}$$
Thus
$$E_{N,X}\!\left[\ln \frac{Q(y|x_t+N)}{Q(y|x_t)}\right] \ge E_N\!\left[\ln \frac{\pi(x_t+N)}{\pi(x_t)}\right]. \tag{3.278}$$

This simplified sufficient condition leads to three specialized jump-pdf corollaries. Applications usually select one of a few standard jump distributions. The standard distributions are generally symmetric so that they implicitly satisfy the detailed balance condition with any target density. The first uses a Gaussian jump density to drive the random walk over the target density surface. The thin tails of the Gaussian pdf give the Markov chain superior local-search characteristics. But the thin tails also limit the power of the density on complex multimodal surfaces because jumps from one local minimum to another are rare. Fast MCMC methods often rely on the thicker-tailed Cauchy pdf. The Cauchy pdf is the only non-Gaussian symmetric alpha-stable density with a closed form. While the Cauchy pdf closely resembles the Gaussian bell curve, its fatter tails mean that it occasionally produces impulsive samples far from the origin. This allows fast MCMC algorithms to trade some of their local-search capability for occasional long-range flights to other regions of the sample space.

3.4.3 MCMC with Gaussian Jump Densities

Most MCMC implementations use a Gaussian jump density to drive the random walk because it is easy to compute and provides the same eventual convergence guarantees of other jump functions.
The following corollary derives as a special case an algebraic condition that is equivalent to the noise-benefit inequality (3.272) in the context of a Gaussian jump pdf.

Corollary 3.5. Suppose $Q(y|x_t) \sim N(x_t, \sigma^2)$. Then the sufficient noise-benefit condition (3.272) holds if
$$n\big(n + 2(x_t - x)\big) \le -2\sigma^2 A. \tag{3.279}$$
Proof. Assume $Q(y|x_t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-x_t)^2}{2\sigma^2}}$. Then
$$Q(y|x_t+n) \ge e^{A}\, Q(y|x_t) \tag{3.280}$$
iff
$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-x_t-n)^2}{2\sigma^2}} \ge e^{A}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-x_t)^2}{2\sigma^2}} \tag{3.281}$$
iff
$$e^{-\frac{(x-x_t-n)^2}{2\sigma^2}} \ge e^{A - \frac{(x-x_t)^2}{2\sigma^2}} \tag{3.282}$$
iff
$$-\frac{(x-x_t-n)^2}{2\sigma^2} \ge A - \frac{(x-x_t)^2}{2\sigma^2} \tag{3.283}$$
iff
$$-(x-x_t-n)^2 \ge 2\sigma^2 A - (x-x_t)^2 \tag{3.284}$$
iff
$$-x^2 + 2x x_t + 2xn - x_t^2 - 2x_t n - n^2 \ge 2\sigma^2 A - x^2 + 2x x_t - x_t^2 \tag{3.285}$$
iff
$$2xn - 2x_t n - n^2 \ge 2\sigma^2 A \tag{3.286}$$
iff
$$n\big(n + 2(x_t - x)\big) \le -2\sigma^2 A. \tag{3.287}$$

The results thus far have focused on additive noise benefits. This focus springs from the original descriptions of stochastic-resonance noise benefits, which studied systems with periodic forcing functions and additive perturbations. Corollary 3.3 shows that MCMC noise benefits are not exclusive to additive noise. The next corollary derives a similar algebraic condition under the assumption of multiplicative noise.

Corollary 3.6. Suppose $Q(y|x_t) \sim N(x_t, \sigma^2)$ and $g(x_t, n) = n x_t$. Then the sufficient noise-benefit condition (3.272) holds if
$$n x_t\big(n x_t - 2x\big) + x_t\big(2x - x_t\big) \le -2\sigma^2 A. \tag{3.288}$$
Proof. Assume $Q(y|x_t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-x_t)^2}{2\sigma^2}}$. Then $Q(y|n x_t) \ge e^{A}\, Q(y|x_t)$ iff
$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-n x_t)^2}{2\sigma^2}} \ge e^{A}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-x_t)^2}{2\sigma^2}} \tag{3.289}$$
iff
$$e^{-\frac{(x-n x_t)^2}{2\sigma^2}} \ge e^{A - \frac{(x-x_t)^2}{2\sigma^2}} \tag{3.290}$$
iff
$$-\frac{(x-n x_t)^2}{2\sigma^2} \ge A - \frac{(x-x_t)^2}{2\sigma^2} \tag{3.291}$$
iff
$$-(x-n x_t)^2 \ge 2\sigma^2 A - (x-x_t)^2 \tag{3.292}$$
iff
$$-x^2 + 2x n x_t - n^2 x_t^2 \ge 2\sigma^2 A - x^2 + 2x x_t - x_t^2 \tag{3.293}$$
iff
$$2x n x_t - n^2 x_t^2 - 2x x_t + x_t^2 \ge 2\sigma^2 A \tag{3.294}$$
iff
$$n x_t\big(n x_t - 2x\big) + x_t\big(2x - x_t\big) \le -2\sigma^2 A. \tag{3.295}$$
This corollary shows that noise benefits can accrue from many different injection methods.
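Corollary 3.5's algebraic condition can be checked directly against the density inequality (3.272) that it encodes. The sketch below draws random $x$, $x_t$, $n$, and $A$ and confirms that $n(n + 2(x_t - x)) \le -2\sigma^2 A$ holds exactly when $Q(y|x_t+n) \ge e^A Q(y|x_t)$ for a Gaussian jump pdf; the parameter ranges and $\sigma$ are arbitrary illustrative choices:

```python
import math, random

def gauss_jump(x, center, sigma):
    # Gaussian jump density of Corollary 3.5 evaluated at x with mean `center`
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(1)
sigma = 1.5
all_agree = True
for _ in range(10000):
    x, x_t, n, A = (random.uniform(-3.0, 3.0) for _ in range(4))
    margin = n * (n + 2.0 * (x_t - x)) + 2.0 * sigma ** 2 * A
    if abs(margin) < 1e-9:
        continue  # skip numerically borderline draws
    algebraic = margin <= 0.0  # n(n + 2(x_t - x)) <= -2 sigma^2 A
    densities = gauss_jump(x, x_t + n, sigma) >= math.exp(A) * gauss_jump(x, x_t, sigma)
    if algebraic != densities:
        all_agree = False
```

Agreement on every non-borderline draw reflects the iff chain (3.280)–(3.287).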
This also means that physical implementations of these algorithms may be able to harness different modes of environmental noise to realize a noise benefit.

3.4.4 MCMC with Cauchy Jump Densities

The last corollary in this section derives a noise benefit for fast MCMC algorithms. Fast MCMC algorithms use jump densities with super-Gaussian tails. They give up some of their local-search character in exchange for additional search breadth. Fast MCMC algorithms may provide crucial methods for handling the increased dimension and scale of modern MCMC problems. Even though little research has focused on fast MCMC methods, they inherit the large class of convergence and performance guarantees provided by general MCMC methods. This corollary provides the first statement of a noise benefit in a Cauchy MCMC algorithm. The statement is an algebraic condition similar to the Gaussian results in Corollary 3.5 and Corollary 3.6.

Corollary 3.7. Suppose $Q(y|x_t) \sim \mathrm{Cauchy}(x_t, d)$. Then the sufficient condition (3.272) holds if
$$n^2 - 2n(x - x_t) \le \big(e^{-A} - 1\big)\big(d^2 + (x - x_t)^2\big). \tag{3.296}$$
Proof.
$$Q(y|x_t) = \frac{1}{\pi d\left[1 + \left(\frac{x - x_t}{d}\right)^2\right]}. \tag{3.297}$$
Therefore
$$Q(y|x_t+n) \ge e^{A}\, Q(y|x_t) \tag{3.298}$$
iff
$$\frac{1}{\pi d\left[1 + \left(\frac{x - x_t - n}{d}\right)^2\right]} \ge e^{A}\, \frac{1}{\pi d\left[1 + \left(\frac{x - x_t}{d}\right)^2\right]} \tag{3.299}$$
iff
$$1 + \left(\frac{x - x_t - n}{d}\right)^2 \le e^{-A}\left[1 + \left(\frac{x - x_t}{d}\right)^2\right] \tag{3.300}$$
$$= e^{-A} + e^{-A}\left(\frac{x - x_t}{d}\right)^2 \tag{3.301}$$
iff
$$\left(\frac{x - x_t - n}{d}\right)^2 \le e^{-A}\left(\frac{x - x_t}{d}\right)^2 + e^{-A} - 1 \tag{3.302}$$
iff
$$(x - x_t - n)^2 \le e^{-A}(x - x_t)^2 + d^2\big(e^{-A} - 1\big) \tag{3.303}$$
iff
$$(x - x_t)^2 + n^2 - 2n(x - x_t) \le e^{-A}(x - x_t)^2 + d^2\big(e^{-A} - 1\big) \tag{3.304}$$
iff
$$n^2 - 2n(x - x_t) \le \big(e^{-A} - 1\big)(x - x_t)^2 + d^2\big(e^{-A} - 1\big) \tag{3.305}$$
iff
$$n^2 \le d^2\big(e^{-A} - 1\big) + \big(e^{-A} - 1\big)(x - x_t)^2 + 2n(x - x_t) \tag{3.306}$$
iff
$$n^2 - 2n(x - x_t) \le \big(e^{-A} - 1\big)\big(d^2 + (x - x_t)^2\big). \tag{3.307}$$

The MCMC noise-benefit condition emerges from analysis of the single-step relative-entropy difference between noiseless and noisy MCMC samples. The sufficient condition of Corollary 3.4 lets one derive noise-benefit conditions by direct manipulation of $Q(y|x_t)$.
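The Cauchy condition of Corollary 3.7 admits the same kind of numerical sanity check as the Gaussian case: the quadratic condition in $n$ should agree with the density inequality (3.272) draw by draw. The scale $d = 1$ and the sampling ranges are illustrative assumptions of this sketch:

```python
import math, random

def cauchy_jump(x, center, d):
    # Cauchy(center, d) jump density of Corollary 3.7 evaluated at x
    return 1.0 / (math.pi * d * (1.0 + ((x - center) / d) ** 2))

random.seed(2)
d = 1.0
all_agree = True
for _ in range(10000):
    x, x_t, n, A = (random.uniform(-3.0, 3.0) for _ in range(4))
    u = x - x_t
    margin = n * n - 2.0 * n * u - (math.exp(-A) - 1.0) * (d * d + u * u)
    if abs(margin) < 1e-9:
        continue  # skip numerically borderline draws
    algebraic = margin <= 0.0  # n^2 - 2n(x - x_t) <= (e^{-A} - 1)(d^2 + (x - x_t)^2)
    densities = cauchy_jump(x, x_t + n, d) >= math.exp(A) * cauchy_jump(x, x_t, d)
    if algebraic != densities:
        all_agree = False
```

The agreement holds for any $d > 0$ since multiplying through by $d^2$ preserves the iff chain (3.298)–(3.307).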
Thus similar algebraic conditions exist for a large class of jump pdfs, including $Q(y|x_t) \sim \mathrm{Uniform}[x_t - a,\, x_t + a]$ and a bimodal mixture of two normal densities.

3.5 The Noisy Metropolis-Hastings Algorithm

The Noisy Metropolis-Hastings (MH) algorithm injects noise into the classical Metropolis-Hastings algorithm before it determines whether to accept the sample. The algorithm augments the existing MH jump function $Q(x_{t+1}|x_t)$ with noise that satisfies the N-MCMC condition.

Algorithm 3.1 The Noisy Metropolis-Hastings Algorithm
1: procedure NoisyMetropolisHastings(X)
2:   x_0 ← Initial(X)
3:   for t ← 0, N do
4:     x_{t+1} ← Sample(x_t)
5: procedure Sample(x_t)
6:   x' ← x_t + JumpQ(x_t) + Noise(x_t)
7:   α ← π(x') / π(x_t)
8:   if α ≥ 1 then
9:     return x'
10:  else if Uniform[0,1] < α then
11:    return x'
12:  else
13:    return x_t
14: procedure JumpQ(x_t)
15:   return y ∼ Q(y|x_t)
16: procedure Noise(x_t)
17:   return y ∼ f(y|x_t)

Figure 3.2 shows a single-step comparison of the Noisy-MH algorithm to the classical MH algorithm. The algorithm used a standard normal equilibrium density $\pi(x) \sim N(0,1)$. The simulation repeated 100,000 Metropolis-Hastings single steps with initial condition $x_0 = 1$. The noisy simulations injected uniform random noise $U[a,b]$ for $0 < |a| < |b| < 6$. The planar cross section indicates the no-noise null hypothesis under which the pseudo-distance in the noisy algorithm does not differ from the pseudo-distance in the non-noisy algorithm.

Figure 3.2: MCMC noise benefit ($\mathrm{KL}_0 - \mathrm{KL}_{\mathrm{noise}}$) as a function of the noise parameters. The figure shows how noise confers a relative improvement upon the KL divergence of MCMC simulations. The simulation injected uniform noise $n_t \sim U(a,b)$, where $0 \le |a| < |b| \le 6$ (and $a < b$), into MCMC sampling from a standard normal target pdf. The planar cross section represents standard (no-noise) MCMC performance. Points above the cross section correspond to a noise benefit. Points below the cross section correspond to a noise harm.
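A minimal runnable sketch of Algorithm 3.1, assuming a standard normal target and a Gaussian jump pdf. It uses symmetric uniform noise $U[-b, b]$ so that the perturbed proposal stays symmetric and the chain still targets $\pi$ exactly; the thesis's experiments instead sweep one-sided $U[a,b]$ noise. All parameter values are illustrative:

```python
import math, random

def target(x):
    # Unnormalized standard normal target pi(x); MH needs only ratios of pi
    return math.exp(-0.5 * x * x)

def noisy_metropolis_hastings(steps, jump_sigma=1.0, noise_width=0.5, seed=0):
    """Algorithm 3.1 sketch: candidate = x_t + jump + noise, then the usual MH accept test."""
    rng = random.Random(seed)
    x = 1.0                                  # initial condition x_0
    chain = []
    for _ in range(steps):
        cand = x + rng.gauss(0.0, jump_sigma) + rng.uniform(-noise_width, noise_width)
        alpha = target(cand) / target(x)     # acceptance ratio pi(cand) / pi(x_t)
        if alpha >= 1.0 or rng.random() < alpha:
            x = cand
        chain.append(x)
    return chain

chain = noisy_metropolis_hastings(20000)
burned = chain[2000:]                        # discard burn-in
mean = sum(burned) / len(burned)
var = sum((v - mean) ** 2 for v in burned) / len(burned)
```

After burn-in the sample mean and variance should be near the target's $0$ and $1$, which is a coarse check that the noisy proposal has not broken the chain's equilibrium.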
Chapter 4
Noisy Simulated Annealing (N-SA)

This chapter presents results that show how carefully injected noise can speed the convergence of simulated annealing algorithms. The major result is the noise-injected version of the simulated annealing (SA) algorithm: the Noisy Simulated Annealing (NSA) algorithm. The NSA algorithm uses noise to preferentially sample high-probability regions of the sample space and also to occasionally accept less optimal solutions when they may increase the breadth of the search. The NSA Theorem proves that noise can increase the acceptance probability of SA sampling subject to a positivity condition on a ratio of the noisy and noiseless versions of the target density. A corollary shows that the noise benefit extends to convex mappings of the Boltzmann-factor relation between the potential energy surface and the occupancy probability density. A second corollary gives a simplified noise-benefit condition for annealing exponential-family surfaces. The next section presents the adaptation of the Metropolis-Hastings MCMC algorithm for global optimization called simulated annealing. Kirkpatrick [189] describes the thermodynamically inspired algorithm as a method to find optimal layouts for VLSI circuits.

4.1 Classical Simulated Annealing

Suppose we want to find the global minimum of a cost function $C(x)$. $C(x)$ is also called a potential energy surface. Simulated annealing maps the cost function to a probability density via the Boltzmann factor
$$\tilde{p}(x_t) \propto \exp\!\left[-\frac{C(x_t)}{kT}\right] \tag{4.1}$$
and then performs the Metropolis-Hastings algorithm with $\tilde{p}(x_t)$ in place of the probability density $p(x)$. This operation preserves the Metropolis-Hastings framework because $\tilde{p}(x_t)$ is an unnormalized probability density. SA recasts the Metropolis-Hastings formulation as a thermodynamic hopping process.
Simulated annealing thus maps the target cost function $C(x)$ to a potential energy surface and samples with Metropolis-Hastings from the Boltzmann density $\tilde{p}(x_t) \propto \exp\!\left[-C(x_t)/(kT)\right]$ (4.2) in place of the probability density $p(x)$. Figure 1.5 illustrates how SA uses a slowly decreasing temperature to constrain samples to regions of decreasing energy. For temperatures near absolute zero the algorithm permits only state changes that lower the estimated energy. The current state becomes locked within successively lower energy minima as the algorithm proceeds. MCMC algorithms require closed and bounded sample spaces, so the decreasing energy estimates eventually attain the global optimum. Note that the algorithm can also optimize functions of the cost: it can find the maximum of $C(x)$ by finding the minimum of $-C(x)$.

SA algorithms use a cooling schedule to update the temperature. The algorithm slowly cools the system according to a cooling schedule $T(t)$. As $T$ decreases, the algorithm reduces the probability of accepting candidate points that have higher energy than the current state. The algorithm provably attains a global minimum in the $t \to \infty$ limit, but this requires an extremely slow cooling $T(t) \propto 1/\log(t+1)$. Accelerated cooling schedules, such as geometric $T(t) \propto \alpha^t$ or exponential $T(t) \propto \exp\!\big(-d\,\sqrt[p]{t}\big)$, often yield satisfactory approximations in practice. In these cases improvements in the estimate scale as a power law in time.

1. Choose an initial $x_0$ with $C(x_0) > 0$ and an initial temperature $T_0$.
2. Generate a candidate $\tilde{x}_{t+1}$ by sampling from the jump distribution $Q(y|x_t)$.
3. Compute the Boltzmann factor $\alpha = \exp\!\left(-\frac{C(\tilde{x}_{t+1}) - C(x_t)}{kT}\right)$.
4. Accept the candidate point ($x_{t+1} = \tilde{x}_{t+1}$) if the jump decreases the energy. Also accept the candidate point with probability $\alpha$ if the jump increases the energy. Else reject the jump ($x_{t+1} = x_t$).
5.
Update the temperature: $T_t = T(t)$. $T(t)$ is usually a monotonically decreasing function.
6. Return to step 2.

The next section presents the major simulated annealing result: a noise benefit in classical simulated annealing. The result shows how noise injection can increase the simulated annealing acceptance probability. The efficiency of the acceptance criterion is critical to the performance of annealing algorithms.

4.2 Noisy Simulated Annealing Theorems

The Noisy Simulated Annealing Theorem stems from a simple intuition: find a noise sample $n$ that increases the acceptance probability of the next choice of location. Simulated annealing works within the customary Metropolis-Hastings framework. Define the usual jump function $Q(y|x)$ as the probability that the system moves or jumps to state $y$ if it is in state $x$. Simulated annealing requires that the jump function meet the usual MCMC constraints such as reversibility and irreducibility. Now consider a noise sample $n$ that increases the probability of accepting the jump:
$$\alpha(x_{t+1} + n\,|\,x_t) \ge \alpha(x_{t+1}\,|\,x_t) \tag{4.3}$$
where
$$\alpha(x_{t+1}|x_t) = \min\left\{1,\; \exp\!\left(-\frac{\Delta E}{T}\right)\right\} \tag{4.4}$$
$$= \min\left\{1,\; \frac{\pi(x_{t+1})}{\pi(x_t)}\right\} \tag{4.5}$$
defines the simulated annealing acceptance probability of the step to $x_{t+1}$ from $x_t$ given a fixed system temperature $T$. This is equivalent to
$$\ln \frac{\alpha(x_{t+1} + n\,|\,x_t)}{\alpha(x_{t+1}\,|\,x_t)} \ge 0. \tag{4.6}$$
Apply the definition of $\alpha$ and rearrange to get the key inequality for the noise boost:
$$\ln \frac{\pi(x_t + n)}{\pi(x_t)} \ge 0. \tag{4.7}$$
Taking expectations over the noise random variable $N$ gives a simplified version of the NSA Theorem noise-benefit inequality:
$$E_N\!\left[\ln \frac{\pi(x_t + N)}{\pi(x_t)}\right] \ge 0. \tag{4.8}$$
The Noisy SA Theorem follows. It provides a sufficient condition for injected noise to induce a benefit. The NSA Theorem assumes the usual MH jump function $Q(x_{t+1}|x_t)$ to generate candidate samples from a target energy surface $C(x)$. It then uses the Boltzmann factor to compute an acceptance probability given the candidate energy $C(\tilde{x}_{t+1})$ and the current energy $C(x_t)$.
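The annealing loop in steps 1–6 above is short to code. The sketch below anneals a toy double-well cost (an illustrative stand-in for a real energy surface, with its global minimum near $x \approx -1.02$) and supports the three cooling schedules just discussed; the temperature constants, jump width, schedule rates, and step count are all arbitrary assumptions:

```python
import math, random

def cost(x):
    # Toy double-well potential: global minimum near x = -1.02, local minimum near x = +1
    return (x * x - 1.0) ** 2 + 0.3 * x

def simulated_annealing(steps, t0=2.0, schedule="log", seed=0):
    rng = random.Random(seed)
    x = 1.0                                  # start in the wrong (local) well
    best = x
    for t in range(1, steps + 1):
        if schedule == "log":                # provably convergent but very slow
            temp = t0 / math.log(t + 1.0)
        elif schedule == "geometric":        # accelerated: T(t) = T0 * alpha^t
            temp = t0 * (0.999 ** t)
        else:                                # exponential in a root of t
            temp = t0 * math.exp(-0.05 * math.sqrt(t))
        cand = x + rng.gauss(0.0, 1.0)       # candidate from the jump pdf Q(y|x_t)
        dE = cost(cand) - cost(x)
        if dE <= 0.0 or rng.random() < math.exp(-dE / temp):
            x = cand                         # Boltzmann-factor accept test
        if cost(x) < cost(best):
            best = x
    return best

best = simulated_annealing(20000, schedule="geometric")
```

The warm early phase lets the chain hop the barrier out of the local well before the falling temperature locks it into the global one.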
The theorem injects noise by perturbing $x_t$ prior to drawing the sample $x_{t+1}$.

Theorem 4.1 (Noisy Simulated Annealing Theorem). Suppose $C(x)$ is an energy surface with occupancy probabilities given by $\pi(x;T) \propto \exp\!\left(-\frac{C(x)}{T}\right)$. Then a simulated-annealing noise benefit
$$E_N[\alpha_N(T)] \ge \alpha(T) \tag{4.9}$$
occurs on average if
$$E_N\!\left[\ln \frac{\pi(x_t + N; T)}{\pi(x_t; T)}\right] \ge 0 \tag{4.10}$$
where $\alpha(T)$ is the simulated annealing acceptance probability of the move from state $x_t$ to the candidate $\tilde{x}_{t+1}$, which depends on a temperature $T$ governed by the cooling schedule $T(t)$:
$$\alpha(T) = \min\left\{1,\; \exp\!\left(-\frac{\Delta E}{T}\right)\right\} \tag{4.11}$$
and $\Delta E = E_{t+1} - E_t = C(\tilde{x}_{t+1}) - C(x_t)$ is the energy difference between the states $\tilde{x}_{t+1}$ and $x_t$.

Proof. The proof uses Jensen's inequality for a concave function $g$ [107]:
$$g(E[X]) \ge E[g(X)] \tag{4.12}$$
for a random variable $X$. The natural logarithm is concave and thus
$$\ln E[X] \ge E[\ln X]. \tag{4.13}$$
Then
$$\alpha(T) = \min\left\{1,\; \exp\!\left(-\frac{\Delta E}{T}\right)\right\} \tag{4.14}$$
$$= \min\left\{1,\; \exp\!\left(-\frac{E_{t+1} - E_t}{T}\right)\right\} \tag{4.15}$$
$$= \min\left\{1,\; \frac{\exp\!\left(-\frac{E_{t+1}}{T}\right)}{\exp\!\left(-\frac{E_t}{T}\right)}\right\} \tag{4.16}$$
$$= \min\left\{1,\; \frac{\frac{1}{Z}\exp\!\left(-\frac{E_{t+1}}{T}\right)}{\frac{1}{Z}\exp\!\left(-\frac{E_t}{T}\right)}\right\} \tag{4.17}$$
$$= \min\left\{1,\; \frac{\pi(\tilde{x}_{t+1}; T)}{\pi(x_t; T)}\right\} \tag{4.18}$$
where the normalizing constant is
$$Z = \int_X \exp\!\left(-\frac{C(x)}{T}\right) dx. \tag{4.19}$$
Let $N$ be a noise random variable that perturbs the candidate state $\tilde{x}_{t+1}$. To show
$$E_N[\alpha_N(T)] = E_N\!\left[\min\left\{1,\; \frac{\pi(\tilde{x}_{t+1} + N; T)}{\pi(x_t; T)}\right\}\right] \tag{4.20}$$
$$\ge \min\left\{1,\; \frac{\pi(\tilde{x}_{t+1}; T)}{\pi(x_t; T)}\right\} \tag{4.21}$$
$$= \alpha(T) \tag{4.22}$$
it suffices to show that
$$E_N\!\left[\frac{\pi(\tilde{x}_{t+1} + N; T)}{\pi(x_t; T)}\right] \ge \frac{\pi(\tilde{x}_{t+1}; T)}{\pi(x_t; T)} \tag{4.23}$$
iff
$$E_N\big[\pi(\tilde{x}_{t+1} + N; T)\big] \ge \pi(\tilde{x}_{t+1}; T) \tag{4.24}$$
since $\pi(x_t) \ge 0$ because $\pi$ is a pdf.
Suppose
$$E_N\!\left[\ln \frac{\pi(x_t + N)}{\pi(x_t)}\right] \ge 0. \tag{4.25}$$
Then
$$E_N\big[\ln \pi(x_t + N) - \ln \pi(x_t)\big] \ge 0 \tag{4.26}$$
iff
$$E_N[\ln \pi(x_t + N)] \ge E_N[\ln \pi(x_t)] \tag{4.27}$$
which implies by Jensen's inequality
$$\ln E_N[\pi(x_t + N)] \ge E_N[\ln \pi(x_t)] \tag{4.28}$$
iff
$$\ln E_N[\pi(x_t + N)] \ge \int_N \ln \pi(x_t)\, f_N(n|x_t)\,dn \tag{4.29}$$
iff
$$\ln E_N[\pi(x_t + N)] \ge \ln \pi(x_t) \underbrace{\int_N f_N(n|x_t)\,dn}_{=1} \tag{4.30}$$
iff
$$\ln E_N[\pi(x_t + N)] \ge \ln \pi(x_t) \tag{4.31}$$
iff
$$E_N[\pi(x_t + N)] \ge \pi(x_t). \tag{4.32}$$

The theorem attributes the SA noise benefit to the increased acceptance probability of candidate samples. But this increase does not spring from a naive blind increase in the acceptance rate. The theorem uses the same SA acceptance criterion but modifies the samples so that they are more likely to meet the threshold. The upshot: an SA algorithm proposes better samples at each iteration if it injects noise that satisfies the NSA Theorem inequality on average.

The next corollary shows that the noise-benefit condition simplifies for an exponential-family target density $\pi(x)$.

Corollary 4.1. Suppose $\pi(x) = A\, e^{g(x)}$ where $A$ is a normalizing constant such that $A = 1 / \int_X e^{g(x)}\,dx$. Then there is an N-SA Theorem noise benefit if
$$E_N\big[g(x_t + N)\big] \ge g(x_t). \tag{4.33}$$
Proof. Suppose
$$E_N\big[g(x_t + N)\big] \ge g(x_t). \tag{4.34}$$
Then
$$E_N\big[\ln e^{g(x_t + N)}\big] \ge \ln e^{g(x_t)} \tag{4.35–4.36}$$
iff
$$E_N\!\left[\ln \frac{\pi(x_t + N)}{A}\right] \ge \ln \frac{\pi(x_t)}{A} \tag{4.37}$$
iff
$$E_N\!\left[\ln \frac{\pi(x_t + N)}{A} - \ln \frac{\pi(x_t)}{A}\right] \ge 0 \tag{4.38}$$
iff
$$E_N\!\left[\ln \frac{\pi(x_t + N)/A}{\pi(x_t)/A}\right] \ge 0 \tag{4.39}$$
iff
$$E_N\!\left[\ln \frac{\pi(x_t + N)}{\pi(x_t)}\right] \ge 0. \tag{4.40}$$

The simplified noise-benefit condition in Corollary 4.1 depends only on the exponent $g(x)$. It shows that parametric assumptions about $\pi(x)$ can simplify the noise-benefit constraint. For an exponential-family target density the condition reduces to a simple threshold on the expected exponent. The next section shows how the noise-benefit sufficient condition generalizes to convex functions of the ratio of occupancy probabilities.
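Corollary 4.1 turns the noise-benefit check into a comparison of exponents. For the standard normal case $g(x) = -x^2/2$, state $x_t = -1$, and noise $N \sim U[0,1]$ (which pushes the state toward the mode), $E_N[g(x_t+N)] = -\tfrac{1}{2}E[(x_t+N)^2] = -1/6 \ge g(x_t) = -1/2$, so the condition holds. The sketch below confirms the closed form by Monte Carlo; the choice of $g$, state, and noise is an illustrative assumption:

```python
import random

def g(x):
    # Exponent of an exponential-family target pi(x) proportional to exp(g(x));
    # g(x) = -x^2/2 gives a standard normal target.
    return -0.5 * x * x

random.seed(3)
x_t = -1.0
trials = 100000
# Monte Carlo estimate of E_N[ g(x_t + N) ] with N ~ U[0,1]; closed form is -1/6
e_g = sum(g(x_t + random.random()) for _ in range(trials)) / trials
benefit = e_g >= g(x_t)      # Corollary 4.1 noise-benefit condition
```

The same noise applied at $x_t = +1$ would move the state away from the mode and violate the condition, which shows why the corollary's inequality depends on the state.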
4.2.1 Noisy Simulated Annealing with Convex Increasing Cost-Probability Maps

This section presents a corollary showing that the noise-benefit condition generalizes to convex increasing functions of the occupancy density ratio. The Boltzmann factor connects the sample probability and the sample energy. The exponential form offers a generalization because of the equivalence between the Boltzmann factor of the energy difference and the ratio of occupancy densities (probabilities) that correspond to that energy difference:
$$\exp\!\left(\frac{C(x_t) - C(x_{t+1})}{T}\right) = \exp\!\left(-\frac{\Delta E}{T}\right) \propto \frac{\pi(x_{t+1})}{\pi(x_t)}. \tag{4.41}$$

Corollary 4.2. Suppose $m$ is a convex increasing function. Then an N-SA Theorem noise benefit
$$E_N[\alpha_N(T)] \ge \alpha(T) \tag{4.42}$$
occurs on average if
$$E_N\!\left[\ln \frac{\pi(x_t + N; T)}{\pi(x_t; T)}\right] \ge 0 \tag{4.43}$$
where $\alpha$ is the acceptance probability of the move from state $x_t$ to the candidate $\tilde{x}_{t+1}$:
$$\alpha(T) = \min\left\{1,\; m\!\left(\frac{\pi(\tilde{x}_{t+1}; T)}{\pi(x_t; T)}\right)\right\}. \tag{4.44}$$
Proof. To show
$$E_N[\alpha_N(T)] = E_N\!\left[\min\left\{1,\; m\!\left(\frac{\pi(\tilde{x}_{t+1} + N; T)}{\pi(x_t; T)}\right)\right\}\right] \tag{4.45}$$
$$\ge \min\left\{1,\; m\!\left(\frac{\pi(\tilde{x}_{t+1}; T)}{\pi(x_t; T)}\right)\right\} \tag{4.46}$$
$$= \alpha(T) \tag{4.47}$$
it suffices to show that
$$E_N\!\left[m\!\left(\frac{\pi(\tilde{x}_{t+1} + N; T)}{\pi(x_t; T)}\right)\right] \ge m\!\left(\frac{\pi(\tilde{x}_{t+1}; T)}{\pi(x_t; T)}\right). \tag{4.48}$$
Suppose
$$E_N\!\left[\ln \frac{\pi(x_t + N; T)}{\pi(x_t; T)}\right] \ge 0. \tag{4.49}$$
Then as in the N-SA Theorem proof
$$E_N[\pi(x_t + N)] \ge \pi(x_t). \tag{4.50}$$
Thus
$$\frac{E_N[\pi(x_t + N)]}{\pi(x_t; T)} \ge \frac{\pi(x_t)}{\pi(x_t; T)} \tag{4.51}$$
because $\pi(x_t) \ge 0$ since $\pi$ is a pdf, and
$$E_N\!\left[\frac{\pi(x_t + N)}{\pi(x_t; T)}\right] \ge \frac{\pi(x_t)}{\pi(x_t; T)}. \tag{4.52}$$
So
$$m\!\left(E_N\!\left[\frac{\pi(x_t + N)}{\pi(x_t; T)}\right]\right) \ge m\!\left(\frac{\pi(x_t)}{\pi(x_t; T)}\right) \tag{4.53}$$
since $m$ is increasing, and
$$E_N\!\left[m\!\left(\frac{\pi(x_t + N)}{\pi(x_t; T)}\right)\right] \ge m\!\left(\frac{\pi(x_t)}{\pi(x_t; T)}\right) \tag{4.54}$$
since $m$ is increasing and convex (Jensen's inequality).

The next section introduces the noisy simulated annealing algorithm.

4.3 The Noisy Simulated Annealing Algorithm

The NSA algorithm describes how to augment simulated annealing stochastic optimization to include beneficial noise.

Algorithm 4.1 The Noisy Simulated Annealing Algorithm
1: procedure NoisySimulatedAnnealing(X, T_0)
2:   x_0 ← Initial(X)
3:   for t ← 0, N do
4:     T ← Temp(t)
5:     x_{t+1} ← Sample(x_t, T)
6: procedure Sample(x_t, T)
7:   x' ← x_t + JumpQ(x_t) + Noise(x_t)
8:   ΔE ← C(x') − C(x_t)
9:   if ΔE ≤ 0 then
10:    return x'
11:  else if Uniform[0,1] < exp(−ΔE/T) then
12:    return x'
13:  else
14:    return x_t
15: procedure JumpQ(x_t)
16:   return y ∼ Q(y|x_t)
17: procedure Noise(x_t)
18:   return y ∼ f(y|x_t)

4.4 Applications

Simulations show that optimal noise gave a 76% speed-up in finding the global minimum on the Schwefel optimization benchmark. The noise-boosted simulations found the global minimum in 99.8% of trials compared with 95.4% for non-noise-boosted simulated annealing. The simulations also show that the noise boost is robust to accelerated cooling schedules: noise decreased convergence times by more than 32% under aggressive geometric cooling. Molecular dynamics simulations showed that optimal noise gave a 42% speed-up in finding the minimum potential-energy configuration of an 8-argon-atom gas system governed by a Lennard-Jones 12-6 potential.

4.4.1 Noise improves complex optimization

The first simulation shows a noise benefit for simulated annealing on a complex cost function. The Schwefel function [316] is a standard optimization benchmark because it has many local minima and a single global minimum. It is given by
$$f(x) = 418.9829\,d - \sum_{i=1}^{d} x_i \sin\!\big(\sqrt{|x_i|}\big)$$
where $d$ is the dimension, over the hypercube $-500 \le x_i \le 500$ for $i = 1,\dots,d$. The Schwefel function has a single global minimum $f(x_{\min}) = 0$ at $x_{\min} = (420.9687, \dots, 420.9687)$. Figure 4.1 shows a representation of the surface for $d = 2$.
The simulation used a zero-mean Gaussian jump distribution with $\sigma_{\mathrm{jump}} = 5$ and a zero-mean Gaussian noise distribution with $0 < \sigma_{\mathrm{noise}} \le 5$. Figure 4.3(a) shows that noisy simulated annealing converges 76% faster than standard simulated annealing under log cooling. Figure 4.3(b) shows that the estimated global minimum from noisy simulated annealing is almost two orders of magnitude better than that of non-noisy simulations on average (0.05 vs. 4.6). The simulation annealed a 5-dimensional Schwefel surface. It estimated the minimum-energy configuration and averaged the result over 1000 trials. We define the convergence time as the number of steps the simulation required to reach the global minimum energy within $10^{-3}$:
$$|f(x_t) - f(x_{\min})| \le 10^{-3}. \tag{4.55}$$
Figure 4.2 shows projections of trajectories from a simulation without noise (a) and a simulation with noise (b). We initialized each simulation with the same $x_0$. The figure shows the global minimum circled in red (lower left). It shows that noisy simulated annealing boosted the sequences through more local minima while the no-noise simulation could not escape cycling among three local minima. Figure 4.3(c) shows that noise decreases the failure rate of the simulation. We define a failed simulation as one that did not converge before $t < 10^7$. For zero-noise simulations the failure rate was 4.5%. Even moderate noise brought the failure rate to less than 1 in 200 ($< 0.5\%$). Figure 4.4 shows that noise also boosts simulated annealing with accelerated cooling schedules. Noise reduced convergence time by 40.5% under exponential cooling and 32.8% under geometric cooling. Across all noise levels the simulations attained comparable solution error and failure rate (0.05%), so we omit the figures.

Figure 4.1: The Schwefel function $f(x) = 418.9829\,d - \sum_{i=1}^{d} x_i \sin\!\big(\sqrt{|x_i|}\big)$ is a $d$-dimensional optimization benchmark on the hypercube $-500 \le x_i \le 500$ [316, 249, 364, 81].
It has a single global minimum f(x_min) = 0 at x_min = (420.9687, ..., 420.9687). The surface contains irregular troughs separated by energy peaks. This leads to estimate capture in search algorithms that emphasize local search.

4.4.2 Noise benefits in molecular dynamics simulations

Molecular dynamics simulations encompass a range of computation methods with wide applicability in biology, chemistry, and pharmacology. The following sections motivate the research finding of noise benefits in small atom-gas molecular dynamics. The discussion focuses on the specific pharmacological problem of studying enzyme-ligand binding reactions. These molecular interactions are among the most complex class of problems because they exhibit essential dynamic behaviors. This often leads to more challenging problems than those found in massive static molecular dynamics simulations such as protein folding. Pharmacological molecular dynamics also demand

(a) Without noise (b) With noise

Figure 4.2: Simulated annealing sample sequences from the 5-dimensional Schwefel surface (projected to 2-D) with a log cooling schedule show how noise increases the breadth of search. Noisy simulated annealing visited more local minima and quickly moved away from the minima that trapped non-noisy SA. Both figures show sample sequences with initial condition x_0 = (0, 0) and N = 10⁶. The red circle (lower left) indicates the global minimum at x_min = (420.9687, 420.9687). (a) The non-noisy algorithm found the (205, 205) local minimum within the first 100 time steps. Thermal noise was not enough to induce the noiseless algorithm to search the space beyond three local minima. (b) The noisy simulation followed the noiseless simulation at the simulation start. It sampled the same regions but the noise enhanced the thermal jumps and allowed the simulation to increase its breadth.
It visited the same three minima as (a) but it performed a local optimization for only a few hundred steps before jumping to the next minimum. The estimate settled at (310, 310), one hop away from the global minimum x_min.

special considerations for simulation speed because they often involve running tens of thousands of simulations in parallel during database search.

Computational approaches to pharmacological molecular dynamics

Simulations that model enzyme-substrate docking are increasingly popular in biochemical research and the pharmaceutical industry. Careful selection of algorithms, which may include a search function and a scoring function, reduces an immense configuration space. Molecular dynamics methods employ different assumptions that might not apply for particular interactions such as ligand rigidity or zero electrostatic interaction. Analysis can determine which molecules from a library of thousands are the most potent inhibitors or

(a) Convergence time (b) Minimum energy (c) Failure rate

Figure 4.3: Simulated annealing noise benefits on the 5-dimensional Schwefel energy surface with a log cooling schedule. Noise benefits three distinct performance metrics. (a) Noise reduced convergence time by 76%. We define convergence time as the number of steps the simulation takes to estimate the global minimum energy with error < 10⁻³. Simulations with faster convergence will generally find better estimates given the same computational time. (b) Noise improved the estimate of minimum system energy by two orders of magnitude in simulations with a fixed run time (t_max = 10⁶). Figure 4.2 shows how the estimated minimum corresponds to samples. Noise increased the breadth of the search and pushed the simulation to make good jumps toward new minima. (c) Noise decreased the likelihood of failure in a given trial by almost 100%. We defined a simulation as a failure if it did not converge by t = 10⁷. This is about 20× longer than the average convergence time.
4.5% of noiseless simulations failed. A failed simulation does not produce any sign of failure except an increased estimate variance between trials. Noisy simulated annealing produced 2 failures in 1000 trials (0.2%).

(a) Exponential cooling schedule (b) Geometric cooling schedule

Figure 4.4: Noise benefits decrease convergence time under accelerated cooling schedules. Simulated annealing algorithms often use an accelerated cooling schedule such as exponential cooling T_exp(t) = T_0 A^t or geometric cooling T_geom(t) = T_0 exp(−A t^{1/d}), where A < 1 and T_0 are user parameters and d is the sample dimension. Accelerated cooling schedules do not have convergence guarantees like log cooling T_log(t) = T_0 / log(t + 1) but often provide better estimates given a fixed run time. Noise-enhanced simulated annealing reduced convergence time under an (a) exponential cooling schedule by 40.5% and under a (b) geometric cooling schedule by 32.8%.

whether a specific mechanism is probable. This section presents an overview of docking simulations as an illustrative molecular dynamics case study.

Simulations Track Billions of Forces

Computer simulations of enzyme-ligand binding are complicated simulations that occur in the virtual reality of a computer's memory. Billions of forces require trillions of calculations within each simulation with the goal of determining precise interactions. Modeling only a few nanoseconds of reaction time often takes hours or days of processing time. The problem of enzyme-ligand binding breaks into two problems: predicting ligand orientation and predicting binding affinity. An ideal molecular dynamics method considers both facets and allows determination of both pre- and post-binding phenomena. Such requirements are not yet computationally practical and studies usually focus on the most relevant aspect. Lead detection is a problem that ignores mechanistic details of the binding process and instead focuses only on the binding affinity.
Lead detection involves selecting a few compounds from a vast library of potential ligands that will most likely elicit a response, such as enzymatic inhibition, when reacted with an enzyme. After selecting the candidates the problem shifts to teasing out the actual reaction mechanism and then determining the structure-activity relationship (SAR). These data often suggest novel species that meet the constraints posed by the SAR (e.g. steric, electrostatic, and aromatic considerations).

Pharmaceuticals use Molecular Dynamics to Spot Leads in Immense Compound Libraries

Molecular dynamics methods enter the scope of pharmaceuticals because drug development commonly focuses on the actions of a single enzyme. Estimates of the target enzyme's structure are often enough to facilitate advances in designer ligands that react in a tailored way. Structural information is obtained from crystallographic studies for many enzymes. But these generate approximations of the protein conformation under experimental conditions. Relaxation methods can bring the protein into a more accurate form using the experimental conformation data as initial conditions. Molecular dynamics simulations can then accept or reject thousands of trial compounds with relatively high confidence. But successful simulations often take weeks or months to prepare and program. Reliable simulations require working knowledge of the tools. The advantages and disadvantages of each tool require deep background knowledge. The next section presents an overview of molecular dynamics and then introduces specialized algorithms. It then briefly discusses two other methods: genetic and evolutionary algorithms and point-complementary methods. The section concludes with a molecular dynamics case study that identified probable HIV-1 integrase inhibitors to combat the HIV virus.
Review of Molecular Computer Modeling Methods

Traditional biochemical molecular dynamics modeling algorithms split into two components: a search strategy and a scoring function. The search function determines which configurations to examine at each time-step. Ideally the search function should include a guarantee that it will eventually select the optimum configuration. For example there are more than 2.4 × 10¹⁵ possible alignments in a simple system that comprises a ligand with four rotatable bonds and six rigid-body alignment parameters, with angular sampling at 10° and spatial sampling at 0.5 Å. Each alignment requires complete characterization of the forces between the enzyme and ligand. Exhaustive search at 10 configurations per second would require 2,000,000 years of computation! It would take trillions of years to finish if the problem were actually to identify the single compound with the highest affinity for the protein from the 2.7 million known compounds.

Molecular dynamics must reduce the search space even further than allowed by traditional spatial and temporal sampling. Some methods make the restrictive assumption that the ligand and enzyme behave in a completely rigid manner. This reduces the problem to 6 degrees of rotational and translational freedom of the ligand and leads to large speed boosts. For many problems this interaction assumption does not limit the estimate because protein conformation sometimes changes little during binding even if changes occur after binding. Many ligands may also hold steady conformations until they bind to the protein completely, or they may exhibit freedom in only one or two predictable torsional dimensions. But these assumptions are invalid in many cases and only hard chemistry or other brute-force techniques may identify the actual dynamics. More powerful molecular dynamics algorithms retain flexible-ligand assumptions.
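The combinatorial explosion described above can be checked with back-of-the-envelope arithmetic. The grid extents below are assumptions chosen only to illustrate the order of magnitude; the source states only the sampling resolutions (10°, 0.5 Å), four rotatable bonds, and six rigid-body parameters.

```python
# Order-of-magnitude sketch of the docking search space.
# Assumed grids: 3 Euler angles and 4 torsions at 10-degree resolution,
# 3 translation axes at 0.5 A resolution over an assumed 10 A box.
angular_steps = 360 // 10             # 36 samples per angle
torsions = angular_steps ** 4         # 4 rotatable bonds
rigid_rotations = angular_steps ** 3  # 3 rigid-body Euler angles (assumed)
translations = int(10 / 0.5) ** 3     # 3 axes over an assumed 10 A box
configs = torsions * rigid_rotations * translations
years = configs / 10 / (3600 * 24 * 365)  # at 10 configurations per second
print(f"{configs:.1e} configurations, ~{years:.1e} years exhaustive search")
```

Even with these modest assumed grids the count lands in the 10¹⁴–10¹⁵ range and the exhaustive-search time in the millions of years, which is the point of the passage above.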
The most powerful algorithms allow flexible enzyme binding sites by either making compromising assumptions elsewhere in the state space or by using heuristic approaches to eliminate dead-end branches of the search tree. An almost universal simplifying assumption is that molecular dynamics simulations occur in vacuo (within a vacuum environment). This has the consequence of removing aqueous molecules such as water even though these sometimes play critical roles in the binding process. Taylor documented this limitation during the study of HIV-1 transcriptase inhibitors by showing that successful docking with the protein required mediation by several water molecules [338]. The reaction did not proceed as expected in the absence of water. More advanced methods include solvent effects to some degree but their inclusion is somewhat ad hoc. Some measure of scoring is usually required after successful completion of the simulation. For instance the problem might require an estimate of binding affinity or inhibitory activity (e.g. pIC50, the ligand concentration required to obtain 50% inhibition). Scoring functions are very complicated and may require millions of calculations per estimate in these simulations. Scoring functions often employ internal simplifications such as considering only the most important subset of intramolecular forces such as hydrogen bonds or van der Waals interactions. Simpler functions such as RMSD (root mean-squared distance or Euclidean distance) take the place of generally applicable functions (Figure 4.5) to further increase speed. These simplified scoring measures make it possible to benchmark one molecular dynamics algorithm against another by considering a set of test compounds with known final configurations. But simplifications bring additional constraints. For instance the RMSD scoring function requires that the placement of the ligand in the peptide complex be known a priori.
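A minimal sketch of the simple RMSD scoring measure described above, assuming the pose and reference are given as matched lists of Cartesian atomic coordinates (no alignment or superposition step, which real docking benchmarks may also perform):

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-squared distance between two equal-length lists of
    (x, y, z) atomic coordinates. Assumes atoms are already matched
    one-to-one, i.e. the ligand placement is known a priori."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# hypothetical two-atom pose vs. reference, coordinates in angstroms
pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
ref  = [(0.0, 0.0, 0.0), (1.5, 0.0, 1.0)]
```

Here `rmsd(pose, ref)` is sqrt(1.0 / 2) ≈ 0.707 Å, which a 1 Å acceptance threshold like the one cited for fragment methods below would count as a successful reproduction of the reference structure.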
Comparison of the RMSD values then gives some idea of their respective performance.

Figure 4.5: Example RMSD scoring measurements. (Upper Left) The centroid-offset of each ring complex defines ligand position and conformation. (Upper Right) A measure of the displacements from a representative subset of atoms. This is a useful method in cases where a specific configuration of reactive atoms drives the interaction. (Lower Right) Maximum Common Substructure (MCS) selects a template molecule and scores estimates against the reference. MCS applies in cases similar to the previous method.

Flavors of MD Simulations Trade Completeness for Speed

Force-Motion-Body Simulations

Force-motion-body simulations are a brute-force technique in molecular dynamics. They are historically considered the first introduced simulation methods (Figure 4.6). The algorithms require calculation of the solutions to Newton's equations of motion at each timestep. The solutions lead to a gradient descent where the algorithm chooses the next step by minimizing the local energy at each frame. The gradient operator associates a vector in the direction of steepest descent with each point on the energy surface. Iteratively stepping along these gradient vectors drives the solution toward valleys in the energy surface. On occasion the valleys correspond to a global minimum. But the algorithm most often encounters local minima. These minima plague less sophisticated MD algorithms while more robust methods can jump out of these local minima by perturbing the solution over peaks. Force-motion-body estimates depend heavily on the initial molecular state because of the application of iterative minima searching.

Figure 4.6: One method of molecular dynamics simulation using the QXP force field model and GRID energy calculation. (1) The simulation selects the target protein and determines its initial conformation. (2) It computes the molecular interaction potentials (MIP) within the protein.
(3 & 4) It uses the MIPs to determine where the ligand may bind. (5 & 6) It performs optimizing calculations to find the optimal binding configuration, which coincides with the global energy minimum [43].

The earliest force-motion-body simulations relied on simplifying constraints during the simulation. The motion equations were often computationally unsolvable with the degrees of freedom from the hundreds of amino acids and rotatable bonds in the protein. One constraint involved the rigidity of the enzyme and docking ligand. Most algorithms consider the protein host as a rigid body. The freedom of the ligand (rigid or flexible) varies from algorithm to algorithm. This constraint does not pose a serious limitation because protein conformation changes do not occur during ligand binding in some cases. This does not discount mechanisms that result in a post-binding conformational change such as O2 binding to hemoglobin. But new studies illustrate classes of proteins where this assumption is less valid. More modern molecular dynamics methods such as those proposed by Wang and Pak allow simulations to include flexible ligands at the cost of a higher computational burden. Simulations using force-motion-body progressions from near-optimal initial conditions are very powerful and are implemented in the software packages AMBER and CHARMM. Simulations based entirely on force-motion-body calculations have become less relevant as more powerful (statistically and computationally) techniques became available.

Markov Chain Monte Carlo

Monte Carlo methods are arguably the most widely implemented molecular dynamics method. The name itself stems from the game roulette (common in Monte Carlo) in which a player places bets on a subset of squares chosen randomly from the entire event space. The algorithm proceeds in a similar manner. It generates a small subset of candidate solutions from the configuration space.
Monte Carlo methods generate subsequent solutions by starting with the current state and then perturbing it in some way. It evaluates the proposals after each generation based on a set of criteria such as potential energy state or total steric hindrance. Monte Carlo methods iteratively select a set of most likely models. Simulations often process these with force-motion-body methods to find nearby local minima. The algorithm selects the smallest minimum after each generation and then it repeats the process. This design overcomes the sensitivity of the force-motion-body models to the initial conditions. It acts by choosing hundreds (or thousands) of possible starting states and then proceeds with the best solutions. Different algorithms apply different selection criteria to limit which elements they choose during the random assignment. The most widely implemented is called the Metropolis (or standard) Monte Carlo. The algorithm applies small random Cartesian moves to the system. It evaluates the fitness of each move using an energy calculation based on the Boltzmann probability distribution. The algorithm then decides whether the step is likely or not. If the algorithm finds a very highly probable step it will immediately accept this as the new state or minimize it using gradient-descent methods (e.g. AMBER) before it proceeds to the next step. These implementation details make Monte Carlo methods very attractive to molecular chemists because they allow flexibility which chemists can exploit to reduce computational expense at the cost of a small degree of accuracy. The packages AutoDock and Prodock among others provide Monte Carlo functionality to chemists. Each of these programs has the capability to deal with flexible ligands and they include allowances for flexible binding sites in the enzyme. These inclusions make Monte Carlo based prediction superior to their entirely force-motion-body dynamics based counterparts.
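The Metropolis acceptance rule described above fits in a few lines. This is the textbook criterion, not any docking package's internal code; units are chosen so that k_B = 1.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng=random.random):
    """Standard Metropolis criterion: always accept downhill moves
    (delta_e <= 0); accept uphill moves with Boltzmann probability
    exp(-delta_e / T). Units chosen so that k_B = 1."""
    if delta_e <= 0:
        return True
    return rng() < math.exp(-delta_e / temperature)
```

At T = 1 an uphill move with delta_e = 1 is accepted with probability e⁻¹ ≈ 0.37; as T falls the chain becomes greedy, which is exactly the knob a cooling schedule turns in simulated annealing.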
Fragment Based

DOCK incorporates fragment-based methods into one of the most successful molecular simulation packages. Several other packages (ADAM and FlexX) also make use of fragment-based methods. Fragment methods break the ligand molecule into many smaller sub-pieces which the algorithm then sequentially binds to the enzyme head to tail. The algorithms also apply gradient-descent or similar cost-minimizing methods during each docking stage to obtain the time-dependent process of the fragment docking. These methods produce astounding results and often yield results with RMSD less than 1 Å from experimentally obtained results. Studies very often use these methods for lead detection because they can quickly evaluate a subset of ligand fragments and determine if the subject leads to a viable candidate compound. For example a study correctly predicted biotin (a ligand with a very high affinity for binding to streptavidin) as the highest-scoring ligand while searching for the most efficient binding to the streptavidin protein from a reference library of 80,000 candidates.

Figure 4.7: The FlexX algorithm partitions major reactive substructures called fragments from the ligand. It attempts to optimally bind each of these in turn to the target protein. It averages many near-optimal binding configurations together to obtain a more accurate representation. It also uses torsion-angle relationships to determine the optimal internal alignment of the bound ligand. It finally runs the results through an additional processing step to bring the reconstituted ligand-enzyme complex to its lowest energy.

The fragment-based methods consider only one fragment at a time and can therefore afford complete intra-fragment flexibility. As the fragments come together they introduce constraints on the available orientations of succeeding fragments. But the fragmentation methodology imparts a heavy cost on the sensitivity of fragment-based methods.
If fragmentation inadvertently disjoins functional regions of the ligand it can lead to a complete loss of chemical functionality in the process. This may then result in missing critical regions of the energy surface and thereby inaccurate chemical mechanisms. Fragmentation-based simulations require expertise in the chemistry of the functional units. To compensate, molecular dynamics simulations often simulate competing fragmentation methods for results comparison. The MIMUMBA torsion-angle database included within the FlexX package successfully automates this process. This makes fragment-based methods very popular despite the fact that most of the algorithms lack the ability to consider flexible enzymes.

Other Methods

This section concludes with brief summaries of several other molecular dynamics methods that have gained acceptance in the field.

Systematic Searches

Systematic searches perform direct translation and rotation of the ligand and enzyme to obtain the optimal solution at each timestep. The algorithm subsamples the configuration space because of its immense size to reduce computation cost. The method couples the search with a fast affine transform. This allows the search to complete in realizable time. This search differs from all of the other methods because it does not descend along a trajectory to find the global minimum of the energy surface. Instead it attempts to exhaustively map the configuration space onto the scoring function to obtain the minimum. The SYSDOC program originally implemented this method and the newer EUDOC extends the capabilities of the earlier program. But a systematic search of rigid-body rotations and translations of a rigid ligand within a rigid active site still limits the newest EUDOC program.

Genetic Algorithms and Evolutionary Programming

Genetic algorithms are a relatively new molecular dynamics paradigm. Genetic algorithms internalize artificial competition between probable candidates.
The algorithms select winners from each generation with a survival-of-the-fittest heuristic. They code salient characteristics of a good candidate into genes. Simulations create new offspring at each timestep by random combination of characteristics of the most successful members of the previous generation (previous timestep). Genetic methods also introduce crossovers and point mutations to increase the diversity of the gene pool. New combinations of traits emerge by recombining the fittest members from each generation. This drives the mechanism toward the most probable estimate. Crossovers enable previously useful portions of sub-optimal estimates to repeat while point mutations enable the emergence of novel genes. Many other fields of study use genetic and evolutionary methods including transistor circuit layouts and NP-complete mathematical problems. They are a relative newcomer to the molecular dynamics field (< 10 years) but they have considerable acceptance and are becoming very popular in the literature.

Point-Complementary

Point-complementary methods span a range of simulation techniques. These algorithms tend to deal specifically with steric effects and related coupling effects (e.g. hydrogen bonding). The fact that until recently they were constrained to consideration of rigid receptors and ligands has limited the scope of these algorithms somewhat. This is a serious disadvantage with respect to enzyme-ligand simulation. But the algorithm constraints make it ideal for studying complex protein-protein interactions. Many point-complementary methods apply to interactions of four or five proteins in super-complexes. The point-complementary methods generally consider the enzyme and its docking partner as space-filling models. Often the algorithms use small cubes to generate volumetric models because the 6 transverse sides make calculating force interactions between faces much easier.
New methods also apply sphere-packing algorithms to increase their accuracy. Once the algorithm creates a space model for each component it computes interactions that minimize steric collisions. The algorithm then fine-tunes the models to maximize the effects of hydrogen bonds and other coupling forces. MULTIDOCK and FTDOCK are two software packages that perform this classical form of point-complementary simulation. Some new methods such as the FLOG (flexible ligand orienting grid) package allow for some soft docking to compensate for the rigidity requirements.

Taboo Searches

The taboo search is a stochastic evolution of the generalized scoring function ChemScore [338]. It descends the energy gradient curve similar to the force-motion-body methods but it instead samples space at a few discrete regions. The process maintains a FIFO (first-in-first-out) taboo list with the 25 most recently used conformations or states to ensure sampling diversity. The algorithm typically generates and scores a set of 100 possible solutions during the selection. If the lowest energy solution from the ranked population is the lowest energy so far the algorithm always accepts this as the new current solution. But if the lowest energy solution from the ranked population is not the lowest energy so far the algorithm uses the best non-taboo solution. A move is considered taboo if it is within an RMSD of 0.75 Å of any solution stored in the taboo list. The process repeats during each iteration for a user-defined number of times. The algorithm selects the final configuration from the repeated step at the end of each iteration and proceeds until it reaches a user-defined termination condition.

Hybrid methods

The most successful molecular dynamics simulation methods combine several components into one cohesive unit. Markov Chain Monte Carlo introduced this idea first by using random fluctuations to attain the global minimum from force-motion-body methods.
Combining methods garners the strengths of each sub-algorithm and balances their limitations against the strengths of the coupled sub-algorithms. Problems involving flexible ligand binding to a flexible protein scaffold have seen the greatest benefit from these mergers. The gain comes from using simpler methods to compute gross macroscopic models of the mechanism and then using more detailed algorithms (such as systematic searches or direct molecular dynamics calculations) to refine estimates. The hybrid algorithms reap more benefit by pruning improbable mechanisms at the computationally inexpensive macroscopic level since detailed algorithms cannot predict such dead ends until they reach them. The following case study [228] illustrates the difficult nature of drug discovery due to the scale of enzyme-ligand molecular dynamics simulations.

A case study: 3D-QSAR HIV-1 integrase inhibitors

Makhija and Kulkarni applied 3D-QSAR (quantitative structure-activity relationship) methods to the screening of HIV-1 integrase inhibitors [228]. The HIV-1 integrase protein is used by the HIV retrovirus to splice (integrate) its own genetic material into the host cell. This process consists of 3 steps:

3' processing. The integrase protein cleaves two nucleotides from each strand of DNA generated by HIV reverse-transcriptase. This exposes reactive OH species on each 3' end.

End joining. The processed 3' ends are then joined to 5' ends in the host DNA at the site of integration. This is often called strand transfer.

DNA repair synthesis. Host DNA repairases are recruited to ligate the gaps generated during the integration process. Following this, the viral RNA is successfully embedded in the host and synthesis begins.

They applied CoMSIA and CoMFA simulation models to the inhibitor problem. CoMSIA and CoMFA are two specialized types of molecular dynamics simulation.
Each makes different assumptions to simplify the force-motion-body relations and reduce the configuration space to a tractable size. The authors selected and tested 27 compounds belonging to the class of thiazolothiazepines (template Figure 4.8) for the purpose of training and calibration. Thiazolothiazepines inhibit integrase function by interfering with the 3' processing step (#1). In addition to including many potent inhibitors, some of these compounds exhibit very little inhibitory effect, creating an activity baseline. After training the authors selected two sets of compounds for further simulation. The first test set included 7 new thiazolothiazepines and the second test set consisted of 4 molecules from the cumarin class that impede the 3' end-joining step (#2). Analysis of the results indicated that there is a strong correlation between the inhibitory activity of a compound and the steric and electrostatic fields around it in both classes of ligand inhibitors. It also found that hydrophobic interactions were poorly correlated with inhibitory activity. This indicated that the binding of these compounds to the active site was mainly enthalpic in nature. Both models provided reasonable accuracy when compared to the a priori mechanisms of some of the enzyme-ligand interactions. But CoMSIA presented greater internal and external prediction capabilities against crystallographic benchmark study results. The study concludes by using the CoMFA and CoMSIA models to generate a single metric indicative of a ligand's inhibitory potential (measured as pIC50: the ligand concentration required to reduce enzyme activity by 50%). But these models also produce sensitivity maps detailing the mechanistic binding between the ligand and enzyme (Figure 4.9). The models derive these powerful maps by correlating the characteristics of each of the tested ligands with its expected activity.
This increases the research potential since the analysis includes more than just a pictorial representation of the interaction: it can also indicate regions in the binding complex that depend heavily on the presence or

Figure 4.8: Makhija and Kulkarni considered 27 compounds in the HIV-1 integrase inhibitor case study. This figure shows six representative molecules from their dictionary. The R groups correspond to various side-chains enumerated in the study. They tested 48 candidates in total after considering variations.

absence of electrostatic forces and steric collisions. Similar studies can use these maps to augment the initial database with novel compounds and even novel molecular classes of compounds that meet the constraints outlined by the prospective interactions.

Figure 4.9: Maps indicating spatial sensitivity of the interaction to electrostatic (left) and steric (right) factors. In the left pane blue regions encompass regions where an increase in positive charge enhances affinity. The red contoured areas show where negative charges provide favorable binding. The right pane shows steric interaction maps for two compounds from the HIV-1 integrase inhibitor case study. The green contours enclose areas where steric bulk may enhance affinity. The yellow contours indicate regions that simulations predict will decrease binding affinity if they become occupied.

Noise speeds Lennard-Jones 12-6 simulations

The second simulation shows a noise benefit in an MCMC molecular dynamics model. This model used the Noisy Simulated Annealing algorithm (Algorithm 4.1) to search a 24-dimensional energy landscape. It used the Lennard-Jones 12-6 potential well to model the pairwise interactions in an 8-argon-atom gas.

Figure 4.10: Superposition of CoMFA contour plots on the active site of HIV-1.
The Lennard-Jones (12-6) potential well approximates the interaction energy between two neutral atoms [216, 217, 306]:

V_LJ = ε [ (r_m/r)^12 − 2 (r_m/r)^6 ] (4.56)
     = 4ε [ (σ/r)^12 − (σ/r)^6 ] (4.57)

where ε is the depth of the potential well, r is the distance between the two atoms, r_m is the interatomic distance corresponding to the minimum energy, and σ is the zero-potential interatomic distance. Figure 4.11 shows how the two terms interact to form the energy surface: (1) the 12-term dominates at short distances since overlapping electron orbitals cause strong Pauli repulsion to push the atoms apart and (2) the 6-term dominates at longer distances because van der Waals and dispersion forces pull the atoms

Figure 4.11: The Lennard-Jones 12-6 potential well approximates pairwise interactions between two neutral atoms. The figure shows the energy of a two-atom system as a function of the interatomic distance. The well is the result of two competing atomic effects: (1) overlapping electron orbitals cause strong Pauli repulsion to push the atoms apart at short distances and (2) van der Waals and dispersion attractions pull the atoms together at longer distances. Three parameters characterize the potential: (1) ε is the depth of the potential well, (2) r_m is the interatomic distance corresponding to the minimum energy, and (3) σ is the zero-potential interatomic distance. Table 4.1 lists parameter values for argon.

toward a finite equilibrium distance r_m. Table 4.1 shows the values of the Lennard-Jones parameters for argon.

Table 4.1: Argon Lennard-Jones 12-6 parameters
ε   1.654 × 10⁻²¹ J
σ   3.404 × 10⁻¹⁰ m
r_m 3.405 Å

The simulation estimated the minimum energy coordinates for 8 argon atoms in 3 dimensions. We performed 200 trials at each noise level. We summarized each trial as the average number of steps to estimate the minimum energy within 10⁻². Figure 4.12 shows that noise produces a 42% reduction in convergence time over the non-noisy simulation.
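A direct transcription of Eq. (4.57) with the argon parameters of Table 4.1; note that Eqs. (4.56) and (4.57) together imply r_m = 2^(1/6) σ. The `total_energy` helper is a hypothetical name for the pairwise-summed cost surface (8 atoms × 3 coordinates = 24 dimensions) that the annealer searches; it is not from the dissertation's code.

```python
import math

def lennard_jones(r, epsilon=1.654e-21, sigma=3.404e-10):
    """Lennard-Jones 12-6 pair potential, Eq. (4.57):
    V(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6).
    Defaults are the argon values of Table 4.1 in SI units."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def total_energy(coords, **kw):
    """Sum of pair potentials over all atom pairs -- the 24-dimensional
    energy landscape for the 8-argon-atom system (hypothetical helper)."""
    e = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            e += lennard_jones(math.dist(coords[i], coords[j]), **kw)
    return e

r_min = 2 ** (1 / 6) * 3.404e-10  # pair-equilibrium distance r_m = 2^(1/6)*sigma
```

By construction V(σ) = 0 and V(r_m) = −ε, which matches the roles of σ and ε in the text.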
Figure 4.12: Noise benefit for an MCMC molecular dynamics simulation. Noise decreases the convergence time for an MCMC simulation to find the energy minimum by 42%. The plot shows the number of steps that an MCMC simulation needs to converge to the minimum energy in an eight-argon-atom gas system. The optimal noise had a standard deviation of 0.64. The plot shows 100 noise levels with standard deviations between 0 (no noise) and σ = 3. Each point averages 200 simulations and shows the average number of MCMC steps required to estimate the minimum to within 0.01. We modeled the interaction between two argon atoms with the Lennard-Jones 12-6 model with ε = 1.654 × 10⁻²¹ J and σ = 3.405 × 10⁻¹⁰ m = 3.405 Å [306].

Chapter 5

Noisy Simulated Quantum Annealing

This chapter shows that carefully injected noise can speed the convergence of quantum annealing algorithms. The major result is the noise-boosted quantum annealing (QA) algorithm. Simulated quantum annealing uses a network of specially connected copies of the energy surface to model quantum evolution within a classical computer. The noise-boosted algorithm describes how to inject noise between the virtual copies to induce quantum speedup. Simulations show that it improves ground-state energy calculations and increases convergence speed on high-dimensional energy surfaces.

5.1 Quantum Annealing

Quantum annealing (QA) uses quantum forces to evolve the state according to the quantum Hamiltonian instead of the thermodynamic excitation in simulated (classical) annealing. Simulated quantum annealing uses an MCMC framework to simulate draws from the square of the wave function instead of solving the time-dependent Schrödinger equation:

i\hbar \frac{\partial}{\partial t} \psi(\mathbf{r}, t) = \left[ -\frac{\hbar^2}{2\mu} \nabla^2 + V(\mathbf{r}, t) \right] \psi(\mathbf{r}, t)    (5.1)

where μ is the particle's reduced mass, V is the potential energy, ∇² is the Laplacian operator, and ψ is the wave function.
The classical simulated annealing acceptance probability is proportional to the ratio of a function of the energies of the old and new states. This can prevent beneficial hops if there are energy peaks between minima. Quantum annealing introduces probabilistic tunneling to allow occasional jumps through high energy peaks.

Ray and Chakrabarti [289] recast Kirkpatrick's thermodynamic simulated annealing using quantum fluctuations. The result is called quantum annealing. The algorithm introduces a transverse magnetic field in place of the temperature T in classical simulated annealing. The strength of the magnetic field governs the transition probability of the system. The adiabatic theorem ensures that the system remains near the ground state during slow changes of the field strength. Adiabatically interpolating between the initial Hamiltonian H_0 and the problem Hamiltonian H_P

H(t) = \left( 1 - \frac{t}{T} \right) H_0 + \frac{t}{T} H_P    (5.2)

then gives the minimum-energy configuration of the underlying potential energy surface as time t approaches a fixed large value T.

Simulated quantum annealing for an Ising spin glass usually applies the Edwards-Anderson model Hamiltonian with a transverse magnetic field J_⊥:

H = U + K = -\sum_{\langle ij \rangle} J_{ij} s_i s_j - J_\perp \sum_i s_i .

The transverse field J_⊥ and the classical couplings J_{ij} have a nonzero commutator in general:

\left[ J_\perp, J_{ij} \right] \neq 0    (5.3)

where the commutator operator is [A, B] = AB − BA. The path-integral Monte Carlo (PIMC) method is a standard quantum annealing method [231] that uses the Trotter approximation for non-commuting quantum operators:

e^{-\beta (K + U)} \approx e^{-\beta K} e^{-\beta U}    (5.4)

where [K, U] ≠ 0 and β = 1/(k_B T). The Trotter theorem shows that the approximation is asymptotically exact [314]. The Trotter theorem is equivalent to the following theorem. The proof uses the notion of a contraction semigroup to show that an asymptotically equivalent operator exists.

Definition 5.1. A contraction semigroup on a Banach space V̂ is a family of bounded linear operators P̂_t with 0 ≤ t < ∞ defined everywhere on V̂ that constitute mappings such that P̂_0 = 1.
The family further satisfies:

1. P̂_t P̂_s = P̂_{t+s} for t ≥ 0 and s ≥ 0.
2. lim_{t→0} P̂_t φ = φ for every φ ∈ V̂.
3. ‖P̂_t‖ ≤ 1,

where the norm is

\| \hat{P}_t \| = \inf \left\{ \lambda : \| \hat{P}_t \varphi \| \leq \lambda \| \varphi \| \text{ for all } \varphi \in \hat{V}, \, \| \varphi \| \leq 1 \right\}.    (5.5)

Theorem 5.1. Let P̂ and Q̂ be linear operators on a Banach space V̂. Let φ ∈ V̂. Then there is a linear operator R̂ on V̂ such that

\left\| \hat{R}_t \varphi - \lim_{n \to \infty} \left( \hat{P}_{t/n} \hat{Q}_{t/n} \right)^n \varphi \right\| = 0 .    (5.6)

Proof. Let Â and B̂ be the infinitesimal generators of the contraction semigroups P̂_t and Q̂_t. Thus

\hat{A} \varphi = \lim_{t \to 0} \frac{1}{t} \left( \hat{P}_t - 1 \right) \varphi    (5.7)

for vectors φ ∈ V̂. Let h > 0. Then

\hat{P}_h \hat{Q}_h - 1 = \left( \hat{P}_h - 1 \right) + \hat{P}_h \left( \hat{Q}_h - 1 \right) .    (5.8)

Rewriting the above with Â and B̂ gives

\left( \hat{P}_h \hat{Q}_h - 1 \right) \varphi = h \left( \hat{A} + \hat{B} \right) \varphi + O(h)    (5.9)

where O(h) denotes any vector ε ∈ V̂ such that

\lim_{h \to 0} \frac{\| \varepsilon \|}{h} = 0 .    (5.10)

Also

\left( \hat{R}_h - 1 \right) \varphi = h \left( \hat{A} + \hat{B} \right) \varphi + O(h) .    (5.11)

Therefore

\left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) \varphi = O(h) .    (5.12)

Let h = t/n. Then it remains to show that

\lim_{n \to \infty} \left[ \left( \hat{P}_h \hat{Q}_h \right)^n - \hat{R}_{nh} \right] \varphi = 0 .    (5.13)

Expanding the inner term telescopically gives

\left( \hat{P}_h \hat{Q}_h \right)^n - \hat{R}_{nh} = \left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) \hat{R}_{(n-1)h} + \hat{P}_h \hat{Q}_h \left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) \hat{R}_{(n-2)h} + \cdots    (5.14)

    + \left( \hat{P}_h \hat{Q}_h \right)^{n-1} \left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) .    (5.15)

Applying the operator to φ and taking the norm gives, after the triangle inequality,

\left\| \left[ \left( \hat{P}_h \hat{Q}_h \right)^n - \hat{R}_{nh} \right] \varphi \right\| \leq \left\| \left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) \hat{R}_{(n-1)h} \varphi \right\| + \left\| \hat{P}_h \hat{Q}_h \left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) \hat{R}_{(n-2)h} \varphi \right\| + \cdots    (5.16)

    + \left\| \left( \hat{P}_h \hat{Q}_h \right)^{n-1} \left( \hat{P}_h \hat{Q}_h - \hat{R}_h \right) \varphi \right\| .    (5.17)

There are n terms of order O(h) on the right-hand side. Thus the right side varies as nO(h) = nO(t/n). But

n O\!\left( \frac{t}{n} \right) \to 0    (5.18)

as n → ∞. Therefore

\lim_{n \to \infty} \left[ \left( \hat{P}_h \hat{Q}_h \right)^n - \hat{R}_{nh} \right] \varphi = 0 .    (5.19)

Applying the Trotter approximation yields an estimate of the partition function:

Z = \mathrm{Tr} \, e^{-\beta H}    (5.20)

  = \mathrm{Tr} \left[ \exp\!\left( -\frac{\beta (K + U)}{P} \right) \right]^P    (5.21)

  = \sum_{s_1} \cdots \sum_{s_P} \langle s_1 | e^{-\beta (K+U)/P} | s_2 \rangle \langle s_2 | e^{-\beta (K+U)/P} | s_3 \rangle \cdots \langle s_P | e^{-\beta (K+U)/P} | s_1 \rangle    (5.22)

  \approx C^{NP} \sum_{s_1} \cdots \sum_{s_P} e^{-H_{d+1}/(PT)}    (5.23)

  = Z_P    (5.24)

where N is the number of lattice sites in the d-dimensional Ising lattice, P is the number of imaginary-time slices called the Trotter number,

C = \sqrt{ \frac{1}{2} \sinh\!\left( \frac{2}{PT} \right) }    (5.25)

and

H_{d+1} = -\sum_{k=1}^{P} \left( \sum_{\langle ij \rangle} J_{ij} s_i^k s_j^k + J_\perp \sum_i s_i^k s_i^{k+1} \right) .    (5.26)

Martoňák studied the dependence of the path-integral Monte Carlo annealing algorithm on the choice of the number of Trotter slices. The product PT determines the spin replica couplings between neighboring Trotter slices and between the spins within slices. Shorter simulations did not show a strong dependence on the number of Trotter slices P. This is likely because shorter simulations spend relatively less time under the lower transverse magnetic field needed to induce strong coupling between the slices. Thus the slices tend to behave more independently than if they evolved under the increased coupling from longer simulations. High Trotter numbers (N = 40) showed substantial improvements for very long simulations. They compared the high-Trotter simulations to classical annealing and computed that path-integral quantum annealing gave a relative speedup of four orders of magnitude over classical annealing. They relate this fact by noting that "one can calculate using path-integral quantum annealing in one day what would be obtained by plain classical annealing in about 30 years."
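The first-order Trotter breakup in Equation 5.4 is easy to verify numerically for a pair of small non-commuting matrices. The sketch below (illustrative only, not the dissertation's code; the Pauli matrices σ_x and σ_z stand in for K and U) shows that the error of the split product shrinks as the Trotter number P grows:

```python
def mat_mul(A, B):
    """2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_exp(A, terms=40):
    """Matrix exponential by Taylor series; adequate for small 2x2 inputs."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for n in range(1, terms):
        term = mat_mul(term, A)
        term = [[term[i][j] / n for j in range(2)] for i in range(2)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result

def trotter_error(P, beta=1.0):
    """Max entrywise error of [exp(-b*K/P) exp(-b*U/P)]^P vs exp(-b*(K+U))
    with K = sigma_x and U = sigma_z, which do not commute."""
    step = mat_mul(
        mat_exp([[0.0, -beta / P], [-beta / P, 0.0]]),   # exp(-beta*K/P)
        mat_exp([[-beta / P, 0.0], [0.0, beta / P]]))    # exp(-beta*U/P)
    prod = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(P):
        prod = mat_mul(prod, step)
    exact = mat_exp([[-beta, -beta], [-beta, beta]])     # exp(-beta*(K+U))
    return max(abs(prod[i][j] - exact[i][j]) for i in range(2) for j in range(2))
```

The error falls roughly as 1/P, which is the sense in which the Trotter approximation becomes asymptotically exact as the Trotter number grows.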
Advances by D-Wave Systems have brought quantum annealers to market and shown how adiabatic quantum computers are suitable for solving real-world applications [317].

Quantum Computational Complexity

Complexity theory characterizes the inherent hardness or difficulty of computational problems [254, 191, 358]. It formalizes the idea that some problems are harder than others. Complexity theory classifies algorithms according to how solutions evolve over time and how the amount of time depends on the relative size of the problem.

Complexity theory works generally within the framework of alphabets, strings, and languages. Turing machines provide the standard means to specify computational problems and function evaluators.

Definition 5.2. A Turing machine is a process that operates equivalently to the process defined by the 7-tuple M = ⟨Q, Γ, b, Σ, δ, q_0, F⟩ where

- Q is a finite, non-empty set of states.
- Γ is a finite, non-empty set of alphabet symbols.
- b ∈ Γ is the blank symbol.
- Σ ⊆ Γ∖{b} is the set of input symbols.
- δ : (Q∖F) × Γ → Q × Γ × {L, R} is a partial function called the transition function. L is the left shift operation and R is the right shift operation.
- q_0 ∈ Q is the initial state.
- F ⊆ Q is the set of accepting states.

Definition 5.3. An algorithm is polynomial time if

T(n) = O(n^k)    (5.27)

where T(n) is the running time of the algorithm, n is the size of the algorithm input, and k < ∞ is a constant.

Decision problems form an important subset of the set of all computational problems. Decision problems require partitioning the space of all possible inputs into one of two classes. A decision problem is a pair D = (D_yes, D_no) of sets where D_yes, D_no ⊆ Σ* and D_yes ∩ D_no = ∅. Classical complexity theory uses the notion of complexity classes to assign a relative difficulty to particular decision problems. The following list describes the membership criteria for the most common complexity classes.
P, polynomial time. A problem D = (D_yes, D_no) is in P if and only if there exists a polynomial-time deterministic Turing machine M that accepts every string x ∈ D_yes and rejects every string x ∈ D_no.

NP, nondeterministic polynomial time. A problem D = (D_yes, D_no) is in NP if and only if there exists a polynomial-bounded function p and a polynomial-time deterministic Turing machine M with the following properties: for every string x ∈ D_yes it holds that M accepts (x, y) for some string y ∈ Σ^{p(|x|)}, and for every string x ∈ D_no it holds that M rejects (x, y) for all strings y ∈ Σ^{p(|x|)}.

BPP, bounded-error probabilistic polynomial time. A problem D = (D_yes, D_no) is in BPP if and only if there exists a polynomial-time probabilistic Turing machine M that accepts every string x ∈ D_yes with probability at least 2/3, and accepts every string x ∈ D_no with probability at most 1/3.

PP, probabilistic polynomial time. A problem D = (D_yes, D_no) is in PP if and only if there exists a polynomial-time probabilistic Turing machine M that accepts every string x ∈ D_yes with probability strictly greater than 1/2, and accepts every string x ∈ D_no with probability at most 1/2.

SZK. A problem D = (D_yes, D_no) is in SZK if and only if it has a statistical zero-knowledge interactive proof system.

PSPACE, polynomial space. A problem D = (D_yes, D_no) is in PSPACE if and only if there exists a deterministic Turing machine M running in polynomial space that accepts every string x ∈ D_yes and rejects every string x ∈ D_no.

EXP, exponential time. A problem D = (D_yes, D_no) is in EXP if and only if there exists a deterministic Turing machine M running in exponential time (running time bounded by 2^p for some polynomial-bounded function p) that accepts every string x ∈ D_yes and rejects every string x ∈ D_no.

NEXP, nondeterministic exponential time. A problem D = (D_yes, D_no) is in NEXP if and only if there exists an exponential-time non-deterministic Turing machine N for D.
PL. A problem D = (D_yes, D_no) is in PL if and only if there exists a probabilistic Turing machine M running in polynomial time and logarithmic space that accepts every string x ∈ D_yes with probability strictly greater than 1/2 and accepts every string x ∈ D_no with probability at most 1/2.

The complexity classes form two natural hierarchies:

PL ⊆ P ⊆ BPP ⊆ SZK ⊆ PSPACE ⊆ EXP ⊆ NEXP    (5.28)

PL ⊆ NC ⊆ P ⊆ NP ⊆ PP ⊆ PSPACE ⊆ EXP ⊆ NEXP    (5.29)

Complexity theory denotes the set of decision problems that a quantum computer can efficiently solve by a special class called BQP. The definition of BQP uses the concept of quantum circuit encodings.

Definition 5.4. A quantum circuit refers to any acyclic network of quantum gates connected by wires. Quantum gates refer to general quantum operations that operate on some constant number of qubits. Wires represent the qubits that the gates operate on.

The following Universality Theorem shows that quantum circuits can arbitrarily approximate quantum operations [254].

Theorem 5.2. Let Φ be an arbitrary quantum operation from n qubits to m qubits. Then for all ε > 0 there exists a quantum circuit Q with n input qubits and m output qubits such that δ(Φ, Q) < ε where

\delta(\Phi, Q) = \frac{1}{2} \| \Phi - Q \|_\diamond    (5.30)

is the distance induced by the diamond norm [190, 191]. Moreover, for fixed n and m, the circuit Q may be assumed to satisfy

\mathrm{size}(Q) = \mathrm{poly}\!\left( \log\!\left( \frac{1}{\varepsilon} \right) \right) .    (5.31)

Definition 5.5. Let S be any set of input strings. Then a collection {Q_x : x ∈ S} of quantum circuits is called polynomial-time generated if there exists a polynomial-time deterministic Turing machine that outputs an encoding of Q_x for every input x ∈ S.

BQP(a, b), bounded-error quantum polynomial time. A problem D = (D_yes, D_no) is in BQP(a, b) if and only if there exists a polynomial-time generated family of quantum circuits Q = {Q_n : n ∈ ℕ}, where each circuit Q_n takes n input qubits and produces one output qubit, that satisfies:

1. if x ∈ D_yes then Pr[Q accepts x] ≥ a(|x|).
2. if x ∈ D_no then Pr[Q accepts x] ≤ b(|x|).

for some functions a, b : ℕ → [0, 1].
BQP. A problem D = (D_yes, D_no) is in BQP if and only if D ∈ BQP(2/3, 1/3).

5.3 The Noisy Simulated Quantum Annealing Algorithm

Algorithm 5.1 describes the noisy quantum annealing algorithm. The algorithm uses the Trotter approximation to simulate quantum annealing within a classical computer. The algorithm injects noise at each time step and propagates changes along the Trotter ring (Figure 5.1). Quantum simulations generally work with qubits. While the qubits themselves are not in general restricted to particular states, measurement collapses the qubit wavefunction to one of two possible states. This justifies the Ising spin model as a general qubit modeling tool.

Algorithm 5.1 The Noisy Quantum Annealing Algorithm

1: procedure NoisySimulatedQuantumAnnealing(X, Γ_0, P, T)
2:   x_0 ← Initial(X)
3:   for t ← 0, N do
4:     Γ ← TransverseField(t)
5:     J_⊥ ← TrotterScale(P, T, Γ)
6:     for all Trotter slices l do
7:       for all spins s do
8:         x_{t+1}[l, s] ← Sample(x_t, J_⊥, s, l)
9: procedure TrotterScale(P, T, Γ)
10:   return −(PT/2) log tanh(Γ/(PT))
11: procedure Sample(x_t, J_⊥, s, l)
12:   ΔE ← LocalEnergy(J_⊥, x_t, s, l)
13:   if ΔE > 0 then
14:     return −x_t[l, s]
15:   else if Uniform[0, 1] < exp(ΔE/T) then
16:     return −x_t[l, s]
17:   else
18:     if Uniform[0, 1] < NoisePower then
19:       ΔE_+ ← LocalEnergy(J_⊥, x_t, s, l + 1)
20:       ΔE_− ← LocalEnergy(J_⊥, x_t, s, l − 1)
21:       if ΔE > ΔE_+ then
22:         x_{t+1}[l + 1, s] ← −x_t[l + 1, s]
23:       if ΔE > ΔE_− then
24:         x_{t+1}[l − 1, s] ← −x_t[l − 1, s]
25:     return x_t[l, s]

5.3.1 Noise improves quantum MCMC

The following simulation shows a noise benefit in simulated quantum annealing. It shows that noise that obeys a condition similar to the N-MCMC theorem improves the ground-state energy estimate.

Open problem: how does this abstraction tie back to true quantum annealing?

Figure 5.1: The noisy quantum annealing algorithm propagates noise along the Trotter ring. After each time step the algorithm inspects the local energy landscape.
It injects noise in the form of flipping the spin of Trotter neighbors. This in turn diffuses the noise across the network because correlations between the Trotter neighbors encourage convergence to the optimal solution.

We used path-integral Monte Carlo quantum annealing to calculate the ground state of a randomly coupled 1024-bit (32×32) Ising quantum spin system. The simulation used 20 Trotter slices to approximate the quantum coupling at temperature T = 0.01. It used 2-D periodic horizontal and vertical boundary conditions (toroidal boundary conditions) with coupling strengths J_ij drawn from Uniform[−2, 2]. Each trial used random initial spin states (s_i ∈ {−1, 1}). We used 100 pre-annealing steps to cool the simulation from an initial temperature T_0 = 3 to T_q = 0.01. The quantum annealing linearly reduced the transverse magnetic field from B_0 = 1.5 to B_final = 10⁻⁸ over 100 steps. After each update we performed a Metropolis-Hastings pass for each lattice site across each Trotter slice. We maintained T_q = 0.01 for the entirety of the quantum annealing. The simulation used the standard slice coupling between Trotter lattices

J_\perp = -\frac{PT}{2} \ln \tanh\!\left( \frac{B_t}{PT} \right)    (5.32)

where B_t is the current transverse field strength, P is the number of Trotter slices, and T = 0.01.

Figure 5.2: Simulated quantum annealing noise benefit in a 1024-spin Ising simulation. The pink line shows that noise improves the estimated ground-state energy of a 32×32 spin lattice by 25.6%.
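A minimal sketch of the per-sweep update makes the scheme concrete: the slice-coupling formula of Equation 5.32 plus a Metropolis pass with the conditional neighbor-slice flips of Algorithm 5.1. All function names and the tiny test system here are illustrative assumptions, not the dissertation's code:

```python
import math
import random

def trotter_coupling(B_t, P, T):
    """Slice coupling J_perp = -(P*T/2) * ln tanh(B_t / (P*T))  (Equation 5.32)."""
    return -(P * T / 2.0) * math.log(math.tanh(B_t / (P * T)))

def local_field(J, slices, i, l, j_perp):
    """Field at spin i of slice l: in-slice couplings plus the J_perp
    coupling to the same spin on the two neighboring Trotter slices."""
    P, n = len(slices), len(slices[0])
    h = sum(J[i][k] * slices[l][k] for k in range(n) if k != i)
    return h + j_perp * (slices[(l - 1) % P][i] + slices[(l + 1) % P][i])

def noisy_qa_sweep(J, slices, j_perp, T, noise_power, rng):
    """One Metropolis sweep with the noise step: when a flip is rejected,
    with probability noise_power flip the same spin on a neighboring slice
    if doing so lowers that neighbor's local energy."""
    P, n = len(slices), len(slices[0])
    for l in range(P):
        for i in range(n):
            # Energy change for flipping spin i in slice l (H = -s*h locally).
            dE = 2.0 * slices[l][i] * local_field(J, slices, i, l, j_perp)
            if dE < 0 or rng.random() < math.exp(-dE / T):
                slices[l][i] = -slices[l][i]
            elif rng.random() < noise_power:
                for dl in (-1, 1):
                    m = (l + dl) % P
                    dEn = 2.0 * slices[m][i] * local_field(J, slices, i, m, j_perp)
                    if dEn < 0:
                        slices[m][i] = -slices[m][i]
    return slices
```

As the transverse field B_t shrinks toward zero during the anneal, trotter_coupling grows, locking neighboring Trotter slices together and freezing out the quantum fluctuations.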
This plot shows the ground-state energy after 100 path-integral Monte Carlo steps. The true ground-state energy (red) is E_0 = −1591.92. Each point is the average calculated ground state from 100 simulations at each noise power. The blue line shows that blind (iid sampling) noise does not benefit the simulation. This shows that the N-MCMC condition is central to the S-QA noise benefit.

The simulation injected noise into the model according to a power parameter 0 < p < 1. The algorithm extended the Metropolis-Hastings test to each lattice site by conditionally flipping the corresponding site on coupled Trotter slices. We benchmarked the results against the true ground state E_0 = −1591.92 [66]. Figure 5.2 shows that noise that obeys the N-MCMC benefit condition improved the ground-state solution by 25.6%. This corresponds to a reduction in simulation time of several orders of magnitude since the estimated ground state largely converges by the end of the simulation. We were not able to quantify the decrease in convergence time because the non-noisy quantum annealing algorithm did not converge near the noisy quantum annealing estimate during any trial. Figure 5.2 also demonstrates that the noise benefit is not a simple diffusive benefit. For each trial we also computed the result of blind noise by injecting noise identical to the above but without regard for satisfying the N-MCMC condition. Figure 5.2 shows that blind noise reduces the accuracy of the ground-state estimate by 41.6%.

Chapter 6

Future directions

6.1 α-stable Noisy Simulated Annealing

The α-stable densities generalize the Gaussian density and they appear in a large number of practical settings through the generalized central limit theorems. Figure 6.1 shows that symmetric α-stable noise can lead to substantial MCMC performance benefits. Figure 6.2 provides intuition into the source of the benefit.
This suggests that MCMC algorithms could tune the noise tail thickness to further enhance the noise benefit.

6.2 Controlled Diffusions for Biochemical Optimization

Preliminary simulations show that additive Cauchy noise enables diffusion of the protein kinesin at loading levels 3 orders of magnitude greater than typical Brownian motion allows. The ATPase kinesin "walks" along intra-cellular microtubules rectified by the de-phosphorylation of ATP [111]. Brownian motion drives the walking portion of the complex cycle. Increasing loads (≳ 2 pN) inhibit the diffusion under Brownian motion because of a deep energy trough.

Biological motors fall into deep troughs on their energy surfaces under loading. Brownian motion does not allow the motors to break free because of the irregularity of high-energy events. Previous studies show that impulsive noise can propel systems out of the troughs and into other minima on the energy surface [200, 199]. Our preliminary simulation shows that Cauchy noise can "push" the nano-motors over the peaks under loads 3 orders of magnitude larger than typical Brownian motion permits. We presume that many other α-stable distributions similarly facilitate diffusion under high loads. We want to extend our simulations to study general Lévy processes. This would provide understanding of this nano-scale system under a wide range of noise inputs.

Figure 6.1: Impulsive α-stable noise improves simulated annealing optimization. This plot shows the absolute error between the actual global minimum and the noisy simulated annealing estimate of the global minimum. The simulation injected α-stable impulsive noise with α ∈ [0.2, 2]. Increasing impulsiveness improved the estimate by 2 orders of magnitude.
The presence of the jumps in the graph indicates that there are impulsiveness thresholds that lead to better search. Future work will further characterize this relationship and identify potential avenues where adaptive algorithms might tune the noise to improve performance.

The Langevin equation (Equation 6.1) approximates the diffusion of a particle of mass M, with drag coefficient γ, Brownian noise F̃(t), and load f [206]:

M \frac{dv}{dt} = -\gamma v - f + \tilde{F}(t)    (6.1)

Mathematical analysis studies the evolution of the Langevin equation under Brownian noise through the Fokker-Planck equation. This requires finite variance of the driving noise F̃(t), which α-stable noise does not possess in general [82]. We will study the application of Lévy processes to the diffusion equation and determine how we can obtain a general solution.

We can realize the earlier simulation with optical tweezers and apply a signal g(t) = f + F̃(t) to the ATPase molecule. Here g(t) acts as a single Lévy noise source with non-zero location. The validation would cover the family of α-stable distributions and would demonstrate the robustifying effects of α-stable noise.

Figure 6.2: (a) Extensive local search characterizes Brownian motion. The random walk proceeds according to normally distributed step sizes. In two dimensions this leads to regular returns to regions on a small scale. This is not a desirable behavior in large high-dimensional spaces because it means that the process requires long search times. (b) Cauchy motion performs a similar random walk but with impulsive Cauchy step sizes. The relative scales on the two sequence realizations (lower) show that Cauchy flights regularly take small steps but often take very large "flights" to new regions of the space. This leads to better search in large spaces.
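A forward-Euler integration of Equation 6.1 makes the Gaussian-versus-Cauchy comparison concrete. This is a toy sketch in arbitrary units with an explicit drag term; the parameter values and function names are illustrative assumptions, not the calibrated kinesin model:

```python
import math
import random

def langevin_euler(mass, gamma, load, dt, n_steps, rng, cauchy=False):
    """Integrate M dv/dt = -gamma*v - f + F(t) with forward Euler.
    F(t) is a unit Gaussian kick (Brownian) or a unit Cauchy kick (impulsive).
    Returns the final displacement x."""
    v = x = 0.0
    for _ in range(n_steps):
        if cauchy:
            # Standard Cauchy draw via the inverse CDF: tan(pi*(U - 1/2)).
            F = math.tan(math.pi * (rng.random() - 0.5))
        else:
            F = rng.gauss(0.0, 1.0)
        v += dt * (-gamma * v - load + F) / mass
        x += dt * v
    return x
```

Under a load much larger than the Gaussian kick scale the Brownian walker drifts steadily backward, while occasional Cauchy impulses can still exceed the load and push the walker forward, which is the qualitative behavior summarized in Figure 6.3.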
The Brownian and Cauchy random walks represent two members from a continuum of random walks with varying propensity to leap. Future research should study the efficiency of the search as a function of the impulsiveness. The major open question is whether impulsive jumps are always superior or whether there is an optimal value that depends on the specific problem.

Our current Monte Carlo simulation (N = 10 kinesin molecules) follows each kinesin molecule through the walking sequence over 5 seconds with a timestep of 0.1 ms [165]. Each complete cycle of the ATPase walk advances the molecule by 8 nm [111]. The simulation uses the forward Euler method (timestep < 0.1 μs) to iteratively arrive at the ATPase trajectory. We omit the drag term because the inertial velocity of the motor is negligible at such low Reynolds number (Langevin relaxation time τ = 1.08 × 10⁻¹² s). Figures 6.3 and 6.4 summarize the findings.

Experimental confirmation of the primary hypothesis would show that biological nano-motors can provide controlled transport at the molecular level under a wide range of loading. Selectively controlling the speed of many such motors would create "active catalysts". These catalysts could deliver molecular cargo to a reaction site and their speed would precisely control the time at which the products undergo reaction.

Figure 6.3: Kinesin walking vs. loading. Panels: (a) number of kinesin steps (no noise); (b) number of kinesin steps (additive noise). (a) The distance the kinesin molecule walks driven by Brownian noise depends on the imposed load.
The sharp drop-off shows that Brownian noise cannot overcome the energy required for the protein to diffuse. But loading does not inhibit the molecule's diffusion under Cauchy noise, as demonstrated by the flat behavior over the load range. (b) The plot shows that Cauchy noise allows diffusion over a wide range of loads. The number of steps kinesin takes in 5 seconds remains almost unchanged (15-20 steps per second) over the load range. The spikes over the surface arise from the realizations during the Monte Carlo simulation.

Our hypothesis suggests that a wide range of probability distributions between Cauchy and Gaussian demonstrate this "robustifying" effect. We hypothesize that α-stable noise robustifies kinesin locomotion. Noise in the form of heat propels chemical reactions in all systems. This noise is diffuse and locally uniform: it leads free components to wander and prevents extraction of any usable work from the background noise. Ratcheting systems provide an attractive means of directing reactions and locomotion. Many enzymes manifest this by presenting reactants in optimal conformations, thus increasing reaction rates. Another class of enzymes called motor-proteins employ steric factors and multi-step cycles to ensure the reverse reactions seldom occur. This arrangement allows the motor-proteins to walk along the microtubule matrix scaffolding in every cell. Kinesin in particular presents itself as a valuable tool because of its ubiquity through the cellular matrix and the tremendous body of research at the physiological, biochemical, and thermodynamic levels.

Figure 6.4: Cauchy noise dispersion and load force plot similar to Figure 6.3 but with loading values increased by 1000. Only the highest loading and smallest noise dispersion decrease the diffusion of kinesin driven with Cauchy noise.
The proposed simulation would extend the detail and range of the results.

Kinesin resides in one of two primary phases: bound and unbound (Figure 6.5). The entropy release following the binding and phosphorylation of ATP mediates the transition from bound to unbound. During the unbound state the free "heads" search the energy landscape for global energy minima. These minima correspond to the bound state. But the structure and mechanics of the molecule prevent retrograde steps, thus each cycle slides the kinesin forward 4 nm along the microtubule.

The latent heat from the surrounding system drives the Brownian search. In our proposal we would inject impulsive noise into the kinesin-microtubule system. We believe the system will benefit because the impulsive noise will drive the diffusion-limited search and converge to the next global minimum at a faster rate. The nanometer scale of the system lends itself further to the hypothesis because the system lives in the low Reynolds number regime. Impulsive contributions to the Langevin equation dominate inertial components. This leads to complete suppression of non-impulsive forces acting on a body. Thus we anticipate that impulsive noise will drive the protein through its search process and increase the rate of movement beyond the loading conditions that typical Brownian motion allows.

Figure 6.5: Kinesin moves in 8.3 nm steps along a microtubule scaffold [373]. During the search portion of the locomotive mechanism (illustrated in insets 2 and 4, [8]) environmental noise propels the individual heads of kinesin. We hypothesize that the addition of tuned α-stable noise during this phase of the movement cycle will enhance the turnover speed. Our hypothesis further suggests that certain types of noise will completely disrupt this search process and halt kinesin movement.

Additional benefits of the noise, such as increased turnover of ATP into ADP due to increases in local diffusion or changes to the underlying mass-action reactions, might also exist. Observing artifacts of these effects would imply a fundamental misunderstanding of kinesin motion and indicate an alternate mechanism for its movement.

Leveraging kinesin's role as a transport molecule provides further reason for interest. In vivo, kinesin plays a central role in cellular logistics and facilitates the "fast transport" systems in most cells. Better understanding these mechanisms at the chemical level should lead to a number of exploitable pathways across all cellular systems.

An extensive body of research allows accurate simulations of kinesin locomotion. The predominance of kinesin as a cellular transporter motivates a substantial body of research on the protein. Accurate simulations currently capture the chemical mechanisms of the process despite some remaining debate on the stepping mechanism. Our proposal aims to touch on questions about the diffusive search kinesin undergoes. Supporting evidence will come in the form of agreements or disagreements with predictions made by the prevailing search models. Eliminating these knowledge gaps will allow exploitation of similar systems and perhaps lend understanding of mechanisms pertinent to the development of artificial molecular transport systems driven entirely by noise.

Manipulation of kinesin by atomic force microscopy (AFM) allows direct study of the motor-protein under environmental conditions. Extending these procedures with optical tweezers provides functional characterization of kinesin's loading potential. To confirm our mathematical and simulation results we will apply the above techniques. Some possible methods for injecting noise into the molecular system include:

1. Additive noise via dynamic loading using optical tweezers.
2. Secondary directed energy source injecting impulsive shot-noise.
3.
Chemical entropy reservoir, such as ATP reversibly bound to a ligand that easily dissociates (perhaps using directed energy similar to the above).

4. Constructing a cantilever platform similar to Basso [15] to effectively move the stage below the molecule.

6.3 Noisy Genetic Algorithms to explore NP-complete Problems

Genetic and evolutionary algorithms search sample spaces for the global minimum by iterated recombination and mutation operations. Based loosely on the "survival of the fittest" paradigm, the solutions that best optimize the fitness measure survive in some form to the next generation. New organisms exhibiting superior fitness have the ability to drive the entire population toward the minimum.

Earlier research on parameter tuning investigated the application of a two-phase genetic algorithm [11, 340]. The algorithms directed mutations with impulsive Cauchy noise to launch organisms into new regions of the search space. They followed with Gaussian noise after a heuristically pre-determined limit because the smoothness of the Brownian diffusions allowed the solution to settle toward the nearest (hopefully global) minimum.

Our preliminary study considered a population of 150 organisms. Each genome consisted of a single real allele in the range of 0 to 30.2. The single allele reduced the problem to one with a recombination probability equal to 0. This emulates a population of asexual organisms that reproduce by mutating and splitting instead of mating. We drove the mutations with symmetric α-stable noise with α between 2.0 (Gaussian) and 1.2 (near Cauchy). The noise was generated using the Chambers-Mallows-Stuck method [362, 46]. We allowed each simulation to run for 500 generations. The fitness function shown in Figure 6.6 specified each organism's objective fitness.
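The Chambers-Mallows-Stuck draw for symmetric α-stable mutation noise can be sketched directly (unit scale, skew β = 0; this follows the standard CMS formula, and the function name is ours, not the study's code):

```python
import math
import random

def sym_alpha_stable(alpha, rng):
    """One symmetric alpha-stable draw via Chambers-Mallows-Stuck.
    alpha in (0, 2]: alpha = 2 gives a Gaussian (variance 2) and
    alpha = 1 gives the standard Cauchy."""
    V = (rng.random() - 0.5) * math.pi     # Uniform(-pi/2, pi/2)
    W = -math.log(1.0 - rng.random())      # Exp(1)
    if abs(alpha - 1.0) < 1e-12:
        return math.tan(V)                 # Cauchy special case
    t = math.sin(alpha * V) / (math.cos(V) ** (1.0 / alpha))
    s = (math.cos((1.0 - alpha) * V) / W) ** ((1.0 - alpha) / alpha)
    return t * s
```

Sweeping α from 2.0 down to 1.2 reproduces the mutation kernels used in the study: the draws stay mostly small but throw increasingly frequent large jumps as α falls.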
The algorithm ranked the organisms in each generation based on the unperturbed objective function and generated successive generations using roulette-type selection that gives the fittest individuals a mass proportional to their rank. We constructed our fitness function as a piecewise continuous set of scaled sinusoids. The symmetry around the minima ensures that organisms that mutate outside the 0 to 30.2 bound and get truncated do not achieve the global minimum. The global minimum occurs about 13 units from the next nearest suboptimal minimum. The depth and width of this minimum present a strong difficulty for the standard algorithm because Gaussian mutations seldom occur with sufficient magnitude to propel the solution into the global trough. More advanced genetic methods often attempt to overcome this by starting with a high-variance Gaussian mutation noise and diminishing the noise power using a time-dependent scaling factor. This practice introduces several problems, though, because of the heuristic selection of an initial variance and the inclusion of an arbitrary scaling parameter. It presupposes expert knowledge of the search space and does not permit the application of GAs to many sparse search spaces.

Figure 6.6: Our preliminary genetic simulations searched the above cost surface for the global minimum at around 15. The narrow well decreased the chance that the algorithm would start with an organism in the deep center trough. Global minima in higher-dimensional spaces should not need to exhibit this "golf-course" pathology because the expanse of the search space should confound the initial conditions. The spacing between each side lobe and the global minimum is approximately 13 units, so Brownian mutations seldom permitted these jumps. The jumpiness of the Lévy search process regularly pushed the solution to this minimum, allowing the simulation to converge toward the proper result.
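The rank-and-select step described above can be sketched as rank-proportional roulette. This is an illustrative fragment under our own naming (not the original implementation); we assume minimization, so the lowest objective value ranks best and receives the largest mass.

```python
import numpy as np

def rank_roulette(fitness, n_offspring, seed=None):
    """Roulette selection with mass proportional to fitness rank.

    The organism with the lowest (best) fitness gets mass n and the worst
    gets mass 1, so the selection pressure depends only on the ordering
    and not on the raw fitness scale.
    """
    rng = np.random.default_rng(seed)
    n = len(fitness)
    mass = np.empty(n)
    mass[np.argsort(fitness)] = np.arange(n, 0, -1)  # best -> n, worst -> 1
    return rng.choice(n, size=n_offspring, p=mass / mass.sum())
```

With three organisms of fitness 3, 1, and 2, the ranks give selection probabilities 1/6, 1/2, and 1/3 respectively, independent of how far apart the scores are.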
The presence of strong minima further restricts the application of the algorithm to problems without many "deceptive minima" [130, 131, 132]. Figure 6.7.(a) shows the performance of the algorithm over the tested range of α. The figure captures the average absolute error from 200 simulations performed at each depicted α. In over 99.9% of the trials the simulation obtained either the global minimum or one of the surrounding troughs. From the plot we can infer that for α = 2 (Gaussian noise) the simulation achieved the global minimum in only about 1 in 6 runs (18%). The figure also shows a clear correspondence between the "jumpiness" (lower α) and the rate of convergence. The noise approaching Cauchy demonstrates continual refinement of the solution over the 500-iteration trial. Figure 6.7.(b) shows the performance of the lower α levels when run to 2000 generations. They reach a quasi-steady state around 1000 iterations but display continued progress toward the global minimum out to the full 2000 generations. We also infer from this figure that the experiment reaches the global minimum better than 1 time out of 2, and longer runs approach 2 out of 3 (67%). Figure 6.7.(c) explores the convergence gap apparent between the Gaussian and the α = 1.8 stable noise using a third round of simulations. These simulations exhibit the same flat-line behavior as the Gaussian curve for near-Gaussian mutating noise (α ≥ 1.92). They also point to the existence of a discontinuity around α = 1.90, indicating that a given search space requires some minimal amount of jumpiness to overcome local barriers in the landscape. We propose further research to characterize this correspondence and plan to study the relation for an arbitrary search space. Understanding this pairing should allow us to define an adaptive search procedure that learns the optimal noise distribution as it advances.
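The experiments above have roughly the following shape as a mutation-only loop. The cost surface below is a hypothetical stand-in with a narrow well near 15 (it is not the piecewise sinusoidal surface of Figure 6.6), and the mutation scale of 0.3 and all names are our own choices for illustration.

```python
import numpy as np

def toy_cost(x):
    # Stand-in for a "golf-course" surface: shallow ripples plus a deep,
    # narrow well near x = 15. NOT the actual Figure 6.6 surface.
    return (np.cos(x) + 0.05 * (x - 15.0) ** 2
            - 3.0 * np.exp(-((x - 15.0) / 0.5) ** 2))

def evolve(n_org=150, n_gen=500, impulsive=True, seed=None):
    """Mutation-only (asexual) GA: rank, roulette-select, mutate, truncate."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 30.2, n_org)
    for _ in range(n_gen):
        mass = np.empty(n_org)
        mass[np.argsort(toy_cost(pop))] = np.arange(n_org, 0, -1)
        parents = pop[rng.choice(n_org, n_org, p=mass / mass.sum())]
        # Cauchy (alpha = 1) steps are impulsive; Gaussian (alpha = 2)
        # steps of the same scale almost never jump between distant wells.
        step = rng.standard_cauchy(n_org) if impulsive else rng.normal(size=n_org)
        pop = np.clip(parents + 0.3 * step, 0.0, 30.2)  # truncate to the bound
    return pop[np.argmin(toy_cost(pop))]
```

Passing `impulsive=False` gives the Brownian-mutation baseline; the heavy-tailed Cauchy steps occasionally make the long jumps that Gaussian steps of the same scale almost never produce.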
We will extend these preliminary results by including multiple-allele genomes and cross-over events. These additions will allow us to explore more complex search spaces and determine the benefits of α-stable noise there. We will also consider the inclusion of a noisy α-stable fitness function. An NGA generalizes the GA by assuming a noisy fitness function but assumes this noise follows a normal (α = 2) distribution. We believe that including an optimal additive low-α noise with the fitness score (or during the fitness ranking) will benefit the algorithm, especially when searching a deceptive or sparse-minimum cost surface. The noise will occasionally allow less-fit individuals to get more consideration during the roulette selection and allow them to survive to the next generation. Normally these individuals perish during the next selection phase because their fitness fails to meet the cutoff during the next round.

(a) Impulsive noise increases convergence rate. (b) Matched performance for 1.2 < α < 1.4. (c) Performance discontinuity near α = 1.90.
Figure 6.7: Figure (a) shows the rate of convergence of the genetic algorithm to the cost function in Figure 6.6 averaged over 200 simulations with varying α. The lower-α noise demonstrates a proportional convergence-rate increase over the higher-α noise. The figure shows that genetic algorithms with mutations driven by Brownian noise obtain the global minimum only 30% as often even over this smooth one-dimensional cost surface. Figures (b) and (c) examine the behavior of the algorithm near the extremal values of α to look for possible discontinuities. Figure (b) covers the lower range from 1.40 to 1.20 and exhibits the characteristic trend that lower-α noise tends to converge faster. Figure (c) covers the higher range from 2.00 to 1.80 and shows that a discontinuity may occur near α = 1.90. Thus for this problem the algorithm can dramatically improve its performance by tuning the mutator to employ a driving noise with α < 1.9.

Under the Gaussian
noise process low-ranked individuals cannot jump far enough in ranking to survive unless the simulation implements a very high variance. But this leads to a degradation of the ranking process and can cause the loss of the fittest individuals. Adding impulsive noise and a legacy parameter, however, should allow these individuals to contribute their diversity to the genetic pool but fade after a few generations if they do not yield improvements toward optimizing the global fitness. The impulsive nature of the α-stable noise will allow us to set a much lower dispersion parameter yet still provide the lowest fitness levels with chance opportunities to survive. We will also apply our past research results on Lévy distributions to study their effect on the algorithm's search [273]. The results from these trials should apply to genomes with arbitrary allele datatypes and extend our studies beyond real-valued vector alleles. After we establish methods characterizing the relation between the cost function and the optimal stable-noise distribution we will study adaptive methods to control the noise index to maintain optimal searching as we traverse the local terrain of the search space. We will then apply these results to classical high-dimensional minimization problems such as the traveling salesman problem (TSP), the job-scheduling problem, or the vertex cover problem in theoretical computer science. Because of the NP-complete nature of these problems they present tremendous search spaces with large numbers of ambiguous results (strongly local minima) [74, 48].

References

[1] Christopher J. Adams et al. "Zeolite MAP: The new detergent zeolite". In: Progress in Zeolite and Microporous Materials, Proceedings of the 11th International Zeolite Conference. Ed. by Hakze Chon, Son-Ki Ihm, and Young Sun Uh. Vol. 105. New York: Elsevier, 1997, pp. 1667–1674.
[2] S. Ahmad. "Carbon nanostructures fullerenes and carbon nanotubes".
In: IETE Technical Review (Institution of Electronics and Telecommunication Engineers, India) 16.3-4 (1999), pp. 297–310.
[3] E. Alvarez-Ayuso and A. Garcia-Sanchez. "Removal of heavy metals from waste waters by natural and Na-exchanged bentonites". In: Clays and Clay Minerals 51.5 (2003), pp. 475–480.
[4] Christophe Andrieu et al. "An introduction to MCMC for machine learning". In: Machine Learning 50.1-2 (2003), pp. 5–43.
[5] Anon. "Nanotubes going soft". In: Chemistry World 1.7 (2004), p. 25.
[6] David Applebaum. "Extending Stochastic Resonance for Neuron Models to General Lévy Noise". In: IEEE Transactions on Neural Networks 20.12 (2009), pp. 1993–1995.
[7] A. I. Arnon and P. R. Stout. "The essentiality of certain elements in minute quantity for plants with special reference to copper". In: Plant Physiology 141 (1939), pp. 371–375.
[8] C. L. Asbury, A. N. Fehr, and S. M. Block. "Kinesin Moves by an Asymmetric Hand-Over-Hand Mechanism". In: Science 302.5653 (2003), pp. 2130–2134.
[9] Krishna B. Athreya and P. Ney. "A new approach to the limit theory of recurrent Markov chains". In: Transactions of the American Mathematical Society 245 (1978), pp. 493–501.
[10] E. Álvarez Ayuso, A. García-Sánchez, and X. Querol. "Purification of metal electroplating waste waters using zeolites". In: Water Research 37.20 (2003), pp. 4855–4862.
[11] Thomas Bäck and Martin Schütz. "Intelligent Mutation Rate Control in Canonical Genetic Algorithms". In: International Symposium on Methodologies for Intelligent Systems. 1996, pp. 158–167.
[12] Christian Baerlocher, Lynne B. McCusker, and David H. Olson. Atlas of Zeolite Framework Types. 6th. New York: Elsevier, 2007.
[13] M. R. Bakker and C. Nys. "Effect of liming on fine root cation exchange sites of oak". In: Journal of plant nutrition 22.10 (1999), pp. 1567–1575.
[14] Srinivasan Balaji, Hosam M. Mahmoud, and Osamu Watanabe. "Distributions in the Ehrenfest process". In: Statistics & Probability Letters 76.7 (2006), pp.
666–674.
[15] M. Basso et al. "Modelling and analysis of autonomous micro-cantilever oscillations". In: Nanotechnology 19.47 (2008), p. 475501.
[16] G. J. Baxter, R. A. Blythe, and A. J. McKane. "Exact solution of the multi-allelic diffusion model". In: Mathematical Biosciences 209.1 (2007), pp. 124–170.
[17] C. Bayer et al. "Changes in soil organic matter fractions under subtropical no-till cropping systems". In: Soil Science Society of America journal 65.5 (2001), pp. 1473–1478.
[18] F. E. Bear. Chemistry of the Soil. 2nd. New York: Nostrand Reinhold Co., 1964.
[19] Isabel Beichl and Francis Sullivan. "The Metropolis algorithm". In: Computing in Science & Engineering 2.1 (2000), pp. 65–69.
[20] J. M. Bell et al. "Priming effect and C storage in semi-arid no-till spring crop rotations". In: Biology and fertility of soils 37.4 (2003), pp. 237–244.
[21] Roberto Benzi, Alfonso Sutera, and Angelo Vulpiani. "The mechanism of stochastic resonance". In: Journal of Physics A: mathematical and general 14.11 (1981), p. L453.
[22] J. Berg, J. Tymoczko, and L. Stryer. Biochemistry. 5th. New York: W. H. Freeman and Company, 2001.
[23] Henri Berthiaux and Vadim Mizonov. "Applications of Markov Chains in Particulate Process Engineering: A Review". In: The Canadian Journal of Chemical Engineering 82.6 (2004), pp. 1143–1168.
[24] A. Bianco and M. Prato. "Can Carbon Nanotubes Be Considered Useful Tools for Biological Applications?" In: Advanced Materials 15.20 (2003), pp. 1765–1768.
[25] A. Bielanski and A. Malecka. "Cumene cracking on NaH–Y and NaH-ZSM–5 type zeolites as catalysts". In: Zeolites 6.4 (1986), pp. 249–252.
[26] Patrick Billingsley. Probability and Measure. 3rd. New York: Wiley-Interscience, 1995.
[27] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 2008.
[28] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
[29] A. B. Boricha et al. "Facile dihydroxylation of styrene using clay based catalysts".
In: Applied Catalysis A: General 179.1-2 (1999), pp. 5–10.
[30] E. Borowiak-Palen et al. "Synthesis and Electronic Properties of B-doped Single Wall Carbon Nanotubes". In: Carbon 42.5 (2004), pp. 1123–1126.
[31] P. Bose and D. A. Reckhow. "Modeling pH and Ionic Strength Effects on Proton and Calcium Complexation of Fulvic Acid: A Tool for Drinking Water-NOM Studies". In: Environmental Science and Technology 31.3 (1997), pp. 765–770.
[32] R. Bouabid, M. Badraoui, and R. R. Bloom. "Potassium fixation and charge characteristics of soil clays". In: Soil Science Society of America journal 55.5 (1991), pp. 1493–1498.
[33] Leo Breiman. Probability. 1st ed. Reading, Massachusetts: Addison-Wesley, 1968.
[34] S. Brin and L. Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine". In: Proceedings of the Seventh International World-Wide Web Conference (WWW 1998). Brisbane, Australia: International World-Wide Web Conference Committee IW3C2, 1998.
[35] J. B. Brower, R. L. Ryan, and M. Pazirandeh. "Comparison of ion-exchange resins and biosorbents for the removal of heavy metals from plating factory wastewater". In: Environmental Science and Technology 21.10 (1997), pp. 2910–2914.
[36] D. A. Brown and W. A. Albrecht. Plant nutrition and the hydrogen ion. Columbia, MO: University of Missouri, College of Agriculture, Agricultural Experiment Station, 1955.
[37] James Bucklew. Introduction to rare event simulation. Springer Science & Business Media, 2013.
[38] Adi R. Bulsara and Anthony Zador. "Threshold detection of wideband signals: A noise-induced maximum in the mutual information". In: Phys. Rev. E 54.3 (1996), R2185–R2188.
[39] A. Burks, H. Goldstine, and J. von Neumann. Logical Design of an Electronic Computing Instrument. 1946.
[40] F. Cadena, R. Rizvi, and R. W. Peters. "Feasibility studies for the removal of heavy metals from solution using tailored bentonite". In: Proceedings of the 22nd Mid-Atlantic Industrial Waste Conference. Philadelphia, PA, 1990, pp.
77–94.
[41] W.-Q. Cai and J.-F. Chean. "Application of nanometer materials in catalysts field". In: Modern Chemical Industry 22.suppl. (2002), pp. 34–37.
[42] J. M. Carrick, B. A. Kashemirov, and C. E. McKenna. "A photolabile Troika ester: o-Nitrobenzyl (E)-(Hydroxyimino)(dihydroxyphosphinyl) acetate". In: Tetrahedron 56.16 (2000), pp. 2391–2396.
[43] Antonio Carrieri et al. "Binding models of reversible inhibitors to type-B monoamine oxidase". In: Journal of Computer-Aided Molecular Design 16.11 (2002), pp. 769–778.
[44] Rolando Castro and Tim Sauer. "Chaotic stochastic resonance: noise-enhanced reconstruction of attractors". In: Physical review letters 79.6 (1997), p. 1030.
[45] J. Cezikova et al. "Humic acids from coals of the North-Bohemian coal field - II. Metal-binding capacity under static conditions". In: Reactive and Functional Polymers 47.2 (2001), pp. 111–118.
[46] J. M. Chambers, C. L. Mallows, and B. W. Stuck. "A Method for Simulating Stable Random Variables". In: Journal of the American Statistical Association 71.354 (1976), pp. 340–344.
[47] J. M. Chambers, C. L. Mallows, and B. W. Stuck. "A method for simulating stable random variables". In: Journal of the American Statistical Association 71.354 (1976).
[48] Lance Chambers. Practical Handbook of Genetic Algorithms. Boca Raton, FL: CRC Press, 1999.
[49] M. H. Chantigny, D. A. Angers, and P. Rochette. "Fate of carbon and nitrogen from animal manure and crop residues in wet and cold soils". In: Soil biology and biochemistry 34.4 (2002), pp. 509–517.
[50] François Chapeau-Blondeau. "Noise-enhanced capacity via stochastic resonance in an asymmetric binary channel". In: Physical Review E 55.2 (1997), p. 2016.
[51] François Chapeau-Blondeau, Solenna Blanchard, and David Rousseau. "Fisher information and noise-aided power estimation from one-bit quantizers". In: Digital Signal Processing 18.3 (2008), pp. 434–443.
[52] François Chapeau-Blondeau and David Rousseau.
"Noise improvements in stochastic resonance: From signal amplification to optimal detection". In: Fluctuation and Noise Letters 2.03 (2002), pp. L221–L233.
[53] A. Chatterjee, T. Iwasaki, and T. Ebina. "2:1 dioctahedral smectites as a selective sorbent for dioxins and furans: Reactivity index study". In: Journal of Physical Chemistry A 106.4 (2002), pp. 641–648.
[54] A. V. Chechkin et al. "Generalized fractional diffusion equations for accelerating subdiffusion and truncated Lévy flights". In: Phys. Rev. E 78.2 (2008), p. 021111.
[55] Hao Chen et al. "Theory of the Stochastic Resonance Effect in Signal Detection: Part I–Fixed Detectors". In: IEEE Trans. on Signal Process. 55.7 (2007), pp. 3172–3184.
[56] Dante R. Chialvo, André Longtin, and Johannes Müller-Gerking. "Stochastic resonance in models of neuronal ensembles". In: Physical review E 55.2 (1997), p. 1798.
[57] R. B. Clark and S. K. Zeto. "Mineral acquisition by arbuscular mycorrhizal plants". In: Journal of plant nutrition 23.7 (2000), pp. 867–902.
[58] R. B. Clark et al. "Mineral acquisition by maize grown in acidic soil amended with coal combustion products". In: Communications in soil science and plant analysis 32.11-12 (2001), pp. 1861–1884.
[59] D. T. Clarkson and U. Luttge. "Mineral nutrition: anions". In: Progress in botany 49 (1997), pp. 68–86.
[60] Robert Cogburn et al. "The central limit theorem for Markov processes". In: Proc. Sixth Berkeley Symp. Math. Statist. Probab. Vol. 2. 1972, pp. 485–512.
[61] Carmine Colella. "Natural zeolites". In: Zeolites and Ordered Mesoporous Materials: Progress and Prospects. Ed. by J. Cejka and H. van Bekkum. Vol. 157. Studies in Surface Science and Catalysis. Prague, Czech Republic: Elsevier, 2005, pp. 13–40.
[62] James J. Collins, Thomas T. Imhoff, and Peter Grigg. "Noise-enhanced information transmission in rat SA1 cutaneous mechanoreceptors via aperiodic stochastic resonance". In: Journal of Neurophysiology 76.1 (1996), pp. 642–645.
[63] J. J. Collins, Carson C. Chow, and Thomas T. Imhoff. "Aperiodic stochastic resonance in excitable systems". In: Physical Review E 52.4 (1995), R3321.
[64] J. J. Collins, Carson C. Chow, Thomas T. Imhoff, et al. "Stochastic resonance without tuning". In: Nature 376.6537 (1995), pp. 236–238.
[65] J. J. Collins et al. "Aperiodic stochastic resonance". In: Physical Review E 54.5 (1996), p. 5575.
[66] University of Cologne. Spin Glass Server.
[67] John M. Conroy et al. "Chromosome identification using hidden Markov models: comparison with neural networks, singular value decomposition, principal components analysis, and Fisher discriminant analysis". In: Laboratory Investigation 80.11 (2000), pp. 1629–1641.
[68] Cristian Covarrubias et al. "Removal of trivalent chromium contaminant from aqueous media using FAU-type zeolite membranes". In: Journal of Membrane Science 312.1-2 (2008), pp. 163–173.
[69] Mary Kathryn Cowles and Bradley P. Carlin. "Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review". In: Journal of the American Statistical Association 91.434 (1996), pp. 883–904.
[70] Colin S. Cundy and Paul A. Cox. "The hydrothermal synthesis of zeolites: Precursors, intermediates and reaction mechanism". In: Microporous and Mesoporous Materials 82.1-2 (2005), pp. 1–78.
[71] Andre Robert Dabrowski and Adam Jakubowski. "Stable Limits for Associated Random Variables". In: The Annals of Probability 22.1 (1994), pp. 1–16.
[72] Qi Dai, Xiao qing Liu, and Tian ming Wang. "Analysis of protein sequences and their secondary structures based on transition matrices". In: Journal of Molecular Structure: THEOCHEM 803.1-3 (2007), pp. 115–122.
[73] J. J. Davis et al. "Chemical and biochemical sensing with modified single walled carbon nanotubes". In: Chemistry - A European Journal 9.16 (2003), pp. 3732–3739.
[74] Kenneth A. De Jong and William M. Spears. "Using Genetic Algorithms to Solve NP-Complete Problems". In: Proc. of the Third Int. Conf.
on Genetic Algorithms. 1989, pp. 124–132.
[75] C. Dekker. "Carbon nanotubes as molecular quantum wires". In: Physics Today 52.5 (1999), pp. 22–28.
[76] Holger Dette. "On a Generalization of the Ehrenfest Urn Model". In: Journal of Applied Probability 31.4 (1994), pp. 930–939.
[77] S. J. Deverel and R. Fujii. "Chemistry of trace elements in soils and ground water". In: American Society of Civil Engineers, Manuals and Reports on Engineering Practice 71 (1990), pp. 64–90.
[78] Persi Diaconis, Gilles Lebeau, and Laurent Michel. "Geometric analysis for the Metropolis algorithm on Lipschitz domains". In: Inventiones mathematicae 185.2 (2011), pp. 239–281.
[79] T. S. Dierfolf, L. M. Arya, and R. S. Yost. "Water and cation movement in an Indonesian ultisol". In: Agronomy journal 89.4 (1997), pp. 572–579.
[80] Daniel Diermeier and Jan A. Van Mieghem. "Voting with your Pocketbook - A Stochastic Model of Consumer Boycotts". In: Mathematical and Computer Modelling 48 (2008), pp. 1497–1509.
[81] Johannes M. Dieterich. "Empirical Review of Standard Benchmark Functions Using Evolutionary Global Optimization". In: Applied Mathematics 03.October (2012), pp. 1552–1564.
[82] O. Ditlevsen. "Invalidity of the Spectral Fokker–Planck Equation for Cauchy Noise Driven Langevin Equation". In: Probabilistic Engineering Mechanics 19 (2004), pp. 385–392.
[83] Petar M. Djurić and Joon-Hwa Chun. "An MCMC sampling approach to estimation of nonstationary hidden Markov models". In: Signal Processing, IEEE Transactions on 50.5 (2002), pp. 1113–1123.
[84] Wolfgang Doeblin. "Exposé de la théorie des chaînes simples constantes de Markov à un nombre fini d'états". In: Mathématique de l'Union Interbalkanique 2.77-105 (1938), pp. 78–80.
[85] Rona J. Donahoe and J. G. Liou. "An experimental study on the process of zeolite formation". In: Geochimica et Cosmochimica Acta 49.11 (1985), pp. 2349–2360.
[86] Joseph L. Doob. Stochastic processes. Vol. 101. New York: Wiley, 1953.
[87] L. Duclaux. "Review of the doping of carbon nanotubes (multiwalled and single-walled)". In: Carbon 40.10 (2002), pp. 1751–1764.
[88] P. D. Duffy and J. D. Schreiber. "Nutrient leaching of a loblolly pine forest floor by simulated rainfall. II. Environmental factors". In: Forest science 36.3 (1990), pp. 777–789.
[89] Rick Durrett. Probability: Theory and Examples. 3rd. Florence, KY: Thompson Brooks/Cole, 2005.
[90] C. Duwig et al. "Water dynamics and nutrient leaching through a cropped Ferralsol in the Loyalty Islands (New Caledonia)". In: Journal of environmental quality 29.3 (2000), pp. 1010–1019.
[91] Bartłomiej Dybiec, Adam Kleczkowski, and Christopher A. Gilligan. "Modelling control of epidemics spreading by long-range interactions". In: J R Soc Interface 6.39 (2009), pp. 941–50.
[92] A. Dyer and H. Faghihian. "Diffusion in heteroionic zeolites: part 1: Diffusion of water in heteroionic natrolites". In: Microporous and Mesoporous Materials 21.1-3 (1998), pp. 27–38.
[93] Alan Dyer and David Keir. "Nuclear waste treatment by zeolites". In: Zeolites 4.3 (1984), pp. 215–217.
[94] Martin Dyer, Alan Frieze, and Ravi Kannan. "A random polynomial-time algorithm for approximating the volume of convex bodies". In: Journal of the ACM (JACM) 38.1 (1991), pp. 1–17.
[95] Roger Eckhardt. "Stan Ulam, John von Neumann, and the Monte Carlo method". In: ().
[96] B. Efron. The jackknife, the bootstrap, and other resampling plans. Vol. 38. SIAM, 1982, p. 92.
[97] Barry Eichengreen et al. "How the subprime crisis went global: evidence from bank credit default swap spreads". In: Journal of International Money and Finance 31.5 (2012), pp. 1299–1318.
[98] M. J. Eick, W. D. Brady, and C. K. Lynch. "Charge properties and nitrate adsorption of some acid southeastern soils". In: Journal of environmental quality 28.1 (1999), pp. 138–144.
[99] M. Endo et al. "Anode performance of a Li ion battery based on graphitized and B-doped milled mesophase pitch-based carbon fibers".
In: Carbon 37 (1999), pp. 561–568.
[100] M. Endo et al. "Recent development of carbon materials for Li ion batteries". In: Carbon 28 (2000), pp. 183–197.
[101] P. J. van Erp, V. J. G. Houba, and M. L. van Beusichem. "Actual cation exchange capacity of agricultural soils and its relationship with pH and content of organic carbon and clay". In: Communications in soil science and plant analysis 32.1-2 (2001), pp. 19–31.
[102] Marco Falcioni and Michael W. Deem. "A biased Monte Carlo scheme for zeolite structure solution". In: Journal of Chemical Physics 110.3 (1999), pp. 1754–1766.
[103] Zhaozhi Fan. "Parameter Estimation of Stable Distributions". In: Communications in Statistics - Theory and Methods 35.2 (2006), pp. 245–255.
[104] Martin Feldkircher. "The Determinants of Vulnerability to the Global Financial Crisis 2008 to 2009: Credit growth and other sources of risk". In: Journal of International Money and Finance 43 (2014), pp. 19–49.
[105] W. Feller. "Diffusion Processes in Genetics". In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Ed. by J. Neyman. Berkeley, CA: University of California Press, 1951, pp. 227–246.
[106] William Feller. An Introduction to Probability Theory and Its Applications. 3rd ed. Vol. II. Wiley, 1968.
[107] William Feller. An introduction to probability theory and its applications. 2nd. Vol. 2. New York, NY, USA: John Wiley & Sons, Inc., 2008.
[108] Ronald Aylmer Fisher. The Genetical Theory of Natural Selection. 2nd. New York: Dover, 1958.
[109] M. Flytzani-Stephanopoulos, I.-S. Nam, and X. Verykios (edited by). "Application of zeolites to clean up nuclear waste". In: Applied Catalysis B: Environmental 5.4 (1995), N34–N35.
[110] A. Fonseca et al. "Synthesis of single- and multi-wall carbon nanotubes over supported catalysts". In: Applied Physics A: Materials Science and Processing 67.1 (1998), pp. 11–22.
[111] R. F. Fox and M. H. Choi.
"Rectified Brownian Motion and Kinesin Motion Along Microtubules". In: Physical Review E 63.051901 (2001), pp. 1–12.
[112] Richard V. Gaines et al. DANA's New Mineralogy. 8th. New York: John Wiley & Sons, 1997.
[113] Luca Gammaitoni. "Stochastic resonance in multi-threshold systems". In: Physics Letters A 208.4 (1995), pp. 315–322.
[114] Hu Gang et al. "Stochastic resonance in a nonlinear system driven by an aperiodic force". In: Physical review A 46.6 (1992), p. 3250.
[115] F. Ganry et al. "Management of soil organic matter in semiarid Africa for annual cropping systems". In: Nutrient cycling in agroecosystems 61.1-2 (2001), pp. 105–118.
[116] J. M. Garcia-Mina, M. C. Antolin, and M. Sanchez-Diaz. "Metal-humic complexes and plant micronutrient uptake: a study based on different plant species cultivated in diverse soil types". In: Plant and soil 258 (2004), pp. 57–68.
[117] A. E. Gelfand and A. F. M. Smith. "Sampling-Based Approaches to Calculating Marginal Densities". In: Journal of the American Statistical Association 85 (1990), pp. 398–409.
[118] Andrew Gelman and Donald B. Rubin. "Inference from iterative simulation using multiple sequences". In: Statistical science (1992), pp. 457–472.
[119] S. Geman and D. Geman. "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images". In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (1984), pp. 721–741.
[120] Pierre-Gilles de Gennes. "Relaxation Anomalies in Linear Polymer Melts". In: Macromolecules 35.9 (2002), pp. 3785–3786.
[121] V. Georgakilas et al. "Organic Derivatization of Single-Walled Carbon Nanotubes by Clays and Intercalated Derivatives". In: Carbon 42.4 (2004), pp. 865–870.
[122] John Geweke et al. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Vol. 196. 1991.
[123] Charles J. Geyer. "On the asymptotics of constrained M-estimation". In: The Annals of Statistics (1994), pp. 1993–2010.
[124] H.
Ghobarkar and O. Schäf. "Hydrothermal synthesis of laumontite, a zeolite". In: Microporous and Mesoporous Materials 23.1-2 (1998), pp. 55–60.
[125] H. Ghobarkar and O. Schäf. "Synthesis of gismondine-type zeolites by the hydrothermal method". In: Materials Research Bulletin 34.4 (1999), pp. 517–525.
[126] H. Ghobarkar et al. "Zeolite synthesis by simulation of their natural formation conditions: from macroscopic to nanosized crystals". In: Journal of Solid State Chemistry 173.1 (2003), pp. 27–31.
[127] W. R. Gilks et al. Markov chain Monte Carlo in practice. London: CRC Press, 1996.
[128] G. P. Gillman. "A proposed method for the measurement of exchange properties of highly weathered soils". In: Australian Journal of Soil Research 17 (1979), pp. 129–139.
[129] G. P. Gillman and E. A. Sumpter. "Modification to the Compulsive Exchange Method for Measuring Exchange Characteristics of Soils". In: Australian Journal of Soil Research 24.1 (1986), pp. 61–66.
[130] David E. Goldberg. "Construction of High-order Deceptive Functions using Low-order Walsh Coefficients". In: Annals of Mathematics and Artificial Intelligence 5 (1992), pp. 35–48.
[131] David E. Goldberg. "Genetic algorithms and Walsh functions: Part II, deception and its analysis". In: Complex Systems 3 (1989), pp. 153–171.
[132] David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley Professional, 1989.
[133] Google, Inc. Annual Report, Form 10-K. 2009.
[134] Google, Inc. Annual Report, Form 10-K. 2010.
[135] K. Górska and K. A. Penson. "Lévy stable distributions via associated integral transform". In: Journal of Mathematical Physics 53.5 (2012), p. 053302.
[136] D. Gournis et al. "Incorporation of Fullerene Derivatives into Smectite Clays: A New Family of Organic-Inorganic Nanocomposites". In: Journal of the American Chemical Society 126.27 (2004), pp. 8561–8568.
[137] D. J. Greenland and M. H. B. Hayes. The Chemistry of Soil Constituents.
Chichester, United Kingdom: John Wiley and Sons, 1978.
[138] Mircea Grigoriu. Applied non-Gaussian processes: examples, theory, simulation, linear random vibration, and MATLAB solutions. Englewood Cliffs, N.J.: Prentice Hall, 1995.
[139] G. Grimmett and D. Stirzaker. Probability and Random Processes. 3rd. New York: Oxford University Press, 2001.
[140] G. Guanghua. "Energetics, structure, mechanical and vibrational properties of single-walled carbon nanotubes". In: Nanotechnology 9.3 (1998), pp. 184–191.
[141] Marco Guerriero et al. "Speedier sequential tests via stochastic resonance." In: ICASSP. New York: IEEE, 2008, pp. 3901–3904.
[142] Marco Guerriero et al. "Stochastic resonance in sequential detectors". In: Trans. Sig. Proc. 57.1 (2009), pp. 2–15.
[143] V. P. Gutschick. "Nutrient-limited growth rates: roles of nutrient-use efficiency and of adaptations to increase uptake rate". In: Journal of experimental botany 44.258 (1993), pp. 41–51.
[144] Peter Hall. The bootstrap and Edgeworth expansion. New York: Springer-Verlag, 1992.
[145] Peter Hänggi. "Stochastic Resonance in Biology". In: Chem. Phys. Chem. 3 (2002), pp. 285–290.
[146] W. K. Hastings. "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". In: Biometrika 57 (1970), pp. 97–109.
[147] Daniel B. Hawkins. "A first-order Markov-chain model of zeolite crystallization". In: Clays and Clay Minerals 37.5 (1989), pp. 433–438.
[148] T. Hayashi et al. "Structure and application of fluorinated carbon nanofibers". In: Materials Research Society Symposium - Proceedings 633 (2001), A14181–A14184.
[149] Chow Heneghan et al. "Information measures quantifying aperiodic stochastic resonance". In: Physical Review E 54.3 (1996), R2228.
[150] H.-J. Herbert and H. C. Moog. "Cation exchange, interlayer spacing, and water content of MX-80 bentonite in high molar saline solutions". In: Engineering Geology 54.1-2 (1999), pp. 55–65.
[151] M. V. Hickman.
"Long-term tillage and crop rotation effects on soil chemical and mineral properties". In: Journal of plant nutrition 25.7 (2002), pp. 1457–1470.
[152] R. Hilfer. "H-function representations for stretched exponential relaxation and non-Debye susceptibilities in glassy systems". In: Phys. Rev. E 65.6 (2002), p. 061510.
[153] David B. Hitchcock. "A history of the Metropolis–Hastings algorithm". In: The American Statistician 57.4 (2003).
[154] Mark D. Hollingsworth. "Crystal Engineering: from Structure to Function". In: Science 295.5564 (2002), pp. 2410–2413.
[155] Yen-Ting Hu, Rüdiger Kiesel, and William Perraudin. "The estimation of transition matrices for sovereign credit ratings". In: Journal of Banking & Finance 26.7 (2002), pp. 1383–1406.
[156] C. P. Huang, C. R. O'Melia, and J. J. Morgan, eds. Aquatic Chemistry. Advances in chemistry series 244. Washington D.C.: American Chemical Society, 1995.
[157] Da-hai Huang and Pi-e Zheng. "MCMC-based comparison of two classes of volatility models [J]". In: Journal of Systems Engineering 4 (2004), p. 014.
[158] Songfang Huang and Steve Renals. "Hierarchical Bayesian language models for conversational speech recognition". In: Audio, Speech, and Language Processing, IEEE Transactions on 18.8 (2010), pp. 1941–1954.
[159] Eddie C. M. Hui and Xian Zheng. "Exploring the dynamic relationship between housing and retail property markets: an empirical study of Hong Kong". In: Journal of Property Research 29.2 (2012), pp. 85–102.
[160] K. Inoue and C. Satoh. "Surface charge characteristics of hydroxyaluminosilicate- and hydroxyaluminum-montmorillonite complexes". In: Soil Science Society of America journal 57.2 (1993), pp. 545–552.
[161] C. V. Toner IV, D. L. Sparks, and T. H. Carski. "Anion exchange chemistry of Middle Atlantic soils: Charge properties and nitrate retention kinetics". In: Soil Science Society of America journal 53.4 (1989), pp. 1061–1067.
[162] R. Jacquemin et al.
“Doping mechanism in single-wall carbon nanotubes studied by optical absorption”. In: Synthetic Metals 115.1 (2000), pp. 283–287.
[163] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. “Statistical pattern recognition: A review”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 22.1 (2000), pp. 4–37.
[164] Naresh Jain and Benton Jamison. “Contributions to Doeblin’s theory of Markov processes”. In: Probability Theory and Related Fields 8.1 (1967), pp. 19–40.
[165] A. Janicki and A. Weron. Simulation and Chaotic Behavior of α-Stable Stochastic Processes. New York, New York: Marcel Dekker, Inc., 1994.
[166] Mark Jerrum and Alistair Sinclair. “The Markov chain Monte Carlo method: an approach to approximate counting and integration”. In: Approximation algorithms for NP-hard problems (1996), pp. 482–520.
[167] J. B. Jones. Plant nutrition manual. New York: CRC Press, 1998.
[168] K. P. De Jong. “Synthesis of supported catalysts”. In: Current Opinion in Solid State and Materials Science 4.1 (1999), pp. 55–62.
[169] T. E. Jordan, D. L. Correll, and D. E. Weller. “Nutrient interception by a riparian forest receiving inputs from adjacent cropland”. In: Journal of environmental quality 22.3 (1993), pp. 467–473.
[170] J. D. Joslin, J. M. Kelly, and H. van Miegroet. “Soil chemistry and nutrition of North American spruce-fir stands: evidence for recent change”. In: Journal of environmental quality 21.1 (1992), pp. 12–30.
[171] P. A. Moore Jr and W. H. Patrick Jr. “Calcium and magnesium availability and uptake by rice in acid sulfate soils”. In: Soil Science Society of America journal 53.3 (1989), pp. 816–822.
[172] A. Z. Juhasz. “Some surface properties of Hungarian bentonites”. In: Colloids and Surfaces 49.1-2 (1990), pp. 41–55.
[173] Sepandar Kamvar et al. “Extrapolation Methods for Accelerating PageRank Computations”. In: Proceedings of the Twelfth International World Wide Web Conference (WWW 2003).
Budapest, Hungary: International World-Wide Web Conference Committee IW3C2, 2003.
[174] A. Kapoor and T. Viraraghaven. “Use of immobilized bentonite in removal of heavy metals from wastewater”. In: Journal of Environmental Engineering 124.10 (1998), pp. 1020–1024.
[175] Samuel Karlin and James McGregor. “Ehrenfest Urn Models”. In: Journal of Applied Probability 2.2 (1965), pp. 352–376.
[176] B. A. Kashemirov et al. “Troika acid derivatives: multifunctional ligands for metal complexation in solution and on solid supports. A novel, linear trinickel (“Troitsa”) complex”. In: Phosphorus Sulfur Silicon and the Related Elements 177.10 (2002), pp. 2273–2274.
[177] B. A. Kashemirov et al. “Troika acids: synthesis, structure, and stability of novel (Hydroxyimino)phosphonoacetic acids”. In: Journal of the American Chemical Society 117.27 (1995), pp. 7285–7286.
[178] H. Katou. “A pH-dependence implicit formulation of cation- and anion-exchange capacities of variable-charge soils”. In: Soil Science Society of America Journal 66.4 (2002), pp. 1218–1224.
[179] H. Katou, B. E. Clothier, and S. R. Green. “Anion transport involving competitive adsorption during transient water flow in an andisol”. In: Soil Science Society of America journal 60.5 (1996), pp. 1368–1375.
[180] S. Kaufhold et al. “Comparison of methods for the quantification of montmorillonite in bentonites”. In: Applied Clay Science 22.3 (2002), pp. 145–151.
[181] S. M. Kay. “Can detectability be improved by adding noise?” In: IEEE Signal Process. Lett. 7.1 (2000), pp. 8–10.
[182] F. P. Kelly. Reversibility and Stochastic Networks. New York: John Wiley & Sons, 1979.
[183] H. W. Kerr. “The nature of base exchange and soil acidity”. In: Journal of the American Society of Agronomy 20.4 (1928), pp. 309–335.
[184] R. Kretzschmar, D. Hesterberg, and H. Sticher. “Effects of adsorbed humic acid on surface charge and flocculation of kaolinite”. In: Soil Science Society of America journal 61.1 (1997), pp. 101–108.
[185] C.-H. Kiang et al. “Carbon Nanotubes with Single-layer Walls”. In: Carbon 33.7 (1995), pp. 903–914.
[186] V. J. Kilmer, O. E. Hays, and R. J. Muckenhirn. “Plant nutrient and water losses from fayette silt loam as measured by monolith lysimeters”. In: Journal of the American Society of Agronomy 36.3 (1944), pp. 249–263.
[187] D. Y. Kim and Y. H. Rhee. “Biodegradation of microbial and synthetic polyesters by fungi”. In: Applied microbiology and biotechnology 61.4 (2003), pp. 300–308.
[188] Hyun Mun Kim and Bart Kosko. “Fuzzy prediction and filtering in impulsive noise”. In: Fuzzy Sets and Systems 77.1 (1996), pp. 15–33.
[189] Scott Kirkpatrick, C. D. Gelatt, and Mario P. Vecchi. “Optimization by simulated annealing”. In: Science 220.4598 (1983), pp. 671–680.
[190] A. Yu. Kitaev. “Quantum computations: algorithms and error correction”. In: Russian Mathematical Surveys 52.6 (1997), pp. 1191–1249.
[191] Alexei Yu. Kitaev, Alexander Shen, and Mikhail N. Vyalyi. Classical and quantum computation. Vol. 47. American Mathematical Society, Providence, 2002.
[192] Martin J. Klein. “Entropy and the Ehrenfest urn model”. In: Physica 22.6-12 (1956), pp. 569–575.
[193] P. J. A. Kleinman, R. B. Bryant, and D. Pimentel. “Assessing ecological sustainability of slash-and-burn agriculture through soil fertility indicators”. In: Agronomy journal 88.2 (1996), pp. 122–127.
[194] S. M. Kogon and D. G. Manolakis. “Signal modeling with self-similar alpha-stable processes: the fractional Lévy stable motion model”. In: IEEE Transactions on Signal Processing 44.4 (1996), pp. 1006–1010.
[195] G. F. Koopmans et al. “Phosphorus desorption dynamics in soil and the link to a dynamic concept of bioavailability”. In: Journal of environmental quality 33 (2004), pp. 1393–1402.
[196] B. Kosko et al. “Nanosignal Processing: Stochastic Resonance in Carbon Nanotubes That Detect Subthreshold Signals”. In: Nano Letters 3.10 (2003), pp. 1683–1686.
[197] Bart Kosko. Noise.
New York: Viking, 2006.
[198] Bart Kosko and Sanya Mitaim. “Robust stochastic resonance for simple threshold neurons”. In: Physical Review E 70.3 (2004), p. 031911.
[199] Bart Kosko and Sanya Mitaim. “Robust Stochastic Resonance for Simple Threshold Neurons”. In: Physical Review E 70 (2004), pp. 031911-1–031911-10.
[200] Bart Kosko and Sanya Mitaim. “Robust Stochastic Resonance: Signal Detection and Adaptation in Impulsive Noise”. In: Physical Review E 64 (2001), pp. 051110-1–051110-11.
[201] Bart Kosko and Sanya Mitaim. “Stochastic resonance in noisy threshold neurons”. In: Neural Networks 16.5 (2003), pp. 755–761.
[202] S. C. Kou and S. G. Kou. “Modeling growth stocks via birth-death processes”. In: Advances in Applied Probability 35.3 (2003), pp. 641–664.
[203] L. S. Koutika et al. “Chemical properties and soil organic matter assessment under fallow systems in the forest margins benchmark”. In: Soil biology and biochemistry 34.6 (2002), pp. 757–765.
[204] A. M. L. Kraepiel, K. Keller, and F. M. M. Francois. “On the acid-base chemistry of permanently charged minerals”. In: Environmental Science and Technology 32.19 (1998), pp. 2829–2838.
[205] O. Krafft and M. Schaefer. “Mean passage times for triangular transition matrices and a two parameter Ehrenfest urn model”. In: Journal of Applied Probability 30 (1993), pp. 964–970.
[206] N. V. Krylov. Introduction to the Theory of Diffusion Processes. Providence, Rhode Island: American Mathematical Society, 1995.
[207] K. H. Kuong. “Simulation of cation exchange involving hydrogen ion in soil”. In: Soil Science Society of America journal 58.4 (1994), pp. 1086–1094.
[208] Sergio Ledesma, Jose Ruiz, and Guadalupe Garcia. “Simulated Annealing Evolution”. In: Simulated Annealing - Advances, Applications and Hybridizations. Ed. by Marcos de Sales Guerra Tsuzuki. InTech, 2012.
[209] I. Lee et al. “Noise-enhanced Detection of Subthreshold Signals in Carbon Nanotubes”.
In: IEEE Transactions on Nanotechnology (to appear, 2005).
[210] Ian Lee et al. “Noise-enhanced detection of subthreshold signals with carbon nanotubes”. In: IEEE Trans. Nanotechnol (2006), pp. 613–627.
[211] R. S. Lee et al. “Conductivity enhancement in single-walled carbon nanotube bundles doped with K and Br”. In: Nature 388.6639 (1997), pp. 255–257.
[212] S. M. Lee et al. “A hydrogen storage mechanism in single-walled carbon nanotubes”. In: Journal of the American Chemical Society 21.21 (2001), pp. 5059–5063.
[213] S. M. Lee et al. “Hydrogen adsorption and storage in carbon nanotubes”. In: Synthetic Metals 113.3 (2000), pp. 209–216.
[214] J. Lehto et al. “Removal of heavy metals from metallurgical process and waste waters with selective ion exchangers”. In: Proceedings of the TMS Fall Extraction and Processing Conference 3 (1999), pp. 2449–2458.
[215] M. Lenarda et al. “Chemistry of [Ru3O2(NH3)14]Cl6 (ruthenium red) intercalated in a smectite clay. Thermal behaviour, reactivity with CO and CO/H2, catalytic activity”. In: Journal of Molecular Catalysis 67.3 (1991), pp. 295–307.
[216] John Edward Lennard-Jones. “On the Determination of Molecular Fields. I. From the Variation of the Viscosity of a Gas with Temperature”. In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 106.738 (1924), pp. 441–462.
[217] John Edward Lennard-Jones. “On the Determination of Molecular Fields. II. From the Equation of State of a Gas”. In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 106.738 (1924), pp. 463–477.
[218] Thomas Lenormand, Denis Roze, and François Rousset. “Stochasticity in evolution”. In: Trends in Ecology & Evolution 24.3 (2009), pp. 157–165.
[219] S. Letaief et al. “Fe-containing pillared clays as catalysts for phenol hydroxylation”. In: Applied clay science 22.6 (2003), pp. 263–277.
[220] Zoe A. D. Lethbridge et al.
“Methods for the synthesis of large crystals of silicate zeolites”. In: Microporous and Mesoporous Materials 79.1-3 (2005), pp. 339–352.
[221] Jacob E. Levin and John P. Miller. “Broadband neural encoding in the cricket cereal sensory system enhanced by stochastic resonance”. In: Nature 380.6570 (1996), pp. 165–168.
[222] C. S. Lin et al. “Performance of randomized block designs in field experiments”. In: Agronomy journal 85.1 (1993), pp. 168–171.
[223] Yiqin Lin, Xinghua Shi, and Yimin Wei. “On computing PageRank via lumping the Google matrix”. In: Journal of Computational and Applied Mathematics 224.2 (2009), pp. 702–708.
[224] Martin Lindén and Mats Wallin. “Dwell Time Symmetry in Random Walks and Molecular Motors”. In: Biophysical Journal 92.11 (2007), pp. 3804–3816.
[225] Kristian Lindgren. “Microscopic and macroscopic entropy”. In: Phys. Rev. A 38.9 (1988), pp. 4794–4798.
[226] Y. Liu, H. Yukawa, and M. Morinaga. “First-principles study on lithium absorption in carbon nanotubes”. In: Computational Materials Science 30.1-2 (2004), pp. 50–56.
[227] Andrew Lucas. “Ising formulations of many NP problems”. In: Frontiers in Physics 2.February (2014), pp. 1–15.
[228] Mahindra T. Makhija and Vithal M. Kulkarni. “3D-QSAR and molecular modeling of HIV-1 integrase inhibitors”. In: Journal of computer-aided molecular design 16.3 (2002), pp. 181–200.
[229] R. L. Malcom and V. C. Kennedy. “Variation of cation exchange capacity and rate with particle size in stream sediment”. In: Water Pollution Control Federation 42.2 (1970), R153–R160.
[230] R. N. Mantegna and H. E. Stanley. “Scaling behavior in the dynamics of an economic index”. In: Nature 376 (1995), pp. 46–49.
[231] Roman Martonák, Giuseppe Santoro, and Erio Tosatti. “Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model”. In: Physical Review B 66.9 (2002), pp. 1–8.
[232] M. Matache et al.
“Heavy metals contamination of soils surrounding waste deposits in Romania”. In: Journal De Physique. IV: JP 107.II (2003), pp. 851–854.
[233] A. D. Maynard et al. “Exposure to Carbon Nanotube Material I: Aerosol Release During the Handling of Unrefined Single Walled Carbon Nanotube Material”. In: Journal of Toxicological Environmental Health 67.1 (2004), pp. 87–107.
[234] Huston J. Mcculloch. “Financial applications of stable distributions”. In: Statistical methods in finance. Vol. 14. Handbook of Statistics. Amsterdam: North-Holland, 1996, pp. 393–425.
[235] J. Huston McCulloch. “Simple consistent estimators of stable distribution parameters”. In: Communications in Statistics - Simulation and Computation 15.4 (1986), pp. 1109–1136.
[236] Mark D. McDonnell et al. “Optimal information transmission in nonlinear arrays through suprathreshold stochastic resonance”. In: Physics Letters A 352.3 (2006), pp. 183–189.
[237] Mark D. McDonnell et al. Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization. Cambridge, England: Cambridge University Press, 2008.
[238] K. Mengel and E. A. Kirby. Principles of plant nutrition. 4th. Berne, Switzerland: International potash institute, 1987.
[239] N. Metropolis et al. “Equations of State Calculations by Fast Computing Machines”. In: Journal of Chemical Physics 21 (1953), pp. 1087–1091.
[240] N. Metropolis et al. “Equations of State Calculations by Fast Computing Machines”. In: Journal of Chemical Physics 21 (1953), pp. 1087–1091.
[241] Ralf Metzler and Joseph Klafter. “The restaurant at the end of the random walk: recent developments in the description of anomalous transport by fractional dynamics”. In: Journal of Physics A: Mathematical and General 37.31 (2004), R161–R208.
[242] Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. 2nd. Cambridge, England: Cambridge University Press, 2009.
[243] Sean P. Meyn and Richard L. Tweedie.
Markov chains and stochastic stability. Springer Science & Business Media, 2012.
[244] Z. Mingwen. “Chemical reactivity of single-walled carbon nanotubes to amidogen from density functional calculations”. In: Journal of Physical Chemistry B 108.28 (2004), pp. 9599–9603.
[245] Sanya Mitaim and Bart Kosko. “Adaptive Stochastic Resonance”. In: Proceedings of the IEEE: Special Issue on Intelligent Signal Processing. New York: IEEE, 1998, pp. 2152–2183.
[246] Sanya Mitaim and Bart Kosko. “Adaptive stochastic resonance in noisy neurons based on mutual information”. In: IEEE Trans. Neural Netw (2004), pp. 1526–1540.
[247] Sanya Mitaim and Bart Kosko. “Noise-Benefit Forbidden-Interval Theorems for Threshold Signal Detectors based on Cross Correlations”. In: Phys Rev E 90.5 (2014), p. 052124.
[248] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge, England: Cambridge University Press, 2005.
[249] H. Mühlenbein, M. Schomisch, and J. Born. “The parallel genetic algorithm as function optimizer”. In: Parallel Computing 17.6-7 (1991), pp. 619–632.
[250] P. Mukherjee and A. K. Sengupta. “Ion exchange selectivity as a surrogate indicator of relative permeability of ions in reverse osmosis processes”. In: Environmental Science and Technology 37.7 (2003), pp. 1432–1440.
[251] N. Murayama et al. “Reaction, mechanism and application of various zeolite syntheses from coal fly ash”. In: Materials Transactions 44.12 (2003), pp. 2475–2480.
[252] M. Buongiorno Nardelli, J.-L. Fattebert, and J. Bernhole. “Quantum transport in nanotube-based structures”. In: Materials Research Society Symposium - Proceedings 706 (2002), pp. 253–262.
[253] R. L. L. Ness and P. L. G. Vlek. “Mechanism of calcium and phosphate release from hydroxy-apatite by mycorrhizal hyphae”. In: Soil Science Society of America journal 64.3 (2000), pp. 949–955.
[254] Michael A. Nielsen and Isaac L. Chuang.
Quantum computation and quantum information. Cambridge University Press, 2010.
[255] Chrysostomos L. Nikias and Min Shao. Signal processing with alpha-stable distributions and applications. Wiley-Interscience, 1995.
[256] A. K. Nikumbh, R. V. Patil, and S. F. Patil. “Heavy metal analysis of surface waters, ground waters, soils and industrial effluents from Pune area”. In: Aqua (Oxford) 47.5 (1998), pp. 245–253.
[257] S. Nir et al. “Model for cation adsorption to clays and membranes”. In: Colloid and Polymer Science 272.6 (1994), pp. 619–632.
[258] D. J. Nixon and L. P. Simmonds. “The impact of fallowing and green manuring on soil conditions and the growth of sugarcane”. In: Experimental agriculture 40 (2004), pp. 127–138.
[259] J. P. Nolan. Stable Distributions — Models for Heavy Tailed Data. Boston: Birkhauser, 2011.
[260] John P. Nolan. Stable v5.3 for MatLab. 2011.
[261] Artem S. Novozhilov, Georgy P. Karev, and Eugene V. Koonin. “Biological applications of the theory of birth-and-death processes”. In: Briefings in Bioinformatics 7.1 (2006), pp. 70–85.
[262] Esa Nummelin. “Uniform and ratio limit theorems for Markov renewal and semi-regenerative processes on a general state space”. In: Annales de l’IHP Probabilités et statistiques. Vol. 14. 2. 1978, pp. 119–143.
[263] Anneli Ojajärvi et al. “Estimating the relative risk of pancreatic cancer associated with exposure agents in job title data in a hierarchical Bayesian meta-analysis”. In: Scandinavian journal of work, environment & health (2007), pp. 325–335.
[264] K. Oorts et al. “A new method for the simultaneous measurement of pH-dependent cation exchange capacity and pH buffering capacity”. In: Soil Science Society of America journal 68 (2004), pp. 1578–1585.
[265] Lawrence Page. U.S. Patent No. 6285999: Method for node ranking in a linked database. 2001.
[266] José Luis Palacios and Prasad Tetali. “A note on expected hitting times for birth and death chains”.
In: Statistics & Probability Letters 30.2 (1996), pp. 119–125.
[267] P. Parmananda et al. “Stochastic resonance of electrochemical aperiodic spike trains”. In: Phys. Rev. E 71 (2005), pp. 031110–031115.
[268] Ashok Patel and Bart Kosko. “Error-probability noise benefits in threshold neural signal detection”. In: Neural Networks 22 (2009), pp. 697–706.
[269] Ashok Patel and Bart Kosko. “Noise Benefits in Quantizer-Array Correlation Detection and Watermark Decoding”. In: IEEE Transactions on Signal Processing 59.2 (2011), pp. 488–505.
[270] Ashok Patel and Bart Kosko. “Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation”. In: IEEE Signal Processing Letters 17.12 (2010), pp. 1005–1009.
[271] Ashok Patel and Bart Kosko. “Optimal Noise Benefits in Neyman-Pearson and Inequality-Constrained Statistical Signal Detection”. In: IEEE Trans. on Signal Process. 57.5 (2009), pp. 1655–1669.
[272] Ashok Patel and Bart Kosko. “Stochastic Resonance in Continuous and Spiking Neuron Models With Lévy Noise”. In: IEEE Trans. on Neural Networks 19.12 (2008), pp. 1993–2008.
[273] Ashok Patel and Bart Kosko. “Stochastic Resonance in Continuous and Spiking Neuron Models with Levy Noise”. In: IEEE Transactions on Neural Networks to appear (2008).
[274] Ashok Patel and Bart Kosko. “Stochastic resonance in noisy spiking retinal and sensory neuron models”. In: Neural Netw. 18.5-6 (2005), pp. 467–478.
[275] Xing Pei, Lon Wilkens, and Frank Moss. “Noise-mediated spike timing precision from aperiodic stimuli in an array of Hodgkin-Huxley-type neurons”. In: Physical review letters 77.22 (1996), p. 4679.
[276] E. Peiter, F. Yan, and S. Schubert. “Are mineral nutrients a critical factor for lime intolerance of lupins?” In: Journal of plant nutrition 23.5 (2000), pp. 617–635.
[277] K. D. Pennell, R. D. Rhue, and A. G. Hornsby. “Competitive adsorption of para-xylene and water vapors on calcium, sodium, and lithium-saturated kaolinite”.
In: Journal of environmental quality 21.3 (1992), pp. 419–426.
[278] K. A. Penson and K. Górska. “Exact and Explicit Probability Densities for One-Sided Lévy Stable Distributions”. In: Phys. Rev. Lett. 105.21 (2010), p. 210604.
[279] U. R. Pillai and E. Sahle-Demessie. “Oxidation of alcohols over Fe3+/montmorillonite-K10 using hydrogen peroxide”. In: Applied Catalysis A: General 245.1 (2003), pp. 103–109.
[280] F. van der Pol and B. Traore. “Soil nutrient depletion by agricultural production in Southern Mali”. In: Fertilizer research 36.1 (1993), pp. 79–90.
[281] Santi Prestipino. “A probabilistic model for the equilibration of an ideal gas”. In: Physica A: Statistical Mechanics and its Applications 340.1-3 (2004), pp. 373–379.
[282] Graham Pulford, Rodney A. Kennedy, and Shin-Ho Chung. “Identification of individual channel kinetics from recordings containing many identical channels”. In: Signal Processing 43.2 (1995), pp. 207–221.
[283] Hein Putter and Willem R. van Zwet. “Resampling: consistency of substitution estimators”. In: The Annals of Statistics 24.6 (1996), pp. 2297–2318.
[284] X. Qin et al. “Electrochemical hydrogen storage of multiwalled carbon nanotubes”. In: Electrochemical and Solid-State Letters 3.12 (2000), pp. 532–535.
[285] Y. Qin et al. “Concise route to functionalized carbon nanotubes”. In: Journal of Physical Chemistry: B 107.47 (2003), pp. 12899–12901.
[286] Jule A. Rabo and Michael W. Schoonover. “Early discoveries in zeolite chemistry and catalysis at Union Carbide, and follow-up in industrial catalysis”. In: Applied Catalysis A: General 222.1-2 (2001), pp. 261–275.
[287] S. Rachev and S. Mittnik. Stable Paretian Models in Finance. New York: Wiley, 2000.
[288] Adrian E. Raftery, Steven Lewis, et al. “How many iterations in the Gibbs sampler”. In: Bayesian statistics 4.2 (1992), pp. 763–773.
[289] P. Ray, B. K. Chakrabarti, and Arunava Chakrabarti.
“Sherrington-Kirkpatrick model in a transverse field: Absence of replica symmetry breaking due to quantum fluctuations”. In: Phys. Rev. B 39.16 (1989), pp. 11828–11832.
[290] D. A. Reid et al. “Smectite mineralogy and charge characteristics along an arid geomorphic transect”. In: Soil Science Society of America journal 60.5 (1996), pp. 1602–1611.
[291] Eric Renshaw. “Metropolis-Hastings from a stochastic population dynamics perspective”. In: Computational Statistics & Data Analysis 45.4 (2004), pp. 765–786.
[292] J. D. Rhoades. “Methods of soil analysis, Part 2: Chemical and microbiological properties”. In: 2nd. Agronomy Monographs. Madison, WI: American Society of Agronomy, 1982. Chap. Cation exchange capacity, pp. 149–157.
[293] Brian D. Ripley. Stochastic simulation. Vol. 316. John Wiley & Sons, 2009.
[294] C. R. Robert and G. Casella. Monte Carlo Statistical Methods. 2nd. New York: Springer, 2004.
[295] Christian Robert and George Casella. “A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data”. In: Statistical Science 26.1 (2011), pp. 102–115.
[296] Christian P. Robert and George Casella. Monte Carlo statistical methods (Springer Texts in Statistics). 2nd. Springer-Verlag, 2005.
[297] Gareth O. Roberts. “A note on acceptance rate criteria for CLTs for Metropolis-Hastings algorithms”. In: Journal of Applied Probability (1999), pp. 1210–1217.
[298] Gareth O. Roberts and Jeffrey S. Rosenthal. “Geometric ergodicity and hybrid Markov chains”. In: Electron. Comm. Probab 2.2 (1997), pp. 13–25.
[299] Gareth O. Roberts, Jeffrey S. Rosenthal, et al. “General state space Markov chains and MCMC algorithms”. In: Probability Surveys 1 (2004), pp. 20–71.
[300] A. Rochefort, D. R. Salahub, and P. Avouris. “Effects of finite length on the electronic structure of carbon nanotubes”. In: Journal of Physical Chemistry B 103.4 (1999), pp. 641–646.
[301] E. Roose and B. Barthes.
“Organic matter management for soil conservation and productivity restoration in Africa: a contribution from Francophone research”. In: Nutrient cycling in agroecosystems 61.1-2 (2001), pp. 159–170.
[302] Jeffrey S. Rosenthal. “A review of asymptotic convergence for general state space Markov chains”. In: Far East J. Theor. Stat 5.1 (2001), pp. 37–50.
[303] Jeffrey S. Rosenthal. “Quantitative convergence rates of Markov chains: A simple account”. In: Electronic Communications in Probability 7.13 (2002), pp. 123–128.
[304] Jeffrey Seth Rosenthal. A first look at rigorous probability theory. World Scientific, 2006.
[305] David Rousseau and François Chapeau-Blondeau. “Constructive role of noise in signal detection from parallel arrays of quantizers”. In: Signal Process. 85.3 (2005), pp. 571–580.
[306] L. A. Rowley, D. Nicholson, and N. G. Parsonage. “Monte Carlo grand canonical ensemble calculation in a gas-liquid transition region for 12-6 argon”. In: Journal of Computational Physics 17.4 (1975), pp. 401–414.
[307] Aditya A. Saha and G. V. Anand. “Design of detectors based on stochastic resonance”. In: Signal Process. 83.6 (2003), pp. 1193–1212.
[308] B. Saha et al. “Sorption of trace heavy metals by thiol containing chelating resins”. In: Solvent Extraction and Ion Exchange 18.1 (2000), pp. 133–167.
[309] D. T. Salvito et al. “Comparison of trace metals in the intake and discharge water of power plants using “clean” techniques”. In: Water Environment Research 73.1 (2001), pp. 24–29.
[310] G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic models with infinite variance. Boca Raton, FL: Chapman and Hall/CRC, 2000.
[311] M. J. Sanchez-Martin and M. Sanchez-Camazano. “Adsorption and mobility of cadmium in natural, uncultivated soils”. In: Journal of Environmental Quality 22.4 (1993), pp. 737–742.
[312] J. C. Santamarina. Soils and Waves: Particulate Materials Behavior, Characterization and Process Monitoring.
Chichester, United Kingdom: John Wiley and Sons, 2001.
[313] L. A. Schipper et al. “Anaerobic decomposition and denitrification during plant decomposition in an organic soil”. In: Journal of environmental quality 23.5 (1994), pp. 923–928.
[314] Lawrence S. Schulman. Techniques and applications of path integration. Courier Corporation, 2012.
[315] Klaus Schulten, Zan Schulten, and Attila Szabo. “Reactions governed by a binomial redistribution process–The Ehrenfest urn problem”. In: Physica A: Statistical and Theoretical Physics 100.3 (1980), pp. 599–614.
[316] Hans-Paul Schwefel. Numerical Optimization of Computer Models. New York, NY, USA: John Wiley & Sons, Inc., 1981.
[317] Sergio Boixo, Troels F. Rønnow, et al. “Evidence for quantum annealing with more than one hundred qubits”. In: Nature Physics 10 (2014), pp. 218–224.
[318] M. A. Shepherd and G. Bennett. “Nutrient leaching losses from a sandy soil in lysimeters”. In: Communications in soil science and plant analysis 29.7-8 (1998), pp. 931–946.
[319] Ronald W. Shonkwiler and Franklin Mendivil. Explorations in Monte Carlo Methods. New York: Springer, 2009.
[320] D. E. Smika. “Fallow management practices for wheat production in the Central Great Plains”. In: Agronomy journal 82.2 (1990), pp. 319–323.
[321] Adrian Smith et al. Sequential Monte Carlo methods in practice. Springer Science & Business Media, 2013.
[322] Peter J. Smith, Mansoor Shafi, and Hongsheng Gao. “Quick simulation: A review of importance sampling techniques in communications systems”. In: Selected Areas in Communications, IEEE Journal on 15.4 (1997), pp. 597–613.
[323] I. M. Sokolov, J. Klafter, and A. Blumen. “Fractional Kinetics”. In: Physics Today 55.11 (2002), pp. 110000–55.
[324] G. Solomons and C. Fryhle. Organic Chemistry. 7th. New York: John Wiley and Sons Inc, 2000.
[325] Zhe Song et al. “Mining Markov chain transition matrix from wind speed time series data”. In: Expert Systems with Applications 38.8 (2011), pp. 10229–10239.
[326] O. Sotolongo-Costa et al. “Lévy Flights and Earthquakes”. In: Geophys. Res. Lett. 27 (2002), pp. 1965–1967.
[327] E. S. Sousa. “Performance of a spread spectrum packet radio network link in a Poisson field of interferers”. In: IEEE Transactions on Information Theory 38.6 (1992), pp. 1743–1754.
[328] R. R. Spalding et al. “Controlling nitrate leaching in irrigated agriculture”. In: Journal of environmental quality 30.4 (2001), pp. 1184–1194.
[329] G. Sposito. The Surface Chemistry of Soils. New York: Oxford University Publishing, 1984.
[330] Rajan Srinivasan. Importance sampling: Applications in communications and detection. Springer Science & Business Media, 2013.
[331] Martin Stemmler. “A single spike suffices: the simplest form of stochastic resonance in model neurons”. In: Network: Computation in Neural Systems 7.4 (1996), pp. 687–716.
[332] W. M. Stewart and L. R. Hossner. “Factors affecting the ratio of cation exchange capacity to clay content in lignite overburden”. In: Journal of environmental quality 30.4 (2001), pp. 1143–1149.
[333] Michael Stöcker. “Gas phase catalysis by zeolites”. In: Microporous and Mesoporous Materials 82.3 (2005), pp. 257–292.
[334] J. H. Suh and D. S. Kim. “Comparison of different sorbents (inorganic and biological) for the removal of Pb2+ from aqueous solutions”. In: Journal of Chemical Technology and Biotechnology 75.4 (2000), pp. 279–284.
[335] Yun Ju Sung and Charles J. Geyer. “Monte Carlo Likelihood Inference for Missing Data Models”. In: The Annals of Statistics 35.3 (2007), pp. 990–1011.
[336] Z. K. Tang et al. “Ultra-small single-walled carbon nanotubes and their superconductivity properties”. In: Synthetic Metals 133-134 (2003), pp. 689–693.
[337] Dan Taylor. Evolutionary Algorithms and the Traveling Salesman. http://logicalgenetics.com/eaintro/ProjectGA1.exe. 2015.
[338] R. D. Taylor, P. J. Jewsbury, and J. W. Essex. “A review of protein-small molecule docking methods”.
In: Journal of Computer-Aided Molecular Design 16.3 (2002), pp. 151–166.
[339] Mauricio Terrones. “Science and Technology of the Twenty-First Century: Synthesis, Properties, and Applications of Carbon Nanotubes”. In: Annual Review of Materials Research 33 (2003), pp. 419–501.
[340] Dirk Thierens. Adaptive Mutation Rate Control Schemes in Genetic Algorithms.
[341] L. Tierney. “Markov Chains for Exploring Posterior Distributions”. In: The Annals of Statistics 22 (1994), pp. 1701–1762.
[342] H. Tiessen, E. V. S. B. Sampaio, and I. H. Salcedo. “Organic matter turnover and management in low input agriculture of NE Brazil”. In: Nutrient cycling in agroecosystems 61.1-2 (2001), pp. 99–103.
[343] L. Tong. “Chemistry of carbon nanotubes”. In: Australian Journal of Chemistry 56.7 (2003), pp. 635–651.
[344] G. A. Tsihrintzis and C. L. Nikias. “Fast estimation of the parameters of alpha-stable impulsive interference”. In: IEEE Transactions on Signal Processing 44.6 (1996), pp. 1492–1503.
[345] G. A. Tsihrintzis and C. L. Nikias. “Performance of optimum and suboptimum receivers in the presence of impulsive noise modeled as an alpha-stable process”. In: IEEE Transactions on Communications 43.234 (1995), pp. 904–914.
[346] Vladimir V. Uchaikin and Vladimir M. Zolotarev. Chance and Stability: Stable Distributions and their Applications. Utrecht: VSP, 1999.
[347] A. Vahedi-Faridi and S. Guggenheim. “Structural study of tetramethylphosphonium-exchanged vermiculite”. In: Clays and clay minerals 47.2 (1999), pp. 219–225.
[348] K. Rajkai Vegh. “Effect of soil water and nutrient supply on root characteristics and nutrient uptake of plants”. In: Developments in agricultural and managed-forest ecology 24 (1991), pp. 143–148.
[349] L. C. Venema et al. “Imaging electron wave functions of quantized energy levels in carbon nanotubes”. In: Science 293.5398 (1999), pp. 52–55.
[350] M. A. Vicente et al.
“Preparation and characterisation of Mn- and Co-supported catalysts derived from Al-pillared clays and Mn- and Co-complexes”. In: Applied Catalysis A: General 267.1-2 (2004), pp. 47–58.
[351] José M. G. Vilar and J. M. Rubi. “Stochastic multiresonance”. In: Physical review letters 78.15 (1997), p. 2882.
[352] R. H. Walker. “The need for statistical control in soils experiments”. In: Journal of the American Society of Agronomy 29.8 (1937), pp. 650–657.
[353] Shaobin Wang and Yuelian Peng. “Natural zeolites as effective adsorbents in water and wastewater treatment”. In: Chemical Engineering Journal 156.1 (2010), pp. 11–24.
[354] Wenyi Wang et al. “PancPRO: risk assessment for individuals with a family history of pancreatic cancer”. In: Journal of clinical oncology 25.11 (2007), pp. 1417–1422.
[355] Z. L. Wang. “Nano-scale mechanics of nanotubes, nanowires, and nanobelts”. In: Advanced Engineering Materials 3.9 (2001), pp. 657–661.
[356] D. B. Warheit et al. “Comparative Pulmonary Toxicity Assessment of Single-wall Carbon Nanotubes in Rats”. In: Toxicological Sciences 77.1 (2004), pp. 117–125.
[357] G. P. Warren and F. M. Kihanda. “Nitrate leaching and adsorption in a Kenyan Nitisol”. In: Soil use and management 17.4 (2001), pp. 222–228.
[358] John Watrous. “Quantum Computational Complexity”. In: (2008), pp. 1–44.
[359] D. Waxman. “Comparison and content of the Wright-Fisher model of random genetic drift, the diffusion approximation, and an intermediate model”. In: Journal of Theoretical Biology 269.1 (2011), pp. 79–87.
[360] D. Waxman. “Fixation at a locus with multiple alleles: Structure and solution of the Wright Fisher model”. In: Journal of Theoretical Biology 257.2 (2009), pp. 245–251.
[361] Jens Weitkamp. “Zeolites and catalysis”. In: Solid State Ionics 131.1-2 (2000), pp. 175–188.
[362] R. Weron. “On the Chambers-Mallows-Stuck method for simulating skewed stable random variables”. In: Statistics and Probability Letters 28 (1996), pp. 165–171.
[363] G. M.
Whitesides and C. S. Weisbecker. “Measurements of Conductivity of Individual 10 nm Carbon Nanotubes”. In: Materials Research Symposium Proceedings, vol. 349, Novel Forms of Carbon. 1994, pp. 263–268. [364] Darrell Whitley et al. “Evaluating evolutionary algorithms”. In: Artificial Intelligence 85.1-2 (1996), pp. 245–276. [365] Mark M Wilde and Bart Kosko. “Quantum forbidden-interval theorems for stochastic resonance”. In: J. Phys. A: Math. Theor. 42 (2009), pp. 465309–465331. [366] R. C. Woollons and A. G. D. Whyte. “Analysis of forest fertilizer experiments: obtaining better precision and extracting more information”. In: Forest Science (1988), pp. 769–780. [367] Sewall Wright. “Evolution in Mendelian Populations”. In: Genetics 16 (1931), pp. 97–159. [368] Minghong G. Wu et al. “Synthesis and Structure Determination by ZEFSAII of SSZ-55: A New High-Silica, Large-Pore Zeolite”. In: The Journal of Physical Chemistry B 106.2 (2002), pp. 264–270. [369] R.-H. Xie et al. “Tailorable acceptor C60−nBn and donor C60−mNm pairs for molecular electronics”. In: Physical Review Letters 90.20 (2003), pp. 206602.1–206602.4. [370] Izumi Yamane and Tadahisa Nakazawa. “Development of Zeolite for Non-Phosphated Detergents in Japan”. In: New Developments in Zeolite Science and Technology, Proceedings of the 7th International Zeolite Conference. Ed. by Y. Murakami, A. Iijima, and J. W. Ward. Vol. 28. Studies in Surface Science and Catalysis. Elsevier, New York, 1986, pp. 991–1000. [371] R. D. Yanai. “A steady-state model of nutrient uptake accounting for newly grown roots”. In: Soil Science Society of America Journal 58.5 (1994), pp. 1562–1571. [372] Jin Yang et al. “On Imposing Detailed Balance in Complex Reaction Mechanisms”. In: Biophysical Journal 91.3 (2006), pp. 1136–1141. [373] Ahmet Yildiz et al. “Kinesin walks hand-over-hand”. In: Science 303.5658 (2004), pp. 676–678. [374] Q. Yujun. “Concise Route to Functionalized Carbon Nanotubes”.
In: Journal of Physical Chemistry B 107.47 (2003), pp. 12899–12907. [375] Yeliz Yukselen-Aksoy. “Characterization of two natural zeolites for geotechnical and geoenvironmental applications”. In: Applied Clay Science 50.1 (2010), pp. 130–136. [376] M. Zekri and R. C. J. Koo. “Application of micronutrients to citrus trees through microirrigation systems”. In: Journal of Plant Nutrition 15.11 (1992), pp. 2517–2529. [377] VM Zolotarev. One-dimensional Stable Distributions. Vol. 65. American Mathematical Soc., 1986. [378] S. S. Zumdahl. Chemistry. 4th. Boston, MA: Houghton Mifflin, 2000. Appendix A Bootstrap-based Estimation of the Bell-Curve Tail Thickness of Symmetric Alpha-Stable Random Variables A new bootstrap algorithm estimates the tail thickness of symmetric alpha-stable probability bell curves that model impulsive physical phenomena with energetic fluctuations. Special cases of symmetric alpha-stable probability densities include the thin-tailed Gaussian and thick-tailed Cauchy bell curves. We call the algorithm the BEAST or bootstrap estimator of alpha-stable sequences algorithm. The BEAST algorithm computes a test statistic from stable random samples and then matches the test statistic against a continuum of precomputed values to find the estimated tail thickness. A theorem and a corollary show that the test statistic is invertible because it is a continuous bijection. So the bootstrapped estimator is consistent. The bootstrap structure allows the BEAST algorithm to estimate tail thicknesses based on only a few samples. Simulations show that the algorithm performs similarly to other estimators on large samples. A.1 Estimating Symmetric α-Stable Tail Thickness We develop a bootstrap-based algorithm that can estimate the bell-curve tail thickness or impulsiveness α of symmetric α-stable (SαS) random samples [278, 33, 106, 138, 255]. The algorithm is especially effective for small sample sizes.
SαS probability density functions (pdfs) are symmetric bell curves whose tails get thicker as the parameter α gets smaller for α in (0, 2]. The Gaussian pdf has the thinnest tails of all and corresponds to α = 2. Figure A.1 shows this inverse relation between α and tail thickness for four α values: α = 2, 1.4, 1 (the Cauchy case), and 0.4. The white-noise plots of Figure A.2 show how α controls the corresponding impulsiveness or fluctuations of the random samples. The algorithm estimates α by interpolating a sample statistic between precomputed values. Thick-tailed SαS pdfs have found many applications in physics and engineering where thicker tails can model energetic or impulsive processes [377, 346]. Natural sources of impulsive signals or noise include condensed and soft matter physics [241, 152, 120], geophysics [326], meteorology [310], biology [91], economics [234, 230, 287], fractional kinetics [323, 54], communications and signal processing [255, 327, 194, 345, 344, 188, 269], quantum communications [365], and noisy neural networks [272, 198, 246, 245, 201, 247]. Most random models assume that the dispersion of a random variable equals its (finite) variance or its mean-squared deviation from the population mean. Impulsive signals or noise violate this finite-variance assumption in general. They have finite dispersions but not finite variances or any finite higher-order moments. The moments of an α-stable random variable are finite only up to order k for k < α if α < 2. The Gaussian α-stable random variable alone has a finite second moment and finite higher-order moments. The Gaussian also has thin exponential tails while the other stable bell curves have thicker power-law tails. Finite-variance models may underestimate real-world fluctuation magnitudes. They may also wrongly dismiss important “rare” events as outliers. But a thicker-tail model may give up in mathematical tractability what it gains in accuracy.
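Impulsive SαS samples of the sort plotted in Figure A.2 can be drawn with the Chambers-Mallows-Stuck transformation that reference [362] analyzes. The sketch below is an assumption about implementation detail (the dissertation used the Stable MatLab Toolbox [260]); it covers only the symmetric case β = 0 with unit dispersion.

```python
import math
import random

def sas_sample(alpha, rng=random):
    """Draw one standard SaS variate (beta = 0, unit dispersion) via the
    Chambers-Mallows-Stuck transform of a uniform angle U and a
    unit-exponential W."""
    u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = rng.expovariate(1.0)
    if alpha == 1.0:
        return math.tan(u)  # the Cauchy case falls out directly
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)) * \
        (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha)
```

For α = 2 the formula reduces to 2 sin(U)√W, a Gaussian with variance 2, which matches the unit-dispersion Gaussian case below; for α < 2 the rare huge draws reproduce the impulsiveness that the finite-variance discussion above describes.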
The accurate choice of a bell-curve signal or noise model ultimately requires that empirical tests estimate the tail thickness. Appeal to the central limit theorem is not enough to decide the issue. The usual central limit theorem states that a standardized sum of finite-variance random variables converges in distribution to the standard normal or Gaussian random variable Z ∼ N(0, 1) [139, 89]. The generalized central limit theorem states a similar result for infinite-variance α-stable random variables [71, 259]: a standardized sum of α-stable random variables converges in distribution to an α-stable random variable with the same α. The generalized central limit theorem also holds only for α-stable random variables. So pointing to a sum of random samples implies only that the underlying pdf is α-stable. It alone does not imply that the pdf is the exponentially thin-tailed Gaussian. There are two main problems with using stable pdfs in physical models. The first is that only a few stable pdfs have had a known closed form. These special cases include the SαS pdfs of the Gaussian (α = 2) and the Cauchy or Lorentzian (α = 1) as well as the asymmetric Lévy (α = 0.5). So most algorithms have relied on power-series approximations [259]. Penson and Gorska made a recent breakthrough by finding closed-form pdfs for all rational values of α [278, 135]. They represent such rational pdfs as finite sums of generalized hypergeometric functions. Their result largely overcomes the first problem. But it does not address the second and more practical problem: How does a user estimate α from sample data? We present the bootstrap-driven BEAST (Bootstrap Estimate of Alpha STable) algorithm as a practical way to estimate α from small or large data sets. Simulations show that this statistically consistent estimator outperforms other estimators on small data sets and performs similarly on large data sets (for sample sizes n ≥ 20).
The BEAST algorithm estimates α by interpolating between precomputed values. Bootstrapping allows the algorithm to estimate α from only a small set of random samples. Section A.3 presents the Bootstrap Estimator of α-Stable Sequences (BEAST) algorithm. The BEAST algorithm estimates α from a sequence of observed SαS random samples. The algorithm computes the estimate α̂ in two stages: (1) it constructs a map τ: α → τ(X_α) from α to a test statistic and (2) it uses the inverse map τ⁻¹: τ(X_α) → α on the observed samples to compute α̂. The α-Stable Map Theorem in Section A.2 ensures that τ maps each α to a distinct value. A corollary shows that the map is a bijection and thus has an inverse that gives α̂ = τ⁻¹(τ(X_α)). The map is also continuous. So α̂ converges in probability to α and thus α̂ is a consistent estimator because the bootstrap statistic is a consistent estimator of τ(X_α) [139]. Figure A.9 and Table A.1 show that the algorithm applies to SαS random variables with α ∈ [0.2, 2]. The algorithm estimates α by bootstrapping and then inverting the bijection (Figure A.8). A theorem shows that each α corresponds to a unique value of a sample statistic. A corollary to the theorem shows that the map is a bijection. The proof does rely on an asymptotic expansion for the sub-Cauchy case of α < 2. The estimator α̂ applies to all finite-length sequences of independent and identically distributed (i.i.d.) SαS random variables.

Figure A.1: Symmetric α-stable probability density functions. The figure shows SαS probability density functions for α = 2 (Gaussian), 1.4 (super-Cauchy), 1 (Cauchy), and 0.4 (sub-Cauchy). The thickness of the bell-curve tails increases as α decreases. Thicker tails correspond to more impulsive samples with more energetic fluctuations. The Gaussian bell curve is the only SαS probability density function with finite moments of order k ≥ 2.

A.2 The α-Stable Map Theorem

This section states the main injection result and corollary that underlie the BEAST algorithm. The injection result relies in turn on three lemmas.
The appendix contains all proofs. We begin with a general definition of α-stable pdfs in terms of their characteristic functions or Fourier transforms. An α-stable pdf f has characteristic function φ [255, 201, 278, 247]:

    φ(ω) = exp{ iaω − γ|ω|^α [1 + iβ sgn(ω) Φ] }   (A.1)

where

    Φ = tan(πα/2)   for α ≠ 1
    Φ = (2/π) ln|ω|   for α = 1,   (A.2)

i = √−1, 0 < α ≤ 2, −1 ≤ β ≤ 1, and γ > 0.

Figure A.2: Impulsive samples from SαS random variables with unit dispersion. The figure shows SαS realizations for α = 2 (Gaussian), 1.4, 1.0 (Cauchy), and 0.4. The scale differs by two orders of magnitude between α = 2 and α = 1. The scale differs by four orders of magnitude between α = 1 and α = 0.4. Only the Gaussian samples have finite variance and no impulsiveness.

The parameter α is the characteristic exponent. It is a direct measure of the tail thickness for symmetric bell curves. The variance of an α-stable density does not exist if α < 2 even though such stable densities have lower-order fractional moments of all orders k < α. The location parameter a is the median of a symmetric stable density. β is a skewness parameter. The density is symmetric about a if β = 0. Then α controls the tail thickness of the resulting bell curve because the bell curve has thicker tails as α falls. The dispersion parameter γ acts like a variance because it controls the width of a symmetric α-stable bell curve even though again such densities have no variance except in the Gaussian case of α = 2. Numerical integration of φ gave the four probability densities in Figure A.1. The infinite-variance pdfs with α = 1.4 and α = 1 can easily appear to the eye as finite-variance Gaussian bell curves. The Bootstrap Estimator of α-Stable Sequences (BEAST) computes an estimate α̂ of the tail-thickness parameter α. It computes a sample statistic τ(X_α) from a sequence of observed SαS samples and then estimates α through the inverse map that takes τ(X_α) to α. The α-Stable Map Theorem guarantees that on average τ(X_α) is distinct for two sequences of SαS independent and identically distributed (i.i.d.)
random variables X_α1 and X_α2 if α1 ≠ α2. The BEAST algorithm relies on a corollary to ensure that the inverse exists. The corollary thus allows the algorithm to estimate α through τ⁻¹. The algorithm computes a test statistic for X_α that resembles an ℓ_p vector norm. The test statistic is finite because the p-th sample moment of a finite sequence of such realizations is finite for any finite p > 0. Suppose X_α is a sequence of n i.i.d. SαS random variables X with pdf f_α(x). Suppose the random variable has α ∈ (0, 2] and unit dispersion: γ = 1. Suppose further that n is finite. Define g_p by the length-normalized sample ℓ_p-norm for finite p > 0:

    g_p(x_α) = (1/n) ‖x_α‖_p^p   (A.3)

where ‖·‖_p is the usual ℓ_p-norm on R^n:

    ‖x‖_p^p = Σ_{k=1}^{n} |x_k|^p = |x_1|^p + |x_2|^p + ⋯ + |x_n|^p.   (A.4)

The α-Stable Map Theorem shows that g_p is injective (1-to-1) with respect to α. The corollary that follows shows that g_p is a continuous bijection. It also shows how to construct the continuous inverse g_p⁻¹. The BEAST algorithm uses this result to justify that α̂ is a consistent estimator of α.

Theorem A.1 (α-Stable Map Theorem). Suppose {X_α1,k} and {X_α2,k} are two independent sequences of n i.i.d. SαS random variables with probability density functions f_α1(x) and f_α2(x). Assume unit dispersion: γ = 1. Define the “intersection” function

    C(α1, α2) = { x > 0 : f_α1(x) = f_α2(x) }   (A.5)

as the set of positive points where the two pdfs f_α1 and f_α2 intersect. Also define the “valid” set

    T(b) = { α ∈ (0, 2] : sup C(α, α̃) < b for all α̃ ∈ (0, 2] }   (A.6)

as the set of α such that none of the corresponding SαS pdfs intersect beyond b. That is, C(α1, α2) < b for all α1, α2 ∈ T(b). Define the “test” set A as any finite subset of T(b).
Define the “gap” function

    d(α1, α2) = ∫_{−b}^{b} |x|^p ( f_α1(x) − f_α2(x) ) dx,   (A.7)

let

    D = max{ d(α1, α2) : α1, α2 ∈ A } < ∞,   (A.8)

and define

    W(α1, α2) = inf{ x > b : ∫_{b}^{x} t^p ( f_α1(t) − f_α2(t) ) dt > |D| }.   (A.9)

Define the sample function

    g_p({X_k}) = (1/n) ‖{X_k}‖_p^p = (1/n) Σ_{k=1}^{n} |X_k|^p.   (A.10)

Choose p > 1 and fix 0 < b < ∞. Suppose α1, α2 ∈ A. Then there exist n_0 and H such that

    E[ g_p(X_α1) | max_k |X_α1,k| = h ] = E[ g_p(X_α2) | max_k |X_α2,k| = h ]   (A.11)

for h > H and n ≥ n_0 only if α1 = α2.

Proof. Suppose α1, α2 ∈ A and that α1 ≠ α2. Suppose further that

    E[ g_p(X_α1) | max_k |X_α1,k| = h ] = E[ g_p(X_α2) | max_k |X_α2,k| = h ]   (A.12)

where

    h ≥ max{ W(α1, α2) : α1, α2 ∈ A }.   (A.13)

Denote the joint pdfs of {X_α1,k} and {X_α2,k} as

    f_α1(x_1, …, x_n) = f_α1(x_1) ⋯ f_α1(x_n)   (A.14)
    f_α2(x_1, …, x_n) = f_α2(x_1) ⋯ f_α2(x_n).   (A.15)

Then

    E[ g_p(X_α1) | max_k |X_α1,k| = h ] = E[ g_p(X_α1) I(max_k |X_α1,k| = h) ] / P[ max_k |X_α1,k| = h ]   (A.16)

since

    E[X | A] = E[X I_A] / P[A]   (A.17)

for a random variable X, where I_A denotes the indicator function of the event A. Thus

    E[ g_p(X_α1) | max_k |X_α1,k| = h ]   (A.18)
    = ∫⋯∫ g_p(x) I(max_k |x_k| = h) f_α1(x) dx_1 ⋯ dx_n / P[ max_k |X_α1,k| = h ]   (A.19)
    = ∫_{−h}^{h} ⋯ ∫_{−h}^{h} Σ_{k=1}^{n} |x_k|^p f_α1(x_1) ⋯ f_α1(x_n) dx_1 ⋯ dx_n / P[ max_k |X_α1,k| = h ]   (A.20)
    = Σ_{k=1}^{n} ∫_{−h}^{h} |x|^p f_α1(x) dx / P[ max_k |X_α1,k| = h ]   (A.21)
    = ( n / P[ max_k |X_α1,k| = h ] ) ∫_{−h}^{h} |x|^p f_α1(x) dx.   (A.22)

Similarly for X_α2:

    E[ g_p(X_α2) | max_k |X_α2,k| = h ]   (A.23)
    = ( n / P[ max_k |X_α2,k| = h ] ) ∫_{−h}^{h} |x|^p f_α2(x) dx.   (A.24)

Suppose α1 < α2. Lemma A.1 below shows that there exists an n_0 such that P[ max_k |X_α1,k| = h ] < P[ max_k |X_α2,k| = h ] for all n ≥ n_0 and for all h ≥ B.
Thus there exists n_0 such that

    E[ g_p(X_α1) | max_k |X_α1,k| = h ] = E[ g_p(X_α2) | max_k |X_α2,k| = h ]   (A.25)

for random sequences with length n ≥ n_0 only if

    ∫_{−h}^{h} |x|^p f_α1(x) dx < ∫_{−h}^{h} |x|^p f_α2(x) dx.   (A.26)

This implies that α1 > α2 since the contrapositive of Lemma A.2 states that ∫_{−l}^{l} |x|^p ( f_α1(x) − f_α2(x) ) dx increases with l. Thus there is a contradiction since α1 < α2 by assumption and h > W(α1, α2) for all α1, α2 ∈ A. So equality (A.11) can hold only if α1 = α2. ∎

The proof of the theorem relies on the following two lemmas to show that E[ g_p(X_α) | max_k |X_α,k| = h ] strictly decreases in α.

Lemma A.1. Suppose X_α1 and X_α2 are two independent sequences of n i.i.d. SαS random variables with unit dispersion (γ = 1), probability density functions f_α1(x) and f_α2(x), and cumulative distribution functions F_α1(x) and F_α2(x). Suppose α1, α2 ∈ A ⊆ T(B) for some 0 < B < ∞ with α1 < α2. Then there exists n_0 < ∞ such that P[ max_k |X_α1,k| = h ] < P[ max_k |X_α2,k| = h ] for all n ≥ n_0 and h ≥ B.

Proof. Expanding the pdf of the maximum absolute value of a sequence of n i.i.d. random variables X = (X_1, …, X_n) gives

    P[ max_k |X_k| = h ] = Σ_{j=1}^{n} P[ |X_j| = h ] Π_{k≠j} P[ |X_k| ≤ h ]   (A.27)
    = Σ_{j=1}^{n} P[ |X_j| = h ] (2F_α(h) − 1)^{n−1}   (A.28)
    = n ( f_α(h) + f_α(−h) ) (2F_α(h) − 1)^{n−1}   (A.29)
    = 2n f_α(h) (2F_α(h) − 1)^{n−1}   (A.30)

since the X_k are i.i.d. and symmetric and since P[ |X_k| ≤ x ] = 2P[ X_k ≤ x ] − 1 = 2F_α(x) − 1 for x ≥ 0. Suppose α1 < α2. Suppose further that h ≥ B. The following claim shows f_α1(h) > f_α2(h).

Claim A.1.

    f_α1(x) > f_α2(x)   (A.31)

for x > B.

Proof. Choose x_0 > B. Suppose the contrary:

    f_α1(x_0) ≤ f_α2(x_0).   (A.32)

Clearly f_α1(x_0) ≠ f_α2(x_0) since x_0 > B and B > max C(α1, α2) means that B is larger than every x such that f_α1(x) = f_α2(x). But then x_0 > B implies that f_α2(x) dominates f_α1(x) for all x ≥ B. The asymptotic tail theorem shows that f_α(x) ∼ x^{−(α+1)} as x → ∞.
But this leads to a contradiction because if α1 < α2 then the asymptotic behavior implies that f_α1(x) dominates f_α2(x) since x^{−(α1+1)} > x^{−(α2+1)} as x → ∞. Therefore

    f_α1(h) > f_α2(h).   (A.33)

∎

Therefore

    R = f_α1(h) / f_α2(h) > 1.   (A.34)

Also F_α1(x) < F_α2(x) since f_α1(x) > f_α2(x) for all x > B and since F_α(x) = 1 − ∫_{x}^{∞} f_α(t) dt. Thus 2F_α1(x) − 1 < 2F_α2(x) − 1. So

    s = (2F_α1(h) − 1) / (2F_α2(h) − 1) < 1.   (A.35)

The ratio of the pdfs for max_k |X_α1,k| and max_k |X_α2,k| is

    P[ max_k |X_α1,k| = h ] / P[ max_k |X_α2,k| = h ]
    = [ 2n f_α1(h) (2F_α1(h) − 1)^{n−1} ] / [ 2n f_α2(h) (2F_α2(h) − 1)^{n−1} ]   (A.36)
    = ( f_α1(h) / f_α2(h) ) [ (2F_α1(h) − 1) / (2F_α2(h) − 1) ]^{n−1}   (A.37)
    = R s^{n−1}.   (A.38)

R s^{n−1} goes to zero as n increases since s < 1 and 1 < R < ∞. So there exists some n_0 such that R s^{n−1} < 1 for all n ≥ n_0 by definition of the limit. Thus P[ max_k |X_α1,k| = h ] < P[ max_k |X_α2,k| = h ] for all sequences with length n ≥ n_0. ∎

Lemma A.2. Let p > 1. Suppose α1, α2 ∈ A ⊆ T(B) and α1 < α2. Then there exists 0 < B < h_0 < ∞ such that

    ∫_{−L}^{L} |x|^p ( f_α1(x) − f_α2(x) ) dx > 0   (A.39)

for all L > h_0 where f_αi(x) is the pdf of the SαS random variable with exponent αi.

Proof.

    ∫_{−L}^{L} |x|^p ( f_α1(x) − f_α2(x) ) dx = 2 ∫_{0}^{L} x^p ( f_α1(x) − f_α2(x) ) dx   (A.40)

since both |x|^p and f_α1(x) − f_α2(x) are symmetric. Splitting the integral gives

    2 ∫_{0}^{L} x^p ( f_α1(x) − f_α2(x) ) dx = 2 ∫_{0}^{B} x^p ( f_α1(x) − f_α2(x) ) dx + 2 ∫_{B}^{L} x^p ( f_α1(x) − f_α2(x) ) dx   (A.41)
    = D + 2 ∫_{B}^{L} x^p ( f_α1(x) − f_α2(x) ) dx   (A.42)

where the left-hand term D is a constant independent of L but with unknown sign. The right-hand term is positive and increasing in L since f_α1(x) − f_α2(x) > 0 for x > B by the earlier claim. Therefore it suffices to show that

    ∫_{B}^{l} x^p ( f_α1(x) − f_α2(x) ) dx > |D|   (A.43)

for some l > B. But

    ∫_{B}^{B} x^p ( f_α1(x) − f_α2(x) ) dx = 0   (A.44)

and

    ∫_{B}^{∞} x^p ( f_α1(x) − f_α2(x) ) dx = ∞   (A.45)

imply that ∫_{B}^{L} x^p ( f_α1(x) − f_α2(x) ) dx increases smoothly from 0 to ∞ with L ∈ [B, ∞). Thus the intermediate value theorem shows that

    ∫_{B}^{l_0} x^p ( f_α1(x) − f_α2(x) ) dx = |D|   (A.46)

for some l_0 ∈ [B, ∞). Therefore

    ∫_{B}^{l} x^p ( f_α1(x) − f_α2(x) ) dx > |D|   (A.47)

for l > l_0. ∎
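Claim A.1 and the intersection set C(α1, α2) of Eq. (A.5) can be checked numerically in the two closed-form SαS cases, the Cauchy (α = 1) and the Gaussian (α = 2) with unit dispersion. The scan below is an illustrative sketch; the helper `largest_crossing` is a hypothetical name, not part of the BEAST algorithm.

```python
import math

def cauchy_pdf(x):
    """SaS density with alpha = 1 and unit dispersion (Lorentzian)."""
    return 1.0 / (math.pi * (1.0 + x * x))

def gauss_pdf(x):
    """SaS density with alpha = 2 and unit dispersion (variance 2)."""
    return math.exp(-x * x / 4.0) / (2.0 * math.sqrt(math.pi))

def largest_crossing(f, g, hi=10.0, step=1e-3):
    """Scan (0, hi] for the largest sign change of f - g: an
    approximation to sup C(alpha_1, alpha_2) in Eq. (A.5)."""
    x, prev, last = step, f(step) - g(step), 0.0
    while x < hi:
        x += step
        cur = f(x) - g(x)
        if prev * cur < 0.0:
            last = x
        prev = cur
    return last
```

Beyond b = largest_crossing(cauchy_pdf, gauss_pdf) (near x ≈ 3), the thicker-tailed Cauchy density stays above the Gaussian density, which is exactly the dominance that Claim A.1 asserts for x > B.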
The BEAST algorithm uses the corollary below to show that the expected value of the τ-map is a bijection between α ∈ (0, 2] and τ(X_α). The algorithm exploits this fact to estimate α with the τ⁻¹-map.

Corollary A.1. Define g_p(X) by (A.10). Define X_α1 and X_α2 as in the α-Stable Map Theorem. Suppose that the conditions hold such that a finite n_0 exists. Suppose further that h < ∞. Then

    τ(X_α) = G(α) = E[ g_p(X_α) | max_k |X_α,k| = h ]   (A.48)

is a bijection from α ∈ (0, 2] onto τ(X_α) ∈ [G(1), G(2)].

Proof. The α-Stable Map Theorem shows that G is injective (1-to-1). G is also continuous for α > 1. Thus the Intermediate Value Theorem shows that G is surjective onto [G(1), G(2)]. Therefore G is a bijection since it is 1-to-1 and onto. So G has an inverse function G⁻¹(τ(X_α)) = α̂. ∎

A.3 The BEAST Algorithm

The Invertibility Corollary guarantees that the test statistic maps to a unique α̂ on average. The BEAST algorithm bootstraps to overcome the sensitivity of the algorithm to outliers caused by the finite sequence. The Invertibility Corollary also establishes that the τ-map is continuous for α ∈ (0, 2]. Thus the bootstrap estimator α̂ is a consistent estimator of α in general [96, 283, 144] because the bootstrap statistic is a consistent estimator of τ(X_α). The BEAST algorithm consists of two stages: (1) it uses randomly generated sequences of SαS observations to preconstruct a map between τ(X_α) and α and (2) it computes τ(X_α) for the observations with unknown α and then uses the τ⁻¹-map on τ(X_α) to estimate α̂. Stage 1 does not depend on the particular unknown random sequence X_α. So the algorithm preconstructs the map. The algorithm constructs τ: α → τ(X_α) by computing the statistic for representative SαS sequences with 0 < α ≤ 2. Stage 2 uses τ(X_α) to characterize the unknown signal X_α. It then maps τ(X_α) to α̂ with τ⁻¹(τ(X_α)). Figure A.3 shows results from the test statistic τ(X_α) computation for α ∈ [0.4, 2]. The brackets show 90% confidence bands for τ(X_α) from 50 independent sequences for each α tested. The blue line shows the median of τ(X_α).
A.3.1 Stage 1: Construct the τ-map

The τ-map takes a finite sequence of i.i.d. SαS random variables to a positive real number:

    τ(X_α): R^N → R^+   (A.49)
    X_α ↦ g(X_α).   (A.50)

The algorithm computes τ(X_α) for a representative set of α values: 0 < α_1 ≤ α_2 ≤ ⋯ ≤ α_M ≤ 2. It then interpolates to find τ(X_α) for α ∈ (0, 2] in general.

Figure A.3: The τ-map takes α to τ(X_α) and its inverse takes τ(X_α) to α̂. Panel (a) covers 1.0 ≤ α ≤ 2.0 and panel (b) covers 0.4 ≤ α ≤ 1.0. The figure shows the median and 90% confidence bands for the τ-map on α ∈ [0.4, 2]. The algorithm uses the inverse map to determine α̂ from a sequence of unknown SαS random observations. The α-Stable Map Theorem shows that the error bars will shrink toward the mean as the sequence length increases. The corollary to the theorem establishes that this map is continuous from α to τ(X_α) and that the inverse function exists. The τ-map also exists for all α ∈ (0, 0.4) but we omit that range because the double-exponential scaling would obscure the figure.

Figure A.4: g_p(X_α) for p = 2. This corresponds to computing the ℓ_2 norm (mean squared value) of the data bootstrapped within the window. Panel (a) covers 0.4 ≤ α ≤ 1.0; panel (b) covers 1.0 ≤ α ≤ 2.0.

Figure A.5: g_p(X_α) for p = 1. This corresponds to computing the ℓ_1 norm (mean absolute value) of the data bootstrapped within the window. The figures suggest that the algorithm becomes more sensitive to outliers as p increases above 1. For this reason we chose p = 1 for testing. The effects of fractional p are a topic of ongoing research.

Figure A.6 shows the value of t_{p,M}(X_α) computed on four different noise sources with different α.
The blue line represents the computed value of t_{p,M}(X_α) on each window. The smooth red line shows the running average of the test statistic computed with a nine-sample window. Figure A.7 repeats the experiment without the window so that the algorithm computes t_{p,M} using all samples up to the present. The figure shows that the algorithm converges within a few thousand samples even for highly impulsive noise (α = 0.5) for large enough windows. Let τ♯(X_α) be the log-log transformed τ(X_α):

    τ♯(X_α) = log log τ(X_α).   (A.51)

Figure A.8 shows that a linear relation approximates the map for α ≥ 0.4. Piecewise-linear corrections for α < 0.4 can eliminate this variation. Least-squares linear regression gives the relation

    α̂ = −0.3969 τ♯(X_α) + 1.1764   (A.52)

(R² = 0.9968). This makes τ(X_α) a double exponential in α̂:

    τ(X_α) = exp[ exp( (1.1764 − α̂) / 0.3969 ) ].   (A.53)

A.3.2 Stage 2: Estimate α from an unknown SαS noise source

Stage 2 estimates α from the sequence of unknown SαS random observations. Algorithm A.1 below specifies the estimation procedure. The process first computes τ(X_α) and then maps τ(X_α) to α̂ by

    α̂ = −0.3969 log log τ(X_α) + 1.1764.   (A.54)

The BEAST algorithm can also use other representations of the τ-map such as a piecewise-linear approximation or a lookup table.

Figure A.6: Calculated value of the test statistic for realizations of i.i.d. SαS random sequences with α = 0.5, 1.0 (Cauchy), 1.5, and 2.0 (Gaussian). Each point corresponds to the test statistic for a specific window W_j. The blue line shows the calculated value and the red line shows a smoothed 9-sample running average.
Each figure shows large fluctuations while the algorithm has access to only a limited number of samples (small t). This quickly settles down and remains around a steady-state value. The signal will continue with about the same deviations as t → ∞ because the window length M is less than the signal length N. The consistency of the bootstrap operation ensures that the computed test statistic will approach the actual value as t → ∞. The BEAST algorithm is the functional composition of the test statistic calculation shown in this figure and the inverse of the map shown in Figure A.3.

Figure A.7: Stage 2 of the BEAST algorithm. (a) The blue line shows τ(X_α) as a function of time for an i.i.d. Cauchy random sequence. The smooth red line shows the mean τ(X_α). Both will converge to the same value as t increases. (b) The BEAST algorithm uses the τ-map in Figure A.3 to compute α̂.

A.4 Experimental Results for α̂

Simulations in this section show the BEAST algorithm applied to four observed SαS sequences with α ∈ [0.2, 2]. Figure A.9 shows the evolution of α̂ as the number of observations increases for SαS signals with α = 2, 1, 0.5, and 0.2. Table A.1 summarizes the performance of the estimator in the trials. The figures show that α̂ is a robust estimator for α ≥ 0.2. We did not simulate the algorithm for α < 0.2 because α-stable random number generators often produce numeric overflows or fail for very low α [47]. The experiments generated i.i.d. SαS samples with the Stable MatLab Toolbox [260].

Table A.1: Performance of the BEAST algorithm (entries are α̂)

    α               N = 1000   N = 5000   N = 10000
    2 (Gaussian)    2.1057     1.9644     1.9618
    1 (Cauchy)      0.9681     0.9756     0.9875
    0.5             0.3887     0.4796     0.4888
    0.2             0.1796     0.1797     0.1921

Table A.1 shows that the algorithm underestimates α in general. The error appears to arise from the linear approximation to the τ-map.
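The fitted inverse map of Eq. (A.54) is simple enough to state as a one-line function. The clamp to (0, 2] and the τ > 1 guard are assumptions about how out-of-range statistics should be handled; the thesis itself only reports the fitted line.

```python
import math

# Regression constants from Eq. (A.52); the fit is reported for alpha >= 0.4.
SLOPE, INTERCEPT = -0.3969, 1.1764

def alpha_from_tau(tau):
    """Invert the tau-map with the fitted log-log line of Eq. (A.54)."""
    if tau <= 1.0:
        raise ValueError("the log-log transform needs tau > 1")
    a = SLOPE * math.log(math.log(tau)) + INTERCEPT
    return min(max(a, 0.0), 2.0)  # clamp to the admissible range (0, 2]
```

Because the slope is negative, larger τ (a more impulsive sample) maps to smaller α̂, matching the strictly decreasing τ-map that Figure A.8 displays.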
τ(X_α) tends toward ∞ as α decreases to 0. This means that the log-log transform will curve upward and so the linear approximation overestimates τ(X_α). Thus the algorithm appears to underestimate α on average since it computes α̂ through the inverse map.

Figure A.8: The BEAST algorithm can use a linear map to estimate α. The blue line shows the log-log transform of the τ-map in Figure A.3. The red line represents the least-squares linear approximation to the transformed τ-map: α̂ = −0.3969 log log τ(X_α) + 1.1764. This linear approximation fails for α < 0.4 because τ(X_α) increases to ∞ as α decreases to 0. The figure shows evidence of this by showing the transformed τ-map bend away from the line for α ≤ 0.4. The algorithm can use other representations of the τ-map such as a piecewise-linear approximation or a lookup table to correct this weakness. The algorithm could also use a hill-climbing technique to find α̂ since the τ-map is strictly decreasing.

Algorithm A.1 The BEAST Algorithm
 1: procedure EstimateAlpha(X_α, R, s)
 2:   τ ← ComputeT(X_α, R, s)
 3:   α̂ ← MapTtoAlpha(τ)
 4:   return α̂
 5: procedure MapTtoAlpha(τ(X_α))
 6:   m ← −0.3969
 7:   b ← 1.1764
 8:   α̂ ← m · log log τ + b
 9:   return α̂
10: procedure ComputeT(X_α, R, s)
11:   N ← Length(X_α)
12:   for k ← 1, R do
13:     S ← SubSample(X_α, s)
14:     V ← 0
15:     for j ← 1, s do
16:       V ← V + |S[j]|
17:     G[k] ← V / N
18:   return Center(G)
19: procedure SubSample(X_α, s)
20:   N ← Length(X_α)
21:   for k ← 1, s do
22:     j ← RandomInteger(1, N)
23:     S[k] ← X[j]
24:   return S

Figure A.10 shows how the algorithm estimates α(t) from a SαS noise source with non-constant α(t). Comparing Figures A.10(a) and A.10(b) also shows that the algorithm is robust to large discontinuities in α(t). The simulation uses the same parameters as in the constant case and thus uses the same map to estimate α. Bootstrapping from a wider sample window works against the algorithm here because the method assumes a fixed α over the window.
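Algorithm A.1 translates into a short runnable sketch. Two details are assumptions on my part: Center() is taken as the median of the bootstrap replicates, and the input must yield τ > 1 so that the log-log map of Eq. (A.54) is defined.

```python
import math
import random

def estimate_alpha(xs, resamples=100, subsample=None, rng=random):
    """Sketch of Algorithm A.1: bootstrap the p = 1 statistic of
    Eq. (A.10) over `resamples` subsamples drawn with replacement,
    center with the median, then apply the log-log line of Eq. (A.54)."""
    n = len(xs)
    s = subsample if subsample is not None else n
    g = []
    for _ in range(resamples):                  # ComputeT / SubSample
        v = sum(abs(xs[rng.randrange(n)]) for _ in range(s))
        g.append(v / n)                         # divide by N as in line 17
    g.sort()
    tau = g[len(g) // 2]                        # Center(G): median (assumed)
    return -0.3969 * math.log(math.log(tau)) + 1.1764  # MapTtoAlpha
```

For impulsive data the bootstrapped τ is large and α̂ is pushed toward 0; for thin-tailed data τ shrinks toward 1 and α̂ rises toward 2, in line with Figure A.8.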
A wider window behaves like a low-pass filter and smooths large changes in α(t). The wider window also delays the estimate by about half the window width. The delay is more apparent when α(t) increases presumably because a few impulsive samples overwhelm the less impulsive earlier samples. Shifting the window forward by M/2 should correct for the first delay. This means that the algorithm would estimate α(t) with samples

    t ∈ { t − N/2, t − N/2 + 1, …, t + N/2 − 1, t + N/2 }.   (A.55)

Figure A.9: α̂ estimated from i.i.d. SαS random observations with α = 2 (Gaussian), 1 (Cauchy), 0.5, and 0.2. The blue line shows α̂ as a function of time. The dotted red line shows the actual α. The figures show rapid convergence of α̂ to α. Table A.1 shows the accuracy of α̂ at t = 1000, 5000, and 10000. The algorithm computes α̂ every 10 samples using the cumulative vector { x_{α,k} : 1 ≤ k ≤ t }. The solid red line shows the average value of α̂ as a function of time: (1/t) Σ_{k=1}^{t} α̂[k].

A.4.1 α̂ compared to other estimators

We found that the small-sample bootstrap estimator α̂ performs as well as or better than four standard estimators. We performed a small-sample (N = 10) experiment to compare

Figure A.10: Estimated value of α for two non-i.i.d. SαS random sequences with time-dependent α(t). Panel (a) shows a continuous α(t); panel (b) shows a discontinuous α(t). The first figure shows that the algorithm can estimate and track a continuously varying α(t).
The sinusoid covers a spectrum of α ∈ [0.7, 2] and shows that the algorithm discriminates and identifies the slow fluctuations. The second figure shows that the algorithm can also estimate and track a highly discontinuous α(t). The window width M plays a crucial role in estimation of non-constant α(t) because the window serves a function similar to a low-pass filter. The filtering spreads the information that individual rare events carry and may disregard some samples with very high information. We hypothesize that an adaptive algorithm could adjust the window width M between computations of t_{p,M}(X_α) to optimize the estimation of α(t). This remains a topic for future research.

the BEAST algorithm to a maximum likelihood estimator α̂_ml [259], a quantile-based estimator α̂_quant [235], an estimator using the empirical characteristic function α̂_ecf [194], and a U-statistic-based estimator α̂_ustat [103]. We generated new i.i.d. SαS samples for each trial. Each of the five estimators used the samples to estimate α. Figure A.11 shows the mean α̂ from 1000 trials for 1 ≤ α ≤ 2. Table A.2 shows α̂ for several α ∈ [1, 2] along with the error (α̂ − α). Figure A.11 and Table A.2 show that the BEAST algorithm performs as well as the quantile and U-statistic-based estimators for α ≤ 1.8. The bootstrap estimator underestimates α for 1.8 ≤ α ≤ 2. This is likely because we used a simple linear approximation for τ(X_α). By replacing the linear map with a piecewise-linear map the BEAST algorithm would better handle this underestimate (Figure A.8).

Figure A.11: Small-sample (N = 10) comparison of α̂_BEAST and four standard α-estimators: α̂_ml, α̂_quant, α̂_ecf, and α̂_ustat. This shows the mean α̂_BEAST from 1000 trials for each 1 ≤ α ≤ 2. It shows that the τ-map estimator performs as well as the quantile and U-statistic-based estimators for α ≤ 1.8. The τ-map underestimates α for 1.8 ≤ α ≤ 2. We could replace the linear map with a piecewise-linear map to better handle this underestimate.
α              α̂_BEAST            α̂_ml              α̂_quant            α̂_ecf             α̂_ustat
1.0 (Cauchy)   0.9474 (-0.0526)   1.0819 (0.0819)   1.0158 (0.0158)    0.9608 (-0.0392)  1.1621 (0.1621)
1.2            1.2012 (0.0012)    1.2555 (0.0555)   1.2639 (0.0639)    1.1959 (-0.0041)  1.4639 (0.2639)
1.4            1.5186 (0.1186)    1.4689 (0.0689)   1.3590 (-0.0410)   1.3289 (-0.0711)  1.5770 (0.1770)
1.5            1.4492 (-0.0508)   1.5733 (0.0733)   1.7276 (0.2276)    1.5589 (0.0589)   1.9000 (0.4000)
1.6            1.4637 (-0.1363)   1.7745 (0.1745)   1.5854 (-0.0146)   1.5468 (-0.0532)  1.9166 (0.3166)
1.8            1.6596 (-0.1404)   1.9999 (0.1999)   1.6444 (-0.1556)   1.8639 (0.0639)   2.0000 (0.2000)
2.0 (Gaussian) 1.7861 (-0.2139)   1.9999 (-0.0001)  1.7037 (-0.2963)   2.0000 (0)        2.0000 (0)

Table A.2: Comparison of the BEAST algorithm with four standard α-estimators (error α̂ − α in parentheses)

We also found that the bootstrap estimator is superior for extremely small sample sets (N ≤ 8). More samples wash away the benefit and all methods converge at about the same rate for N ≥ 20. We performed an experiment similar to the above but instead generated samples from a fixed SαS pdf (α = 1.5) to investigate the effects of sample size on accuracy and convergence rate. Figure A.12 shows the mean α̂ from 1000 trials for 5 ≤ N ≤ 80. The BEAST algorithm produces the most accurate estimates for N ≤ 8. The bootstrap effectively increased the sample pool so that the BEAST algorithm performs reasonably over the entire range in this case. All the estimators converge at about the same rate toward α = 1.5 for N ≥ 20. Table A.3 shows α̂ for several sequence lengths 5 ≤ N ≤ 80 and the error (α̂ − α).

[Figure A.12: A comparison by varying sample size (5 ≤ N ≤ 80) of α̂_BEAST and four standard α-estimators: α̂_ml, α̂_quant, α̂_ecf, and α̂_ustat. α̂_BEAST performs much better than all other estimators for extremely small sample sets (N ≤ 8). The bootstrap estimator effectively augments the sample pool so the algorithm performs reasonably over the entire range. For N ≥ 20 all the estimators converge at about the same rate to α = 1.5.]
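All of these comparisons exercise the same two-stage recipe that the BEAST estimator uses: tabulate a monotone map from a test statistic to α on samples with known α, then invert that map on the unknown data. The sketch below (hypothetical Python, not the thesis code) substitutes a simple stand-in statistic, the fraction of samples within three median absolute deviations of the median, for the thesis' bootstrap statistic:

```python
import numpy as np

def sas(alpha, n, rng):
    """Chambers-Mallows-Stuck draw of symmetric alpha-stable samples."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, n)
    w = rng.exponential(1.0, n)
    if np.isclose(alpha, 1.0):
        return np.tan(u)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def tail_stat(x):
    """Stand-in test statistic: fraction of samples within 3 MADs of the
    median.  It rises with alpha because lighter tails concentrate mass."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.mean(np.abs(x - med) < 3.0 * mad)

def build_alpha_map(alphas, n=2000, trials=200, seed=0):
    """Stage 1: tabulate the mean statistic at each known alpha."""
    rng = np.random.default_rng(seed)
    return np.array([np.mean([tail_stat(sas(a, n, rng)) for _ in range(trials)])
                     for a in alphas])

def estimate_alpha(x, alphas, stat_map):
    """Stage 2: compute the statistic on the unknown data and invert the
    (monotone) map by linear interpolation."""
    return float(np.interp(tail_stat(x), stat_map, alphas))

alphas = np.linspace(0.6, 2.0, 8)       # known-alpha training grid
stat_map = build_alpha_map(alphas)      # monotone statistic-to-alpha table
x = sas(1.5, 5000, np.random.default_rng(42))
alpha_hat = estimate_alpha(x, alphas, stat_map)
```

The linear interpolation here mirrors the linear map approximation discussed above; a piecewise-linear refinement would use a denser α grid near α = 2 where the map flattens.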
N     α̂_BEAST            α̂_ml              α̂_quant            α̂_ecf             α̂_ustat
5     1.5901 (0.0901)    1.9999 (0.4999)   2.0000 (0.5000)    2.0000 (0.5000)   1.1020 (-0.3980)
10    1.6598 (0.1598)    1.6999 (0.1999)   1.5436 (0.0436)    1.8580 (0.3580)   1.3788 (-0.1212)
20    1.6069 (0.1069)    1.5857 (0.0857)   1.3330 (-0.1670)   1.6412 (0.1412)   1.5563 (0.0563)
40    1.5456 (0.0456)    1.4876 (-0.0124)  1.3699 (-0.1301)   1.4906 (-0.0094)  1.4686 (-0.0314)
100   1.5359 (0.0359)    1.5265 (0.0265)   1.4668 (-0.0332)   1.5510 (0.0510)   1.5485 (0.0485)
150   1.4968 (-0.0032)   1.4986 (-0.0014)  1.4857 (-0.0143)   1.4934 (-0.0066)  1.4987 (-0.0013)
200   1.5002 (0.0002)    1.4969 (-0.0031)  1.4829 (-0.0171)   1.5027 (0.0027)   1.4746 (-0.0254)

Table A.3: Comparison of the BEAST algorithm across sample sizes N with α = 1.5 (error α̂ − α in parentheses)

[Figure A.13: Small-sample (N = 10) comparison of α̂_BEAST and four standard α-estimators (α̂_ml, α̂_quant, α̂_ecf, and α̂_ustat) for non-SαS samples. The figure shows the mean α̂ from 1000 trials for each 1 ≤ α ≤ 2 with skewness β = 1.]

A.5 Conclusion

A bootstrap algorithm can estimate the impulsiveness in a sequence of observed SαS random samples. The BEAST algorithm consists of two stages: (1) it constructs an invertible map from the test statistic to α and (2) it computes the test statistic for the unknown SαS observations and then estimates α̂ by inverting the α-map. The α-Stable Estimation Theorem shows that the α-map is 1-to-1 with α. A corollary shows that the α-map is a bijection and so it has an inverse. Simulations show that the algorithm estimates α well if α ≥ 0.2. Simulations also show that the BEAST algorithm accurately estimates α from a sequence of non-symmetric α-stable random variables. Extensions of the BEAST algorithm may well estimate α for non-symmetric α-stable random variables or SαS random variables with unknown dispersion γ. We hypothesize that a modified algorithm could estimate a time-varying α(t) from a sequence of non-i.i.d. SαS random samples. The algorithm should localize changes in α(t) if it bootstraps within a narrow sliding window.
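The narrow sliding-window idea can be sketched directly: apply a batch estimator to a window centered on each time index so that the estimate carries no half-window delay. The window width M, the piecewise scale signal, and the sample standard deviation as the stand-in estimator are all illustrative choices, not the thesis' algorithm:

```python
import numpy as np

def centered_window_estimates(x, M, estimator):
    """Apply `estimator` over a window of width M centered on each index
    (samples t - M/2 .. t + M/2), which removes the half-window delay of
    a trailing window.  Edges without a full window stay NaN."""
    half = M // 2
    out = np.full(len(x), np.nan)
    for t in range(half, len(x) - half):
        out[t] = estimator(x[t - half:t + half + 1])
    return out

# Toy check: track a scale parameter that jumps at t = 500.
rng = np.random.default_rng(0)
scale = np.concatenate([np.ones(500), 3.0 * np.ones(500)])
x = rng.normal(0.0, scale)                   # noise with a piecewise scale
est = centered_window_estimates(x, M=100, estimator=np.std)
```

Swapping `np.std` for a tail-thickness statistic turns the same scaffold into a tracker for α(t): a wide M smooths the track while a narrow M localizes the jump, which is the width tradeoff discussed in Section A.4.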
Adaptive and other algorithms may be able to normalize or center the α-map, or they may be able to compute additional test statistics to estimate the parameters. Future research will also study the accuracy of α̂ as a function of the bootstrap parameters: (1) the resampling size and (2) the number of resampling iterations.

Appendix B

Carbon Nanotubes for Increasing Soil Cation Exchange Capacity

B.0.1 Carbon nanotubes should increase soil CEC

Doping soil with carbon nanotubes should boost the soil's cation exchange capacity (CEC), much as elements such as boron and phosphorus can dope silicon to produce a semiconductor. Nanotubes doped with electron acceptors provide reversible binding locations for low-electronegativity cations in solution. Regions of electron deficiency along the nanotube's length produce the increased cation affinity. Artificial "doping" induces the deficiencies by substituting boron or a similar electron-accepting species into the nanotube's carbon skeleton. Introducing additional cation binding sites using nanotubes should provide a semi-permanent and stable source of cation exchange. Appropriately functionalized carbon nanotubes should similarly affect the anion exchange capacity (AEC).

Boron-doped carbon nanotubes adsorb cations

Carbon nanotubes (CNTs) have significant potential for chemistry because of their reactive surface, their very high conductance, and their large surface-area-to-volume ratio [168, 343, 374, 244, 73, 2, 196, 209]. Single-walled carbon nanotubes (SWCNTs) form with a single carbon shell [185]. The tubes are sheets of rolled graphene that have a width of only a few nanometers and a length of several microns or more [140]. The carbon skeleton rigidly holds the carbon atoms fixed and also constrains π-electrons to the outer shell [110, 140]. A length-to-width ratio on the order of 10³ causes the electronic shell to behave like an unrestricted 1-D quantum system [355, 336, 300].
This enables the electrons to move freely along the tube and results in the high axial conductance observed in nanotubes [349, 252, 75, 363]. The outer shell of negative charge and large surface area also impart nanotubes with high adsorptive character [339]. This creates possible applications for carbon nanotubes in hydrogen and methane fuel storage and in drug delivery [213, 212, 284, 24, 5].

Doping carbon nanotubes with electron acceptors increases their ability to bind cations [211, 87, 162, 369]. Boron replaces carbon atoms over the length of the tube when the tube is exposed to boron oxide at high temperature in an inert atmosphere of NH₃ [99]. Spectroscopic analysis shows that implanting produces homogeneous substitution and can approach 15% boron by weight [30]. The substitution of trivalent boron for tetravalent carbon produces a localized electron deficiency [100] but very little deformity of the carbon nanotube super-structure. An electrochemical study of boron-doped carbon nanofibers demonstrated an 11% increase in the charge capacity of fibers substituted with 2.4% boron (by weight) [99]. The study also observed an increase in the cyclic binding efficiency and in the specific ion capacity of low-electronegativity lithium after boron doping [226].

The increased electrostatic binding of lithium to boron-doped CNTs should apply to similar cations such as Ca²⁺ and Mg²⁺. Table B.1 compares the ionic radii and electronegativities of lithium, calcium, and magnesium. Calcium, magnesium, and lithium share similar properties because they are period 2-4 members of groups 1A and 2A of the periodic table. The divalence of calcium and magnesium offsets the decreased radius of monovalent lithium, and so these ions should have similar Coulombic interactions with negative binding sites. We hypothesize that the lithium affinity of boron-doped carbon nanotubes should equally apply to these other two ions based on their similar electronegativities and ionic radii.
Cation   Electronegativity   Ionic Radius (Å)
Li⁺      1.0                 0.76
Ca²⁺     1.0                 0.99
Mg²⁺     1.2                 0.72

Table B.1: Ionic properties. Ca²⁺ and Mg²⁺ are similar to Li⁺ [324, 378].

Nanotubes should increase the baseline CEC of soil

The cation exchange capacity measures soil fertility [264, 193]. It specifically measures the number of cation binding sites in the soil matrix in units of charge per weight (meq/100 g). Soil is a colloidal suspension of clay, minerals, and organic matter in water [137]. Soil takes part in a wide range of chemistry because of its large surface area, often as high as 800 m²/g [137]. Fixed and transient negative charge in the clay and organic matter creates binding locations for cations [160, 184]. Nutrient deficiencies occur in plants grown in soils with a low cation-buffering capacity (low CEC) [171, 203, 151].

Clays consist of a regular repetition of tetrahedral and octahedral coordinations of oxygen on aluminum and silicon [32, 290]. The coordinated groups join to create layers called platelets that form larger super-structures through ionic bonds [137]. Ionic coordinating-member substitution introduces net negative charge into the clay structure (e.g. ionic substitution of Mg²⁺ into a location coordinated by Si⁴⁺ creates a net −2 charge) [312, 347, 53, 204]. The cation exchange capacity ultimately is a function of the accessibility of the negatively charged binding sites to positively charged ions [137].

Doping soil with nanotubes engineered to participate in cation exchange will increase the number of sites where cations bind. CEC modulation arises naturally in smectite clays such as bentonite and montmorillonite [150, 257, 172, 180]. Variable interlayer spacing permits only cations with a sufficiently small hydrated radius to enter the interlayer gallery and bind with the exchange sites in smectite clays.
Hydrating the clay increases the interlayer spacing so more cations participate in exchange, thus creating a varying CEC that depends on the local humidity [137, 150].

Carbon nanotubes should specifically contribute to the baseline CEC. Some contributions to the exchange capacity depend on environmental conditions such as pH and hydration level [277, 101, 332, 178]. Carboxyl functional groups in soil organic matter liberate a greater fraction of their covalently bound protons with increasing pH [324, 22]. The vacancies leave a residual −1 charge on the group so that it now participates in cation exchange. Organic matter contributions eventually decrease because organic materials degrade during continued exchange and through anaerobic bacterial metabolism in the soil [17, 313, 20, 49, 187, 151]. Fallow fields regenerate some of this organic matter by decomposing plant roots that remain after harvest, but this forces the soil to remain dormant and is therefore not always practiced [301, 115, 258, 342]. An increase in the baseline CEC should create soils that support continuous cultivation and are less sensitive to environmental conditions such as pH.

Suitably modified nanotubes should also increase the anion exchange capacity of soil. Anion exchange prevents leaching of highly soluble anions like NO₃⁻ [98, 357]. The AEC measures the soil's capacity to exchange anions; the anion exchange capacity, however, is generally very low and is not easily modified [161, 179]. Examining the effect of carbon nanotubes on the soil's AEC could yield advances overlooked by current agricultural research.

Experiment: Adding doped nanotubes to soil

Direct CEC measurements of soils should provide evidence of whether nanotubes affect the soil CEC. The Soil Science Society of America recommends the BaCl₂-compulsive exchange method to determine the CEC because it produces a highly repeatable (±10%), precise (±1 meq/100 g), and direct measurement [128, 292, 129].
We will purchase pristine chemical-vapor-deposited (CVD) single-walled carbon nanotubes through a commercial distributor such as MER Corp. In the first research thrust we will examine the effects of boron-doped nanotubes on the net CEC. Borowiak-Palen et al. outline an effective method for doping carbon nanotubes with boron [30]. Boron in the carbon shell of CNTs produces the regions of electron deficiency proposed to increase the non-specific cation exchange. The setup consists of a high-temperature furnace with a tube to deliver an NH₃ carrier gas to a mixture of 5:1 boron oxide and SWCNTs in a reaction crucible. Thin films are prepared on KBr single crystals for optical spectrum analysis. The doping introduces a new absorption peak at 0.4 eV due to the creation of a new valence band after the incorporation of boron. We will use this peak to confirm the presence of boron-doped nanotubes.

We will also push the study of novel carbon nanotube functionalizations that assist cation exchange. More exotic methods of CNT functionalization should exploit the exchange of specific cations and promote the exchange of cationic plant micro-nutrients such as molybdenum and nickel. Schemes that attach novel alkyl ligands through carboxyl functional defects present on the surface of prepared nanotubes show some promise [285]. Fluorinated nanotubes [148] should also provide a possible reactant from which to base the functional synthesis.

The Source Clay Minerals Repository (SCMR) at the University of Missouri (Columbia, MO) supplies a large selection of clay minerals. We will purchase clay samples from SCMR and adhere the nanotubes to the clay to prevent the CNTs from diffusing or leaching and to prevent them from agglomerating [121, 136]. Hydrothermal treatment of carbon nanotubes with Al³⁺-exchanged montmorillonite clay and an aliphatic acid yields side-alkylated carbon nanotube-clay composites as shown in Figure B.1.
The ester-CNT bond that tethers the nanotubes to the clay also reduces the risk of the nanotubes becoming airborne and being respired during the course of the experiment. FT-IR and XRD data will confirm the presence of the composites after thorough rinsing. DTA curves will present further direct evidence of the clay-nanotube composites. CEC controls consisting of untreated clay, clay treated with boron-doped CNTs that are not adhered, and clay conditioned with non-boron-doped nanotubes will provide a measure of the tube effects on the CEC. These measurements will also begin to isolate properties of the CNT-clay mixture that modify the soil's exchange capacity. Spectroscopic analysis will follow on treatments that demonstrate the ability to alter soil CEC to refine the characterizations made above. Performing these assays at multiple pH levels will confirm a CEC benefit from carbon nanotubes over a range of agricultural conditions.

[Figure B.1: Carbon nanotubes are covalently linked through an ester bond to the Al³⁺-exchanged clay substrate under heat [121].]

An increase in the baseline CEC should create soils that support continuous cultivation and are less sensitive to environmental conditions such as pH and hydration level. Novel applications of nanotubes to the modulation of cation exchange of both macro-available and micro-available cations should yield soil conditioners able to compensate for a range of soil deficiencies. Extending the use of the CNT compounds to a clay or polymer support should lead to new uses of modified industrial clays in adsorption and catalysis.

B.0.2 Increased CEC benefits plants

Raising cation exchange capacity with carbon nanotubes should benefit plants by assisting them with cation uptake during conditions of low nutrient availability. Plants face limited nutrient supply during hardship conditions and in non-fallow soil.
Increasing the number of cation binding sites by adding carbon nanotubes should drive the binding equilibria toward cation binding, resulting in lower aqueous cation concentrations. Cations bound to the soil matrix cannot leach from the soil because the exchange sites sequester the ions until ion exchange releases them. This shift allows plant roots to more readily liberate cation nutrients using ion exchange. A series of encased greenhouse experiments will dope carbon nanotubes into soil seeded with romaine lettuce to test this hypothesis. ANOVA and statistical design techniques will quantify the nanotube effects on the lettuce. Confirmation of a positive nanotube effect on CEC should produce soils that increase plant fertility over a wide range of growing conditions.

Low-CEC soil produces nutrient deficiency in plants

Plant nutrition consists of 20 minerals obtained from the soil and air [7, 167, 238]. Plants obtain essential macro-nutrient cations (potassium, calcium, and magnesium) and anions (sulfur and phosphorus) from the soil through ion exchange [57, 276, 58, 13, 170, 59]. Exchange also drives the regulation of micro-nutrient anions and cations [376, 116]. A complete soil analysis measures both the available nutrient content and the total nutrient content in the soil in order to identify critical mineral deficiencies. Critical mineral deficiencies can lead to a range of symptoms including leaf burn, wilting, or bronzing, or to complete crop decimation. Availability of nutrients varies between soils, and so amendments beneficial to one soil may not benefit another.

Introducing nanotubes to increase the number of cation binding sites should reduce cation leaching in soils with low cation exchange capacity. Soils leach unbound nutrients through the water table [318, 88, 328, 90].
Leaching occurs primarily with minerals that are not favored during cation binding, such as Ca²⁺ and Mg²⁺, because they are easily displaced into solution by more labile species such as sodium and potassium [186, 280, 79]. Ca²⁺ and Mg²⁺ leach from soils with a low CEC because they cannot compete for the limited number of exchange sites due to their relatively large radii. Coulomb's law governs exchange through the following rules [251, 250, 171]:

Ion charge: Ions with higher valence exchange for those with lower valence (∝ Z_ion).
Ion radius: Ions with a smaller radius exchange for ions with a larger radius (∝ 1/r²).
Distance: The site must be at or very near a matrix-aqueous interface for an exchange to occur (∝ 1/D²).

CNTs should reduce leaching and promote cation exchange

Plant mineral uptake depends on mass action, diffusion, and interception [183, 371, 253, 195, 169]. Mass action mediates uptake in fertile soils because the nutrient supply is abundant and plant intake cannot deplete the nutrient concentrations surrounding the root hairs. Cations cannot offset the concentration depressions that result from plant nutrient uptake around the roots in nutrient-starved soils [371]. Nutrient-starved soils establish a concentration gradient near the root, and the cation supply to plants is limited by the diffusion rate of individual ions.

Plants face limited nutrient supply during hardship conditions and in non-fallow soil [143, 348, 320]. Nanotube-enhanced soils should provide a more resilient defense against these effects. Cations form electrostatic bonds with negatively charged regions on the surface of clay and organic material. Equation (B.1) illustrates a simplified binding scheme modeling Ca²⁺ binding with a representative cation exchange site S⁻ [45, 31].
In the soil this reaction occurs within a much more complex fabric because the binding activity of competing ion species and the exchange of cations by more labile species augment the competitive interaction.

S⁻ + Ca²⁺(aq) ⇌ S-Ca²⁺    (B.1)

The aqueous cation concentration exists in equilibrium with the bound cation concentration according to Equation (B.2), where [X] denotes the thermodynamic activity of X.

K_eq^bind = [S-Ca²⁺] / ([S⁻][Ca²⁺(aq)])    (B.2)

The concentration of Ca²⁺ should be fixed during low nutrient availability because the diffusion rate limits the rate of replenishment. Carbon nanotubes would increase the total number of cation binding sites to S_T = S⁻ + S_NT⁻, causing a shift in the binding equilibrium to the right. This pushes the equilibrium to favor cation binding and therefore reduces cation leaching.

Plants release protons to invoke ion exchange [36, 207] per Equation (B.3). The equilibrium coefficient K_eq^exch in Equation (B.4) mediates this exchange [329, 156].

S-Ca²⁺ + 2H⁺(aq) ⇌ S-H⁺₂ + Ca²⁺(aq)    (B.3)

K_eq^exch = ([S-H⁺₂][Ca²⁺(aq)]) / ([S-Ca²⁺][H⁺(aq)]²)    (B.4)

Earlier experiments have shown that the rate of soil cation exchange is generally rapid and requires only a few minutes to reach equilibrium [18]. The exchange is a surface reaction and proceeds just as fast as ions are supplied from the solution by diffusion. Malcom showed that more than 75% of the exchange in clays occurs within 3 seconds and that it completes within two minutes [229].

Nanotubes should shift Equation (B.1) to the right, and that should provide an additional benefit to cation exchange in Equation (B.3). Shifting the binding equilibrium to the right increases the number of S-Ca²⁺ species and decreases [Ca²⁺(aq)]. Both of these serve to shift the exchange equation to favor exchange of hydrogen for the cation. Plants directly benefit from this shift because the increased cation exchange capacity makes nutrient cations available during times of scarce availability.
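The claimed equilibrium shift follows from the mass balance behind Equation (B.2). The sketch below solves the binding quadratic for the bound fraction and shows that adding nanotube sites raises it; the constants (K and the site and calcium totals in mM) are illustrative assumptions, not measured values:

```python
import numpy as np

def bound_fraction(K, sites_total, ca_total):
    """Fraction of total Ca2+ bound at equilibrium under Eq. (B.2):
        K = [SCa] / ([S][Ca]),  [S] = sites_total - b,  [Ca] = ca_total - b.
    Mass balance gives K*b^2 - (K*(S_T + C_T) + 1)*b + K*S_T*C_T = 0;
    the smaller root is the physical one (b cannot exceed S_T or C_T)."""
    a = K
    bq = -(K * (sites_total + ca_total) + 1.0)
    c = K * sites_total * ca_total
    b = (-bq - np.sqrt(bq * bq - 4.0 * a * c)) / (2.0 * a)
    return b / ca_total

K = 10.0          # hypothetical binding constant (1/mM)
ca_total = 1.0    # total calcium (mM)
base = bound_fraction(K, sites_total=0.5, ca_total=ca_total)
doped = bound_fraction(K, sites_total=0.8, ca_total=ca_total)  # + nanotube sites
```

Raising the site total from 0.5 to 0.8 mM raises the bound fraction and so lowers [Ca²⁺(aq)], which is the leaching-reduction mechanism the text describes.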
Nanotubes therefore should produce two complementary benefits:

1. Nanotubes should produce a greater number of bound cations that are immune to leaching through the water table.
2. Nanotubes should elicit an increase in cation exchange by plants.

Preliminary Experiment

We recently performed a preliminary small-scale growth experiment that produced inconclusive results. We applied multi-walled CNTs to the soil without modifying the CNTs. We believe that the plants gained minimal benefit (if any) because the hydrophobic multi-walled carbon nanotubes interacted poorly with the aqueous soil. Untreated nanotubes also do not show a significant affinity for charge. Drainage and fertilizing presented further problems. We have designed the proposed CEC experiments to overcome these earlier difficulties (discussion below).

Experiment: Nanotubes should increase plant growth

Modeling simulations will study the flow of cations between free and bound states in the presence of diffusion-limited and abundant cation availability. They will also provide a framework to confirm the binding and exchange rates for nutrient species. We especially need to examine the binding and exchange rates for micro-nutrient ions because if these ions do not bind efficiently they may not adequately compete for binding locations due to their low concentration. The simulations will provide crucial theoretical support for the primary hypothesis. Simulations will precede the growth experiments and will occur in parallel with the soil doping experiment.

We will carry out a series of greenhouse experiments to test the hypothesis that carbon nanotubes increase plant growth. Romaine lettuce will grow in climate-controlled greenhouses following standard agricultural practices. Romaine lettuce is a standard plant model because of its nutrient sensitivity and its fast maturation time of approximately 75 days.
We will first determine whether the addition of the boron-doped nanotubes has any effect on the net CEC measure. Increasing the specific cation capacities of Ca²⁺ and Mg²⁺ will be the next major research thrust. Ca²⁺ and Mg²⁺ provide good mineral candidates to focus on because they are prone to leaching and because their relatively high soil concentration reduces the specificity that the modified nanotubes must exhibit to benefit cation exchange. A third research thrust will focus on the more challenging task of increasing the specific exchange capacities of mineral micro-nutrients and anions.

ANOVA statistical procedures will measure whether a difference exists between lettuce populations grown in soil doped with nanotubes and both positive and negative controls [352, 366]. We will use ANOVA at the conclusion of each greenhouse trial to test the central hypothesis. We will also analyze the data from all of the experimental trials after accounting for the heteroscedasticity of the respective blocks across trials. Blocking of the experimental units will give the data we need to detect a correlation between CNT concentrations and plant growth [222].

Our hypothesis implies that nanotube-treated soil will have more cations available for exchange and that adding vegetation will not reduce the cation availability as much as it will in control soils without nanotubes. We will test this by measuring the soil CEC before and after each run of the experiment. In the initial trials the total quantities of calcium, magnesium, and potassium in the soil will provide detailed information on the benefits that nanotubes confer on the exchange capacities of specific cations. Later trials will measure the total and available micro-nutrients and focus on the cation-specific effects that nanotubes contribute to the CEC.

Fume hoods and greenhouses minimize CNT contact

We have designed the experiment to minimize CNT exposure risks.
The unique chemistry of nanoparticles raises questions about their toxicology. Breathing in airborne nanotubes may pose a health hazard. A recent study [356] found that harmful exposure to respirable-sized carbon nanotubes is extremely rare because of the nanotubes' electrostatic nature and because of their tendency to agglomerate into nanorope structures [356, 233]. Still, we will take all reasonable precautions to limit both our airborne and direct exposure to the nanotubes. Reactions will proceed in a chemical fume hood to minimize exposure to harmful intermediates and airborne nanoparticles. We will tether nanotubes to the soil to prevent CNT aerosol formation. The lettuce will grow in a sealed greenhouse to minimize any possibility of human contact.

B.0.3 Corollary Hypothesis: CNTs should benefit industrial clays

A corollary to the plant-CNT hypothesis is that increased cation exchange capacity should also benefit industrial clays after CNT treatment. Nanometer catalysts display much greater activity and selectivity than traditional catalysts because of their unique structures and surface reactivities. Treated clays could provide a scaffold for functionalized carbon nanotubes engineered to perform a specific catalysis or adsorption. Clay beads covered with functionalized nanotube "fingers" would possess a large catalysis area and could promote prescribed reactions with high efficacy. Increasing the exchange capacity of bentonite clays by doping them with nanotubes should also create suitable candidates for resins to remove heavy metals (Cr(III), Ni(II), Zn(II), Cu(II), Cd(II)) from waste waters because CEC relates significantly to heavy metal adsorption. Demonstrating these effects would advance the field of nano-catalysis and lead to more efficient materials for heavy ion removal and cation adsorption.
Nano-engineered clays should catalyze novel reactions

Results leading to the creation of novel nano-engineered catalysts would advance the field of catalysis. Nanometer catalysts have a much greater activity and selectivity than traditional catalysts because of their unique structures and surface reactivities [41]. Catalytic methods already take advantage of clays in catalysis and surface chemistry by using the inherent negative charge of clay to suspend a reaction's semi-stable intermediate products, which increases the formation of products [219, 215, 279, 29, 350]. Increasing the clay's cation exchange capacity should benefit the catalysis of many reactions. The nature of most catalysis requires a high degree of specificity, however. Clay beads covered with functionalized nanotube "fingers" should possess novel reactivities and could promote reactions with high specificity. The large surface area of carbon nanotubes should also allow a combination of interacting or independent functional groups to further drive catalysis.

CEC increases heavy metal removal

Chromium(III), nickel(II), zinc(II), copper(II), and cadmium(II) cations contaminate runoff from industrial production facilities and power plants [256, 309, 77, 232]. Removal of these toxic metals poses a difficult challenge for adsorption research. Current methods involve precipitation of metal ions by chelation using synthetic resins such as Amberlite IRC-718, Amberlite 200, Duolite GT-73, and carboxymethylcellulose (CMC) [308, 214]. The synthetic resins work very well for a few "target" metallic species but do not perform equally well across all metals [35, 334]. Industrial clays prepared with increased CEC after nanotube doping should in effect sequester heavy metals because heavy metal adsorption relates significantly to CEC [311]. Bentonite clays provide an ideal model for catalytic clays because of their ability to swell in water, which greatly increases their surface area [3, 40, 174].
Doping these clays with cation-adsorbing carbon nanotubes should provide a scaffold that increases the adsorptive capacity. High-concentration ions will swamp CNTs engineered for general cation exchange, however. Removing heavy metals from water requires tuning CNT reactivity to exploit the exchange of specific cations. A CNT shell covered with chelating groups, each specific for a different toxic metal ion, should yield a very effective removal compound. This inspiration for heavy-metal removal draws upon the Troika-acid concept, where a single molecular unit is engineered to provide many arms [177, 42, 176]. Functionalizing nanotubes with cation-specific adsorptive moieties should create effective compounds for metal-ion removal.
Abstract
Noise can improve Markov chain Monte Carlo (MCMC) estimation. This thesis shows that MCMC can exploit noise benefits to improve estimator performance and speed up convergence. Modern computation predicates itself on solutions to the problem: how does one efficiently search a complex high-dimensional space? MCMC proposes a statistical answer by considering the reverse question: assuming the solution, how can one reach it from any starting point? The success of MCMC turns on the thermodynamic principle called detailed balance. Detailed balance allows the chain to act interchangeably in the forward or backward direction. The Metropolis-Hastings algorithm, the Gibbs sampler, and simulated annealing are special cases of random-walk MCMC algorithms. MCMC promises some of the most efficient solutions to NP problems. This thesis presents three major theoretical results. The Noisy Markov Chain Monte Carlo Theorem shows that injected noise can lead to better MCMC sampling and reduce burn-in time. The noise gives the system access to a statistically richer set of otherwise improbable states. The related Markov Chain Noise Benefit Theorem describes a noise benefit criterion that applies generally to all Markov chains. The Noisy Simulated Annealing Theorem shows that noise can also boost MCMC optimization. Simulated annealing (SA) introduces a notion of temperature to constrain estimates within thermally optimal regions of the potential energy surface. This thesis describes noisy versions of the Metropolis-Hastings algorithm and classical simulated annealing. This thesis also presents the noise-boosted quantum annealing algorithm. Quantum annealing (QA) replaces the temperature in classical simulated annealing with probabilistic quantum tunneling. QA uses tunneling to escape local minima by burrowing through high-energy peaks.
The noisy quantum annealing algorithm uses noise to improve ground-state energy calculations on high-dimensional energy surfaces. This thesis closes with a small-sample bootstrap algorithm that can estimate the tail-thickness parameter for alpha-stable bell curves. This thesis shows that symmetric alpha-stable noise can lead to substantial MCMC performance benefits. This suggests that MCMC algorithms could tune the noise tail thickness to further enhance the noise benefit.
Asset Metadata
Creator
Franzke, Brandon (author)
Core Title
Noise benefits in Markov chain Monte Carlo computation
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
11/06/2015
Defense Date
10/21/2015
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
alpha-stable distribution, Bayesian statistics, Markov chain, Markov chain Monte Carlo, MC, MCMC, Monte Carlo random sampling, noise benefits, OAI-PMH Harvest, QA, quantum annealing, SA, simulated annealing, SR, stochastic resonance
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kosko, Bart (committee chair), Ortega, Antonio (committee member), Ross, Sheldon (committee member)
Creator Email
brandon@bfranzke.com,franzke@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-196354
Unique identifier
UC11277203
Identifier
etd-FranzkeBra-4017.pdf (filename),usctheses-c40-196354 (legacy record id)
Legacy Identifier
etd-FranzkeBra-4017.pdf
Dmrecord
196354
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Franzke, Brandon
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA